BIOML Proposal, 19990220
The Biopolymer Markup Language—BIOML
Working Draft Proposal
1. Introduction to bioinformatics
TOC
1.1 Bioinformatics?
1.2 Proteins
1.2.1 What are proteins?
1.2.2 Protein sequence databases
1.2.3 Database redundancy
1.2.4 Annotation
1.3 Genes
1.3.1 What are genes?
1.3.2 Nucleotide sequence databases

1.1 Bioinformatics?
Bioinformatics is a vague, general term used to encompass the use of applied mathematics to understand experimentally determined oligonucleotide and peptide sequences. The rapid increase in the size of the world's collection of these sequences has meant that manual methods are inadequate for extracting information or patterns from this data. Therefore, this sequence data must be interpreted by machines into a form that is compact enough for humans to use effectively. Performing this task requires a sophisticated understanding of computer database technology and some intuition about the biological questions that might be addressed by finding patterns in the data stored in these databases. Because the skill sets required to perform this type of directed analysis are not often found in the same individual, "bioinformatics" is often carried out by teams of biological researchers and computational mathematicians working together to solve a problem. This type of collaborative effort is still in its infancy: it is not a recognized discipline at the majority of universities, although it is a skill that is currently in very high demand in industry.
1.2 Proteins
1.2.1 What are proteins?
Proteins are directly responsible for almost all of the metabolic processes that occur within a cell. Specialized proteins are used for the generation of structural elements in a cell or an entire organism. The elaborate structure of DNA in a cell, now referred to as its genome, is meant to record the structure of an organism’s proteins, both to serve as a template for the construction of copies of those proteins and to pass that crucial information on to the next generation. The proteins encoded by a genome can be organized into larger structures that are used as cellular machinery for performing specific tasks in a cell, such as transcribing DNA into mRNA and translating mRNA into more protein (which requires several discrete protein-based machines), transporting other molecules through membranes, or automatically repairing damage to cellular subsystems. When genetic material is passed from one generation to the next, it is also necessary to pass on a complete working set of protein-based machinery for reading the DNA strands and performing all other necessary steps in cell metabolism.
Individual proteins are composed of discrete polypeptide chains, called subunits, that may be linked together either covalently or by the local structure of the surrounding solvent molecules. The linear sequence of individual amino acid residues in a particular polypeptide chain is referred to as the "primary structure" of the polypeptide. This primary structure is what is directly encoded by an organism’s genome. Also encoded in the genome are the instructions to make protein-based machines for the express purpose of modifying particular residues in other proteins. These modifications occur after a polypeptide has been "translated" from the original nucleotide instructions: therefore they are called "post-translational modifications".
1.2.2 Protein sequence databases
Databases have become the preferred method for storing both polypeptide amino acid sequences and the nucleotide sequences that code for these polypeptides. The databases come in a variety of different types that have advantages and disadvantages when viewed as the starting place for designing experiments and investigations.
While the "database entry" for an amino acid sequence may appear to be a simple text file to a user browsing for a particular polypeptide, many databases are being organized into very flexible, complicated structures. The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database). The organization of the database is of more concern to programmers than it is to the user, and it does not directly affect the usefulness of the database for protein analysis and identification.
Any protein sequence database contains a collection of amino acid sequences, each represented by a string of single-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. The letter codes may be either upper case or lower case, depending on the database interface. These codes may contain non-standard characters to indicate ambiguity at a particular site (such as "B" indicating that the residue may be "D" or "N"). Each sequence has a unique number-letter combination associated with it that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence (more technically referred to as a "key attribute" for the database entry).
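The representation described above can be sketched in a few lines of Python. This is only an illustration, not part of BIOML or of any particular database interface: the function names are hypothetical, and the "Z" and "X" ambiguity codes are common conventions assumed here in addition to the "B" code mentioned in the text.

```python
# The 20 standard single-letter amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Ambiguity codes. "B" (D or N) comes from the text above; "Z" and "X"
# are widely used conventions that are ASSUMED here for illustration.
AMBIGUOUS = {
    "B": ["D", "N"],
    "Z": ["E", "Q"],
    "X": sorted(STANDARD),
}


def normalize(sequence: str) -> str:
    """Strip whitespace, upper-case the codes, and reject unknown characters."""
    cleaned = "".join(sequence.split()).upper()
    for position, code in enumerate(cleaned, start=1):
        if code not in STANDARD and code not in AMBIGUOUS:
            raise ValueError(f"unknown residue code {code!r} at position {position}")
    return cleaned


def ambiguous_sites(sequence: str) -> list:
    """Return (position, code, possible residues) for each ambiguous site.

    Positions are counted from 1 at the N-terminus.
    """
    return [
        (position, code, AMBIGUOUS[code])
        for position, code in enumerate(normalize(sequence), start=1)
        if code in AMBIGUOUS
    ]
```

Because `normalize` upper-cases its input, a lower-case entry from one interface and an upper-case entry from another compare equal after normalization.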
1.2.3 Database redundancy
A sequence database may contain multiple copies of a particular sequence (redundant entries), or it may have been constructed so that there are no multiple copies, i.e., it is non-redundant. The presence of redundant sequences in a database may have several origins. The cause may be a historical accident, such as the entry of the same sequence under more than one "protein" name because of confusion about the function of the protein, or it may be deliberate, for example multiple entries for a polypeptide sequence that has been spliced together after translation. The presence of many copies of the same sequence makes the database larger, resulting in slower searches. It may also result in multiple hits in a protein identification experiment, even though all of the hits are actually the same molecule. Database managers are currently engaged in the process of removing as many redundant entries as they can find from their databases. Some databases exist only to be non-redundant collections of sequence entries from other redundant databases.
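One simple way to build a non-redundant collection is to group entries that share exactly the same sequence. The sketch below assumes entries are plain (accession, sequence) pairs and treats only identical sequences as redundant; the function name and the accession numbers in the usage example are hypothetical, and real databases may apply more elaborate criteria.

```python
def merge_redundant(entries):
    """Collapse entries that share an identical sequence.

    `entries` is an iterable of (accession, sequence) pairs. The result is
    a list of (primary accession, sequence, other accessions) tuples, where
    the primary accession is the first one seen for that sequence.
    """
    grouped = {}
    for accession, sequence in entries:
        key = sequence.upper()  # letter case depends on the interface
        grouped.setdefault(key, []).append(accession)
    # Dictionaries preserve insertion order, so output follows input order.
    return [(accs[0], seq, accs[1:]) for seq, accs in grouped.items()]


# Hypothetical accession numbers, for illustration only.
entries = [
    ("X00001", "MKTAYIAK"),
    ("X00002", "GSHMLEDP"),
    ("X00003", "mktayiak"),  # same sequence as X00001, different case
]
non_redundant = merge_redundant(entries)
```

Keeping the secondary accession numbers alongside the retained entry means that a search hit can still be traced back to every name under which the sequence was originally entered.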
1.2.4 Annotation
Databases may contain a combination of amino acid sequences, comments, literature references and notes on known post-translational modifications to the sequence. A database that contains all of these elements is referred to as "annotated". Other databases only contain the sequence, an accession number and a descriptive title. Annotation of each entry is obviously very time-consuming and difficult to maintain without errors. Therefore, annotated databases usually have many fewer sequence entries than non-annotated ones. Annotation also implies that some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleotide sequence. Even the best annotated databases now include large numbers of entries that have very little real information about the mature protein other than some reference to who sequenced and translated the nucleotide sequence.
Annotated databases are technically superior for many purposes, because they contain information about the true form of the mature protein. The nucleotide sequences that code for many polypeptides contain stretches of sequence that are translated but then removed, either immediately following translation of the pre-protein (signal peptides) or when an inactive pro-protein is activated by the removal of the corresponding pro-peptide. This type of information can only be conveyed through an annotated database.
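To illustrate why this annotation matters, the following sketch derives a mature sequence from a translated precursor and a list of annotated features. The `Feature` class, the feature names, and the example sequence are all hypothetical; they are not taken from any specific database schema.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str   # hypothetical labels: "signal_peptide" or "propeptide"
    start: int  # first residue, counted from 1 at the N-terminus
    end: int    # last residue, inclusive


REMOVED_KINDS = {"signal_peptide", "propeptide"}


def mature_sequence(precursor: str, features: list) -> str:
    """Return the precursor with annotated signal and pro-peptides removed."""
    removed = set()
    for feature in features:
        if feature.kind in REMOVED_KINDS:
            removed.update(range(feature.start, feature.end + 1))
    return "".join(
        residue
        for position, residue in enumerate(precursor, start=1)
        if position not in removed
    )


# Made-up precursor: residues 1-5 are a signal peptide, 6-8 a pro-peptide.
precursor = "MKALLAGSPQRTEV"
features = [Feature("signal_peptide", 1, 5), Feature("propeptide", 6, 8)]
mature = mature_sequence(precursor, features)
```

Without the two feature records, a non-annotated database could only offer the full 14-residue precursor, which is not the molecule that would be observed experimentally.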
Non-annotated databases have the tremendous advantage of being simpler to maintain. This simplicity means that new sequences are incorporated more quickly and the effort necessary to verify all new entries is not required. Several very large amino acid sequence databases (GENPEPT and TREMBL) are simply translations of corresponding large nucleotide databases (GENBANK and EMBL). The number of entries in these translated databases makes them very attractive for many purposes because the chance of finding information about a particular protein is much greater than in the much smaller annotated databases. It is necessary to be much more careful about interpreting the results from searching this type of database because no information about the mature polypeptide is available. Even with this caveat, the large number of polypeptide sequences available in non-annotated databases and the current ascendancy of molecular biology have resulted in their widespread use.
1.3 Genes
1.3.1 What are genes?
Genes are working subunits of deoxyribonucleic acid (DNA). DNA is a vast chemical information database that carries the complete set of instructions for making all the proteins a cell will ever need. Each gene contains a particular set of instructions, usually coding for a particular protein.
DNA exists as two long, paired strands spiraled into the famous double helix. Each strand is made up of millions of chemical building blocks called bases. While there are only four different chemical bases in DNA (adenine, thymine, cytosine, and guanine), the order in which the bases occur determines the information available, much as specific letters of the alphabet combine to form words and sentences.
DNA resides in the core, or nucleus, of each of the body's trillions of cells. Every human cell (with the exception of mature red blood cells, which have no nucleus) contains the same DNA. Each nucleated human cell has 46 molecules of double-stranded DNA. Each molecule is made up of 50 to 250 million bases housed in a chromosome.
The DNA in each chromosome constitutes many genes (as well as vast stretches of noncoding DNA, the function of which is unknown). A gene is any given segment along the DNA that encodes instructions that allow a cell to produce a specific product - typically, a protein such as an enzyme - that initiates one specific action. There are estimated to be between 50,000 and 100,000 human genes, and every gene is made up of thousands of nucleic acid bases.
1.3.2 Nucleotide sequence databases
The majority of information about the primary structure of proteins is currently available in the form of nucleotide sequences that have been determined either from DNA or messenger RNA. DNA and RNA sequences are stored in the form of long sequences of single-letter abbreviations for the individual nucleotides that make up the linear chain of the nucleic acid. By reading these letters three at a time (the cluster of three letters is called a "codon"), it is possible to translate this code into the amino acid sequence that will be created when the coded mRNA is read in the cell by ribosomes (the mRNA decoding devices in all cells that are themselves made up of protein and RNA).
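Reading a sequence three letters at a time can be sketched as follows. The example assumes the standard genetic code and a sequence that is already in the correct reading frame; the function name `translate` and the input sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame DNA or mRNA sequence into single-letter codes.

    Translation starts at the first letter and ends at the first stop codon
    or when fewer than three letters remain.
    """
    sequence = nucleotides.upper().replace("U", "T")  # accept RNA letters
    residues = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


peptide = translate("AUGGCCAAAUAA")  # codons: AUG GCC AAA UAA
```

This is essentially how a translated database such as those mentioned in section 1.2.4 can be derived from a nucleotide database, although locating the correct reading frame in real data takes more work than this sketch shows.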
Sequences encoded in DNA contain long stretches of nucleotides that do not code for any amino acids in the final polypeptide product. Before the region that encodes a polypeptide chain, there are several regions that affect when a region of DNA will be transcribed into mRNA, which can then be read by a ribosome. There is then a "start" signal in the DNA, represented by a "start codon". After the "start" of a region of DNA that does code for a polypeptide, there may be regions that will be edited out of the final mRNA message: these regions are called "introns". The remaining regions that do code for polypeptides are called "exons". The end of the final "exon" in a coding region is marked by what is referred to as the "stop" codon. Following the stop codon there is more DNA, which may have some regulatory effect on the transcription of the DNA into RNA.
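The layout described above (start codon, exons separated by introns, stop codon) can be illustrated with a deliberately simplified sketch. It assumes the exon coordinates are already known from annotation, which is the hard part in practice; the function names, the coordinates, and the short made-up DNA string are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(dna: str, exons: list) -> str:
    """Join the exon regions of `dna`, dropping everything between them.

    `exons` is an ordered list of (start, end) pairs, counted from 1 and
    inclusive, as such coordinates are usually quoted.
    """
    return "".join(dna[start - 1:end] for start, end in exons)


def coding_region(spliced: str) -> str:
    """Return the stretch from the first ATG through the first in-frame stop.

    Returns an empty string if no start codon or no in-frame stop is found.
    """
    start = spliced.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(spliced) - 2, 3):
        if spliced[i:i + 3] in STOP_CODONS:
            return spliced[start:i + 3]
    return ""


# Made-up example: exon 1 (1-8), an intron (9-19), exon 2 (20-27).
dna = "CCATGGCC" + "GTAAGTTTCAG" + "AAATAAGG"
message = splice(dna, [(1, 8), (20, 27)])
cds = coding_region(message)
```

Here the spliced message is `CCATGGCCAAATAAGG` and the coding region is `ATGGCCAAATAA`; the letters before the start codon and after the stop codon correspond to the non-coding flanking regions mentioned in the text.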
This DNA and RNA sequence information is currently stored in a number of very large databases, located around the world. The considerable effort to accumulate all of the nucleotide sequence information about humans, together with the databases that contain it, is called the "Human Genome Project". Many other projects of this type are currently underway. The complete nucleotide sequence information from several whole organisms has been compiled: organisms with completely known nucleotide sequences are referred to as having "known genomes". Much effort is currently being put into determining only the DNA that actually codes for protein by sequencing RNA exclusively. When the complete set of RNA for an organism is known, the organism is referred to as having a "known proteome". The purpose of BIOML is to facilitate the exchange of information about DNA, RNA, polypeptide and protein structure in such a way that researchers can fully express these concepts between machines and between each other without having to develop a new data file format every time they wish to express a new idea about these complicated chemical entities.
