1.1 Bioinformatics? |
Bioinformatics is a vague, general term used to encompass the use of applied
mathematics to understand experimentally determined oligonucleotide and peptide
sequences. The rapid increase in the size of the world's collection of these
sequences has meant that manual methods are inadequate for the extraction of
information or patterns from this data. Therefore, this sequence data must be
interpreted by machines into a form that is compact enough for humans to use it
effectively. Performing this task requires a sophisticated understanding of
computer database technology and some intuition about the biological questions
that might be addressed by finding patterns in the data stored in these
databases. Because the skill sets required to perform this type of directed
analysis are not often found in the same individual, the act of
"bioinformatics" is often carried out in teams of biological researchers and
computational mathematicians working together to solve a problem. This type of
collaborative effort is still in its infancy: it is not a recognized discipline
at the majority of universities, although it is a skill that is currently in
very high demand in industry.
|
1.2 Proteins
What are proteins? |
Proteins are directly responsible for almost all of the metabolic processes
that occur within a cell. Specialized proteins are used for the generation of
structural elements in a cell or an entire organism. The elaborate structure of
DNA in a cell, now referred to as its genome, is meant to record the structure
of an organism’s proteins, both to serve as a template for the construction of
copies of those proteins and to pass that crucial information on to the next
generation. The proteins encoded by a genome can be organized into larger
structures that are used as cellular machinery for performing specific tasks in
a cell, such as transcribing DNA into mRNA and translating that mRNA into more protein (which requires
several discrete protein-based machines), transporting other molecules through
membranes or automatically repairing damage to cellular subsystems. When
genetic material is passed from one generation to the next, it is also
necessary to pass a complete working set of protein-based machinery for the
reading of the DNA strands and performing all other necessary steps in cell
metabolism.
|
|
Individual proteins are composed of discrete polypeptide chains, called
subunits, that may be linked together either covalently or by the local
structure of the surrounding solvent molecules. The linear sequence of
individual amino acid residues in a particular polypeptide chain is referred to
as the "primary structure" of the polypeptide. This primary structure is what
is directly encoded by an organism’s genome. Also encoded in the genome are the
instructions to make protein-based machines for the express purpose of
modifying particular residues in other proteins. These modifications occur
after a polypeptide has been "translated" from the original nucleotide
instructions: therefore they are called "post-translational modifications".
|
1.2.2 Protein sequence databases |
Databases have become the preferred method for storing both polypeptide amino
acid sequences and the nucleotide sequences that code for these polypeptides.
The databases come in a variety of different types that have advantages and
disadvantages when viewed as the starting place for designing experiments and
investigations.
|
|
While the "database entry" for an amino acid sequence may appear to be a simple
text file to a user browsing for a particular polypeptide, many databases are
being organized into very flexible, complicated structures. The detailed
implementation of the database on a particular system may be based on a
collection of simple text files (a "flat-file" database), a collection of
tables (a "relational" database), or it may be organized around concepts that
stem from the idea of a protein, gene, or organism (an "object-oriented"
database). The organization of the database is of more concern to programmers
than it is to the user and it does not directly affect the usefulness of the
database for protein analysis and identification.
|
|
Any protein sequence database contains a collection of amino acid sequences
represented by a string of single letter codes for the residues in a
polypeptide, starting at the N-terminus of the sequence. The letter codes may
be either upper case or lower case, depending on the database interface. These
codes may contain non-standard characters to indicate ambiguity at a particular
site (such as "B" indicating that the residue may be "D" or "N"). All of the
sequences have a unique number-letter combination associated with them that is
used internally by the database to identify the sequence, usually referred to
as the accession number for the sequence (more technically referred to as a
"key attribute" for the database entry).
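
The conventions described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `normalize` and `expand_ambiguity` are hypothetical, and the ambiguity table holds just "B" (taken from the text) and "Z" (the analogous IUPAC code for "E" or "Q", added here as an assumption). A real database interface may accept other codes.

```python
from itertools import product

# The twenty standard single-letter residue codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Ambiguity codes: "B" comes from the text above; "Z" is the analogous
# IUPAC code and is included here as an assumption.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def normalize(sequence):
    """Return the sequence in upper case, rejecting unknown codes."""
    seq = sequence.strip().upper()
    unknown = sorted(set(seq) - STANDARD - set(AMBIGUOUS))
    if unknown:
        raise ValueError(f"unrecognized residue codes: {unknown}")
    return seq


def expand_ambiguity(sequence):
    """List every concrete sequence consistent with the ambiguity codes."""
    choices = [AMBIGUOUS.get(code, code) for code in normalize(sequence)]
    return ["".join(combo) for combo in product(*choices)]
```

For instance, `expand_ambiguity("maB")` yields the two candidates `MAD` and `MAN`, because the lower-case input is first converted to upper case and the "B" is then replaced by each residue it may stand for.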
|
1.2.3 Redundancy |
A sequence database may contain multiple copies of a particular sequence
(redundant entries) or it may have been constructed so that there are no
multiple copies, i.e., it is non-redundant. The presence of redundant sequences
in a database may have several origins. The cause may be a historical accident,
such as the entry of the same sequence under more than one "protein" name
because of confusion about the function of the protein, or it may be
deliberate, for example multiple entries for a polypeptide sequence that has
been spliced together after translation. The presence of many copies of the
same sequence in a database has the effect of making the database larger,
resulting in slower searches. It may also produce multiple hits in a protein
identification experiment, even though all of the hits are actually the same
molecule. Database managers are currently engaged in the process of removing as
many redundant entries as they can find from their databases. Some databases
exist only to be non-redundant collections of sequence entries from other
redundant databases.
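
One simple way to build such a non-redundant collection is to group entries by their exact sequence. The sketch below assumes entries arrive as (accession, sequence) pairs and that two entries are redundant only when their sequences are identical after case normalization; the function name `remove_redundancy` and the accession strings in the usage note are hypothetical. Real database curation also has to deal with near-identical sequences and fragments, which this sketch does not attempt.

```python
def remove_redundancy(entries):
    """Collapse entries that share an identical sequence.

    `entries` is an iterable of (accession, sequence) pairs. The result maps
    each distinct sequence to the list of accessions that carried it, so no
    accession is lost when duplicates are merged.
    """
    merged = {}
    for accession, sequence in entries:
        key = sequence.upper()
        merged.setdefault(key, []).append(accession)
    return merged
```

Called with `[("A0001", "MKV"), ("A0002", "mkv"), ("A0003", "MKL")]`, the function returns two sequences: `MKV`, listed under both `A0001` and `A0002`, and `MKL`, listed under `A0003`. A search then reports one hit per molecule while still naming every source entry.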
|
1.2.4 Annotation |
Databases may contain a combination of amino acid sequences, comments,
literature references and notes on known post-translational modifications to
the sequence. A database that contains all of these elements is referred to as
"annotated". Other databases only contain the sequence, an accession number and
a descriptive title. Annotation of each entry is obviously very time-consuming
and difficult to maintain without errors. Therefore, annotated databases usually
have far fewer sequence entries than non-annotated ones. Annotation also
implies that some functional or structural information is known about the
mature protein, as opposed to a sequence that is known only from the
translation of a stretch of nucleotide sequence. Even the best annotated
databases now include large numbers of entries that have very little real
information about the mature protein other than some reference to who sequenced
and translated the nucleotide sequence.
|
|
Annotated databases are technically superior for many purposes, because they
contain information about the true form of the mature protein. The nucleotide
sequences of many polypeptides code for stretches of sequence that are removed
either immediately following translation of the pre-protein (signal peptides)
or when an inactive pro-protein is activated by the removal of the
corresponding pro-peptide. This type of information can only be conveyed
through an annotated database.
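
The value of this annotation can be shown with a small Python sketch that derives a mature sequence from a translated precursor. It assumes the annotation is available as a list of removed regions given as 1-based, inclusive (start, end) residue positions counted from the N-terminus; that representation, the function name `mature_sequence`, and the example sequence are all hypothetical and are not taken from any particular database format.

```python
def mature_sequence(precursor, removed_regions):
    """Return the precursor sequence with the annotated regions removed.

    `removed_regions` holds (start, end) pairs, 1-based and inclusive,
    such as a signal peptide or a pro-peptide.
    """
    keep = [True] * len(precursor)
    for start, end in removed_regions:
        if not 1 <= start <= end <= len(precursor):
            raise ValueError(f"region {start}-{end} lies outside the sequence")
        for index in range(start - 1, end):
            keep[index] = False
    return "".join(res for res, kept in zip(precursor, keep) if kept)
```

With the made-up precursor `MKALLVAAQRTGSWYK`, a signal peptide annotated at residues 1 to 5, and a pro-peptide at residues 6 to 8, `mature_sequence("MKALLVAAQRTGSWYK", [(1, 5), (6, 8)])` gives `QRTGSWYK`. Without the annotation, only the full precursor would be available.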
|
|
Non-annotated databases have the tremendous advantage of being simpler to
maintain. This simplicity means that new sequences are incorporated more
quickly and the effort necessary to verify all new entries is not required.
Several very large amino acid sequence databases (GENPEPT and TREMBL) are
simply translations of corresponding large nucleotide databases (GENBANK and
EMBL). The number of entries in these translated databases makes them very
attractive for many purposes because the chance of finding information about a
particular protein is much greater than in the much smaller annotated
databases. It is necessary to be much more careful about interpreting the
results from searching this type of database because no information about the
mature polypeptide is available. Even with this caveat, the large number of
polypeptide sequences available in non-annotated databases and the current
ascendancy of molecular biology have resulted in their widespread use.
|
1.3 Genes
What are genes? |
Genes are working subunits of deoxyribonucleic acid (DNA). DNA is a vast
chemical information database that carries the complete set of instructions for
making all the proteins a cell will ever need. Each gene contains a particular
set of instructions, usually coding for a particular protein.
|
|
DNA exists as two long, paired strands spiraled into the famous double helix.
Each strand is made up of millions of chemical building blocks called bases.
While there are only four different chemical bases in DNA (adenine, thymine,
cytosine, and guanine), the order in which the bases occur determines the
information available, much as specific letters of the alphabet combine to form
words and sentences.
|
|
DNA resides in the core, or nucleus, of each of the body's trillions of cells.
Every human cell (with the exception of mature red blood cells, which have no
nucleus) contains the same DNA. Each nucleated human cell has 46 molecules of
double-stranded DNA. Each molecule is made up of 50 to 250 million bases housed
in a chromosome.
|
|
The DNA in each chromosome constitutes many genes (as well as vast stretches of
noncoding DNA, the function of which is unknown). A gene is any given segment
along the DNA that encodes instructions that allow a cell to produce a specific
product - typically, a protein such as an enzyme - that initiates one specific
action. Early estimates put the number of human genes between 50,000 and
100,000; current estimates are closer to 20,000 protein-coding genes. Each gene
is made up of thousands of nucleic acid bases.
|
1.3.2 Nucleic acid sequence databases |
The majority of information about the primary structure of proteins is
currently available in the form of nucleotide sequences that have been
determined either from DNA or messenger RNA. DNA and RNA sequences are stored
in the form of long sequences of single letter abbreviations for the individual
nucleotides that make up the linear chain of the nucleic acid. By reading these
letters three at a time (the cluster of three letters is called a "codon"), it
is possible to translate this code into the amino acid sequence that will be
created when the coded mRNA is read in the cell by ribosomes (the mRNA decoding
devices in all cells that are themselves made up of protein and RNA).
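
Reading a sequence three letters at a time can be sketched in Python as follows. This is a minimal illustration, assuming the standard genetic code, a sequence that is already in the correct reading frame, and only unambiguous bases (any other letter raises a `KeyError`). The names `CODON_TABLE` and `translate` are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and
# third base of each codon; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate a DNA or mRNA sequence, stopping at the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")  # treat RNA as DNA letters
    residues = []
    # Step through complete codons only; trailing bases are ignored.
    for pos in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[pos:pos + 3]]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)
```

For example, `translate("ATGGCCTGA")` reads the codons ATG, GCC and TGA and returns `MA`, since TGA is a stop codon. The same function accepts an mRNA sequence written with "U", because the letters are converted before lookup.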
|
|
Sequences encoded in DNA contain long stretches of nucleotides that do not code
for any amino acids in the final polypeptide product. Before the region that
encodes a polypeptide chain, there are several regions that affect when a
region of DNA will be transcribed into mRNA, which can then be read by a
ribosome. The beginning of the coding region is marked by a "start" signal,
represented by a "start codon". After the start of a region of DNA that does
code for a polypeptide, there may be regions that will be edited out of the
final mRNA message: these regions are called "introns". The remaining regions,
which are retained in the final message, are called "exons". The end of the
coding region, which lies in the final exon, is marked by a "stop" codon.
Following the stop codon there is more DNA that may have some regulatory effect
on the transcription of the DNA into RNA.
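
The relationship between a stretch of DNA, its exons, and the coding message can be sketched in Python. The sketch makes several simplifying assumptions: exon positions are supplied as 1-based, inclusive (start, end) pairs in order along the strand, the first "ATG" in the spliced message is taken as the start codon, and the standard stop codons apply. The function names `splice` and `coding_region` and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(dna, exons):
    """Join the exon regions of `dna`, discarding everything between them."""
    return "".join(dna[start - 1:end] for start, end in exons)


def coding_region(message):
    """Return the span from the first ATG through the first in-frame stop.

    Returns None when no start codon or no in-frame stop codon is found.
    """
    start = message.find("ATG")
    if start < 0:
        return None
    for pos in range(start, len(message) - 2, 3):
        if message[pos:pos + 3] in STOP_CODONS:
            return message[start:pos + 3]
    return None
```

Take the made-up sequence `CCATGGCTGTAAGTTTCAGAAATAAGG` with exons at positions 1 to 8 and 20 to 27. `splice` removes the intron and gives `CCATGGCTAAATAAGG`, and `coding_region` applied to that message returns `ATGGCTAAATAA`: the start codon, two further codons, and the stop codon.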
|
|
This DNA and RNA sequence information is currently stored in a number of very
large databases, located around the world. The considerable effort necessary to
accumulate all of the nucleotide sequence information about humans, together
with the databases that hold it, is called the "Human Genome Project".
Many other projects of this type are currently underway. The complete
nucleotide sequence information from several whole organisms has been compiled:
organisms with completely known nucleotide sequences are referred to as having
"known genomes". Much effort is currently being put into determining only the
DNA that actually codes for protein by sequencing RNA exclusively. When the
complete set of RNA for an organism is known, the organism is referred to as having a
"known proteome". The purpose of BIOML is to facilitate the exchange of
information about DNA, RNA, polypeptide and protein structure in such a way
that researchers can fully express these concepts between machines and between
each other without having to develop a new data file format every time they wish
to express a new idea about these complicated chemical entities.
|