3.1 Introduction |
BIOML is an application of XML, so this proposal uses XML terminology to
describe it. Briefly, XML data is composed of Unicode characters (which
include ordinary ASCII characters), "entity references" (informally called
"entities") such as "&amp;", which usually represent "extended characters"
such as the ampersand, and "elements" such as
<note id="123">Hi</note>.
Elements enclose other XML data, called their "content", between a "begin tag"
and an "end tag", much as in HTML. There are also "empty elements" such as <file/>,
whose begin tag ends with /> to indicate that the element has neither
content nor an end tag. The begin tag can contain named parameters called
"attributes", such as id="123" in the example above. For further details
on XML, consult the XML specification.
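As a minimal illustration of these terms, the following Python sketch parses the examples above with the standard library's xml.etree.ElementTree. The wrapping <doc> element and the second <note> with the text "A &amp; B" are hypothetical, added only so that the fragment forms a single well-formed document.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document; only <note id="123">Hi</note> and <file/>
# come from the text above.
SOURCE = (
    '<doc>'
    '<note id="123">Hi</note>'
    '<file/>'
    '<note id="124">A &amp; B</note>'
    '</doc>'
)

root = ET.fromstring(SOURCE)
first_note, empty, second_note = list(root)

print(first_note.tag)     # the element name
print(first_note.attrib)  # attributes taken from the begin tag
print(first_note.text)    # content between the begin tag and the end tag
print(empty.text)         # an empty element has no content
print(second_note.text)   # the entity reference is resolved by the parser
```

Note that the parser hands back the ampersand itself, not the entity reference, so downstream code never sees "&amp;".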
|
|
Because XML is case-sensitive, BIOML element and attribute names are
case-sensitive; all of them are lowercase. Note, however, that all BIOML
element and attribute names consist solely of ASCII characters, for which
case-insensitive comparison is trivially well-defined, and that no two names
need to be distinguished by case alone.
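A minimal sketch of this naming rule in Python follows. The pattern is an assumption inferred from the names used in this proposal (lowercase ASCII letters, digits and underscores, as in sts_domain), and the helper name is_bioml_style_name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a lowercase ASCII letter, then lowercase letters, digits or "_".
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_bioml_style_name(name: str) -> bool:
    """Return True if the name is all-lowercase ASCII (hypothetical helper)."""
    return NAME_PATTERN.fullmatch(name) is not None


# XML itself treats <gene> and <Gene> as two different element names.
lower = ET.fromstring("<gene/>").tag
mixed = ET.fromstring("<Gene/>").tag

print(lower == mixed)
print(is_bioml_style_name(lower), is_bioml_style_name(mixed))
```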
|
|
In formal discussions of XML markup, a distinction is maintained between an
element, such as a protein element, and the tags <protein> and </protein>
marking it. What is between the <protein> begin tag and the </protein>
end tag is the protein element's content. An "empty element" (e.g.,
file) is defined to have no content and so has a single tag of the form <file/>.
Usually, the distinction between elements and tags will not be so finely drawn
in this specification. For instance, we will sometimes refer to the <protein>
and <file> elements, really meaning the elements that use these
tags. Using the tags as references to elements makes them visually
distinguishable from references to attributes. However, the words "element" and
"tag" themselves will be used strictly in accordance with XML terminology.
|
3.2 Gene-specific elements
Summary |
A "gene" is the concept that led to the modern science of genetics. It was
originally thought of as an invisible element of information that was
passed from parents to children during the process of reproduction. It is
now known that these elements are specific stretches of DNA, organized
into larger structures called chromosomes. It is the sequence of
nucleotides in a linear DNA polymer that defines the polypeptide molecule
that will be constructed when that stretch of DNA is expressed.
Adjacent portions of DNA determine when a particular piece of DNA will be
transcribed, although the entire mechanism that turns transcription on and off
to produce a differentiated cell is not clearly understood. When DNA is
transcribed, it is not used to construct a polypeptide directly. Instead, it is
used to construct a messenger RNA molecule (mRNA). In prokaryotic organisms (e.g.,
all bacteria), the mRNA is then transported to the apparatus that will read it to
make a specific polypeptide. In eukaryotic organisms (e.g., all animals
and plants), the mRNA is frequently edited before it leaves the nucleus, removing
sections that do not code for a polypeptide (the "introns"). An
individual nucleotide residue does not code for a particular amino acid in the
final polypeptide: the nucleotides are read as triplets, with a redundant code
table providing the translation to amino acids.
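The triplet reading described above can be sketched in Python. The table below is a deliberately partial excerpt of the standard genetic code, just enough to show its redundancy (several codons map to the same amino acid). The function name translate and the input sequence are illustrative assumptions and are not part of BIOML.

```python
# Partial excerpt of the standard genetic code (DNA codons mapped to
# one-letter amino acid codes). "*" marks a stop codon. A full table
# has 64 entries.
CODON_TABLE = {
    "ATG": "M",                                      # methionine, the usual start
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # four codons, one amino acid
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GAA": "E", "GAG": "E",
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def translate(coding_dna: str) -> str:
    """Read the sequence three residues at a time until a stop codon."""
    residues = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


print(translate("ATGGGTGGCGCAGAGTAA"))
```

Because the table is only an excerpt, a codon outside it raises a KeyError; a complete implementation would carry all 64 codons.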
|
|
The following elements are proposed to express the idea and physical reality of
a gene as a highly ordered piece of DNA located in a chromosome. Some
additional elements are included to specify the related object, a messenger RNA
molecule.
|
Highest level elements |
These elements describe the location of a particular piece of DNA within an
organism's complement of chromosomes. The equivalent for prokaryotic organisms
is the location on a specific plasmid.
Element | Attributes | Functions
chromosome | number | Encloses a chromosome
sts_domain | start, end | Encloses a region of a chromosome delimited by two Sequence Tagged Sites.
locus | start, end | Encloses a locus description
clone | (none) | Encloses a clone description
plasmid | (none) | Encloses a plasmid description
|
DNA elements |
These elements describe the actual stretch of DNA in the vicinity of a region
that encodes a protein, i.e., a gene. Upstream from the gene are regions
that regulate its transcription, such as a promoter region.
Element | Attributes | Functions
dna | start, end | Encloses an oligonucleotide composed of DNA
promotor | start, end | Encloses a promoter
gene | comp | Encloses a gene
exon | start, end, type | Encloses an exon
intron | start, end | Encloses an intron
ddomain | type, start, end | Encloses a DNA domain
da | type | A deoxyribonucleotide residue
dmod | at, type, occ | A deoxyribonucleotide modification
dvariant | at, type, occ | A deoxyribonucleotide variant at a particular site
dstart | at | A start codon
dstop | at | A stop codon
|
RNA elements |
RNA is responsible for moving the code for a protein out of the nucleus and
into the endoplasmic reticulum, where it is read by ribosomes. The structure of
RNA is simpler than that of the original DNA: a significant amount of editing
has already occurred.
Element | Attributes | Functions
rna | start, end | Encloses an oligonucleotide composed of RNA
rdomain | type, start, end | Encloses an RNA domain
ra | type, at | A ribonucleotide residue
rmod | at, type, occ | A ribonucleotide modification
rvariant | at, type, occ | A ribonucleotide variant at a particular site
rstart | at | A start codon
rstop | at | A stop codon
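One way to make the element tables above machine-checkable is to encode them as a lookup from element name to the attribute names listed for it. The Python sketch below does this for a subset of the elements. The function unknown_attributes is a hypothetical helper, and treating the listed attributes as the complete permitted set is an assumption: the tables do not say whether further, general-purpose attributes are allowed. The length attribute in the sample fragment is invented to trigger a report.

```python
import xml.etree.ElementTree as ET

# Subset of the tables above: element name -> attribute names listed for it.
ALLOWED_ATTRIBUTES = {
    "chromosome": {"number"},
    "sts_domain": {"start", "end"},
    "locus": {"start", "end"},
    "dna": {"start", "end"},
    "gene": {"comp"},
    "exon": {"start", "end", "type"},
    "intron": {"start", "end"},
    "dstart": {"at"},
    "dstop": {"at"},
    "rna": {"start", "end"},
}


def unknown_attributes(root: ET.Element) -> dict:
    """Map element names to the attributes that the tables do not list for them."""
    problems = {}
    for node in root.iter():
        allowed = ALLOWED_ATTRIBUTES.get(node.tag)
        if allowed is None:
            continue  # element not covered by this subset
        extra = set(node.attrib) - allowed
        if extra:
            problems.setdefault(node.tag, set()).update(extra)
    return problems


fragment = ET.fromstring(
    '<dna start="1" end="6"><dstart at="1" length="3">ATG</dstart>GGC</dna>'
)
print(unknown_attributes(fragment))
```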
|
3.2.2 A simple <gene> example |
The following is a very simple example of a gene composed of a very short
oligonucleotide sequence (taken from the sequence of the Drosophila
melanogaster gene for ubiquitin).
|
|
Comments |
BIOML |
start BIOML
start this gene
start DNA strand
a stretch of DNA
a start codon
a stretch of DNA
end codon
a stretch of DNA
end DNA strand
end this gene
end BIOML
|
<bioml>
<gene>
<dna start="1" end="41">
GCAGCGACGA CC
<dstart at="13">ATG
</dstart>
TCCGG CGCCACCGAG
<dend at="31">TAG
</dend>
TCGGGCT C
</dna>
</gene>
</bioml>
|
|
|
In this example, the letters "G", "C", "A" and "T" have their normal meanings
as individual DNA nucleotides (they can be either lower or upper case). White
space (spaces, tabs, carriage returns and linefeed characters) is ignored by
the parser, so it can be freely added to aid the flow and readability of the
file. The parser should also ignore any character that cannot be a nucleotide
residue, allowing the author to include numbers and other symbols that make
the file easier to read. The <bioml> tags indicate that the elements
contained within them are to be interpreted as BIOML elements.
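The filtering rule described above can be sketched in Python as follows. The fragment is the first 15 residues of the example, with position numbers added at the start of each line to show that they are skipped. Two assumptions are made here: only the letters A, C, G and T count as residues, and start and end are 1-based and inclusive. The function name residues is hypothetical.

```python
import xml.etree.ElementTree as ET

# First 15 residues of the example above; the leading position numbers
# are illustrative annotations that the filter should drop.
FRAGMENT = """<dna start="1" end="15">
  1 GCAGCGACGA CC
 13 <dstart at="13">ATG
    </dstart>
</dna>"""

NUCLEOTIDES = set("ACGT")  # assumption: no other residue letters are used


def residues(dna_element: ET.Element) -> str:
    """Collect all text inside the element and keep only nucleotide letters."""
    text = "".join(dna_element.itertext())
    return "".join(ch for ch in text.upper() if ch in NUCLEOTIDES)


dna = ET.fromstring(FRAGMENT)
sequence = residues(dna)
expected_length = int(dna.get("end")) - int(dna.get("start")) + 1

print(sequence)
print(len(sequence) == expected_length)
```

Comparing the filtered length with end - start + 1 is a cheap consistency check on the start and end attributes.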
|
|
This example demonstrates how BIOML expresses logical connections between
elements through nesting. The <dstart> element's connection to the other
elements is determined by following its enclosing elements upward
through the nested statements:
|
<dstart> belongs to
<dna> belongs to
<gene>. |
Similarly, the following statements are logically correct:
|
<dend> belongs to
<dna> belongs to
<gene>. |
However, the statement "<dstart> belongs to <dend>"
is not true, because the <dend> tags do not enclose the <dstart>
tags. Both of these elements belong to the <dna> element,
but they are separate from each other.
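The "belongs to" relation can be computed mechanically by walking from an element up through its enclosing elements. The Python sketch below does this for a stripped-down copy of the example (attributes and most of the sequence are omitted). The function name belongs_to is hypothetical.

```python
import xml.etree.ElementTree as ET

# Stripped-down copy of the example: same nesting, no attributes.
BIOML = (
    "<bioml><gene><dna>"
    "<dstart>ATG</dstart>"
    "<dend>TAG</dend>"
    "</dna></gene></bioml>"
)


def belongs_to(root: ET.Element, tag: str) -> list:
    """Return the names of the elements enclosing the first `tag` element,
    innermost first."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = root.find(f".//{tag}")
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node.tag)
    return chain


root = ET.fromstring(BIOML)
print(belongs_to(root, "dstart"))
print("dend" in belongs_to(root, "dstart"))
```

The second line of output is the programmatic form of the statement above: <dend> never appears in the chain of elements enclosing <dstart>.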
|
|
This example is not complete, however. In practice, one would want to add
many details; that is the whole point of BIOML. These details would include the
identity of the organism, literature references for this gene, notes on the
function of the gene, its placement in a chromosome, and so on.
Section 3.4 explains how this can be done.
|
A further note on the <gene> element |
Genes are fundamental to describing how chromosomes work. The current method
for locating a gene on a chromosome is to first identify a region of DNA on a
chromosome that contains the gene. This large section of DNA is called a
"locus". The locus will contain all of the DNA necessary to code for a gene,
but it also contains regulatory DNA and other regions of DNA that are of
unknown function. The logic that has been chosen to describe a gene within a
locus is as follows:
|
<locus> contains
<gene> contains
<dna> contains
<exon> and <intron> and <ddomain>. |
We recommend that the DNA domain elements (<exon/>, <intron/>
and <ddomain/>) be used as empty elements. Rather than
attempting to enclose the appropriate sections of a complicated locus with
start and end tags, placing the domains at the beginning of a <gene>
element makes editing and reading the file much easier. If you need to attach
annotation to one of these domains, then enclose the annotation with the
appropriate domain start and end tags. Take a look at the
insulin example to see how this type of scheme works.
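As a rough sketch of how empty domain elements might be consumed, the Python example below reads the start and end attributes of each domain and slices the corresponding residues out of the <dna> content. Everything in the fragment is hypothetical: the sequence and coordinates are invented, the type attribute is omitted, and the domains are placed as direct children of <gene> ahead of <dna>, which is only one reading of the recommendation. Positions are assumed to be 1-based and inclusive, in the same coordinate system as the <dna> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical gene: the coordinates and the sequence are invented.
GENE = """<gene>
<exon start="1" end="6"/>
<intron start="7" end="12"/>
<exon start="13" end="18"/>
<dna start="1" end="18">ATGGGC GTAAGT GCCTAA</dna>
</gene>"""

DOMAIN_TAGS = ("exon", "intron", "ddomain")


def domain_sequences(gene: ET.Element) -> list:
    """Return (domain name, residues) pairs in document order."""
    dna = gene.find("dna")
    offset = int(dna.get("start"))
    sequence = "".join(ch for ch in dna.text.upper() if ch in "ACGT")
    result = []
    for domain in gene:
        if domain.tag in DOMAIN_TAGS:
            start, end = int(domain.get("start")), int(domain.get("end"))
            # Convert 1-based inclusive positions to a Python slice.
            result.append((domain.tag, sequence[start - offset:end - offset + 1]))
    return result


gene = ET.fromstring(GENE)
for name, residues in domain_sequences(gene):
    print(name, residues)
```

Keeping the domains as empty elements leaves the sequence text in one uninterrupted block, which is the readability benefit the recommendation describes.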