BIOML - Chapter 3. Elements and tags

BIOML Proposal, 19990220

The Biopolymer Markup Language—BIOML
Working Draft Proposal

3. Elements and tags
TOC

3.1 Introduction
3.2 Gene-specific elements
	3.2.1 Summary
	3.2.2 A simple <gene> example
3.3 Protein-specific elements
	3.3.1 Summary
	3.3.2 Simple <protein> examples
3.4 General elements and tags
	3.4.1 General purpose elements
	3.4.2 Organism-identifying elements
	3.4.3 Location elements
	3.4.4 Literature references
	3.4.5 Database reference
	3.4.6 URL-based resources
	3.4.7 Binary data
	3.4.8 Forms
	3.4.9 Global attributes and entities

3.1 Introduction

BIOML is an application of XML, so this proposal uses the terminology of XML to describe it. Briefly, XML data is composed of Unicode characters (which include ordinary ASCII characters), "entity references" (informally called "entities") such as "&amp;", which usually represent "extended characters" such as the ampersand, and "elements" such as
<note id="123">Hi</note>.
Elements enclose other XML data called their "content" between a "begin tag" and an "end tag" much like in HTML. There are also "empty elements" such as <file/>, whose begin tag ends with /> to indicate that the element has no content or end tag. The begin tag can contain named parameters called "attributes", such as id="123" in the example above. For further details on XML, consult the XML specification.
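
The terminology above can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The <doc> wrapper element is an assumption, added only so that the example elements can sit in one well-formed document; it is not part of BIOML.

```python
import xml.etree.ElementTree as ET

# <doc> is a hypothetical wrapper; <note> and <file/> are the examples from the text.
root = ET.fromstring('<doc><note id="123">Hi</note><file/></doc>')

note = root.find("note")
empty = root.find("file")

print(note.tag)     # the element name
print(note.attrib)  # the attributes given in the begin tag
print(note.text)    # the content between the begin tag and the end tag
print(empty.text)   # an empty element has no content

# An entity reference is replaced by the character it represents.
ampersand = ET.fromstring("<note>A &amp; B</note>").text
print(ampersand)
```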

Because XML is case-sensitive, BIOML element and attribute names are case-sensitive and all lowercase. Note, however, that all BIOML element and attribute names consist solely of ASCII characters, for which case insensitivity is trivially well-defined, and do not need to be distinguished by case.
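
As a small illustration of why the all-lowercase convention matters, the sketch below (again using Python's standard xml.etree.ElementTree module) shows that an XML parser treats names differing only in case as different names. The capitalized <Gene> element is hypothetical and is not a BIOML element.

```python
import xml.etree.ElementTree as ET

# <gene> and the hypothetical <Gene> are different element names to an XML parser.
root = ET.fromstring("<bioml><gene/><Gene/></bioml>")
lowercase_matches = root.findall("gene")
uppercase_matches = root.findall("Gene")
print(len(lowercase_matches), len(uppercase_matches))

# A begin tag and an end tag that differ only in case do not match each other.
try:
    ET.fromstring("<gene></Gene>")
    mismatched_case_accepted = True
except ET.ParseError:
    mismatched_case_accepted = False
print(mismatched_case_accepted)
```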

In formal discussions of XML markup a distinction is maintained between an element, such as a protein element, and the tags <protein> and </protein> marking it. What is between the <protein> begin tag and the </protein> end tag is the protein element's content. An "empty element" (e.g., file) is defined to have no content and so has a single tag of the form <file/>. Usually, the distinction between elements and tags will not be so finely drawn in this specification. For instance, we will sometimes refer to the <protein> and <file> elements, really meaning the elements that use these tags. Using the tags as references to elements makes them visually distinguishable from references to attributes. However, the words "element" and "tag" themselves will be used strictly in accordance with XML terminology.

3.2 Gene-specific elements
3.2.1 Summary

A "gene" is the concept that led to the modern science of genetics. It was originally thought of as an invisible element of information that was passed from parents to children during reproduction. It is now known that these elements are specific stretches of DNA, organized into larger structures called chromosomes. It is the sequence of nucleotides in linear DNA polymers that defines the polypeptide molecule that will be constructed when that stretch of DNA is transcribed. Adjacent portions of DNA determine when a particular piece of DNA will be transcribed, although the entire mechanism that turns transcription on and off to produce a differentiated cell is not clearly understood. When DNA is transcribed, it is not used to construct a polypeptide directly. Instead, it is used to construct a messenger RNA molecule (mRNA). In prokaryotic organisms (e.g., all bacteria) the mRNA is then transported to the apparatus that will read it to make a specific polypeptide. In eukaryotic organisms (e.g., all animals and plants) the mRNA is frequently edited before it leaves the nucleus, removing the sections that do not code for a polypeptide (the "introns"). An individual nucleotide residue does not code for a particular amino acid in the final polypeptide: the nucleotides are read as triplets, with a redundant code table providing the translation to amino acids.
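
The triplet reading described above can be sketched in a few lines of Python. This is an illustration only: the codon table is deliberately partial (just the codons used here, taken from the standard genetic code), the input sequence is hypothetical, and the DNA is translated directly rather than through an mRNA intermediate.

```python
# Partial codon table: only the codons used below, from the standard genetic code.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GGC": "G",  # glycine
    "GGT": "G",  # glycine again: several triplets map to one amino acid
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna):
    """Read DNA three residues at a time, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

peptide = translate("ATGGGCGGTTAG")  # hypothetical reading frame
print(peptide)
```

The two different glycine codons give the same residue, which is what "redundant" means for the code table.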

The following elements are proposed to express the idea and physical reality of a gene as a highly ordered piece of DNA located in a chromosome. Some additional elements are included to specify the related object, a messenger RNA molecule.

Highest level elements

These elements describe the location of a particular piece of DNA within an organism's complement of chromosomes. The equivalent for prokaryotic organisms is the location on a specific plasmid.

Element	Attributes	Functions
chromosome	number	Encloses a chromosome
sts_domain	start end	Encloses a region of a chromosome delimited by two Sequence Tagged Sites.
locus	start end	Encloses a locus description
clone	—	Encloses a clone description
plasmid	—	Encloses a plasmid description

DNA elements

These elements describe the actual stretch of DNA in the vicinity of a region that encodes a protein, i.e., a gene. Upstream from the gene are regions that regulate its transcription, such as a promoter region.

Element	Attributes	Functions
dna	start end	Encloses an oligonucleotide composed of DNA
promotor	start end	Encloses a promoter
gene	comp	Encloses a gene
exon	start end type	Encloses an exon
intron	start end	Encloses an intron
ddomain	type start end	Encloses a DNA domain
da	type	A deoxyribonucleotide residue
dmod	at type occ	A deoxyribonucleotide modification
dvariant	at type occ	A deoxyribonucleotide variant at a particular site
dstart	at	A start codon
dstop	at	A stop codon

RNA elements

RNA is responsible for moving the code for a protein out of the nucleus and into the endoplasmic reticulum, where it is read by ribosomes. The structure of RNA is simpler than that of the original DNA: a significant amount of editing has already occurred.

Element	Attributes	Functions
rna	start end	Encloses an oligonucleotide composed of RNA
rdomain	type start end	Encloses an RNA domain
ra	type at type	A ribonucleotide residue
rmod	at type occ	A ribonucleotide modification
rvariant	at type occ	A ribonucleotide variant at a particular site
rstart	at	A start codon
rstop	at	A stop codon

3.2.2 A simple <gene> example

A very simple example of a gene, composed of a very short oligonucleotide sequence, follows (it was taken from the sequence of the Drosophila melanogaster gene for ubiquitin).

BIOML	Comments
<bioml>	start BIOML
<gene>	start this gene
  <dna start="1" end="41">	start DNA strand
    GCAGCGACGA CC	a stretch of DNA
    <dstart at="13">ATG	a start codon
    </dstart>	
    TCCGG CGCCACCGAG	a stretch of DNA
    <dend at="30">TAG	end codon
    </dend>	
    TCGGGCT C	a stretch of DNA
  </dna>	end DNA strand
</gene>	end this gene
</bioml>	end BIOML

In this example, the letters "G", "C", "A" and "T" have their normal meanings as individual DNA nucleotides (they can be either lower or upper case). White space (spaces, tabs, carriage returns and linefeed characters) is ignored by the parser, so it can be freely added to aid the flow and readability of the file. The parser should also ignore any character that cannot be a nucleotide residue, allowing the author to include numbers and other symbols that make reading the file easier. The <bioml> tags indicate that the elements they enclose are to be interpreted as BIOML elements.
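
The reading rules in the previous paragraph can be sketched as follows. This is not a BIOML parser, only a minimal illustration using Python's standard xml.etree.ElementTree module; the helper name residues is hypothetical, and the sketch assumes that the only letters counted as residues are A, C, G and T in either case.

```python
import xml.etree.ElementTree as ET

# The <gene> example from this section, reproduced as given.
BIOML_TEXT = """
<bioml>
<gene>
  <dna start="1" end="41">
    GCAGCGACGA CC
    <dstart at="13">ATG
    </dstart>
    TCCGG CGCCACCGAG
    <dend at="30">TAG
    </dend>
    TCGGGCT C
  </dna>
</gene>
</bioml>
"""

def residues(element):
    """Return the nucleotide letters inside an element, upper-cased.

    White space, digits and any other non-nucleotide characters are dropped.
    """
    text = "".join(element.itertext())
    return "".join(ch.upper() for ch in text if ch in "ACGTacgt")

dna = ET.fromstring(BIOML_TEXT).find("gene/dna")
sequence = residues(dna)
print(len(sequence), sequence)
```

The number of residues recovered this way agrees with the start and end attributes of the <dna> element, and the three residues beginning at position 13 are the start codon.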

This example demonstrates BIOML's construction and the flow of logical connection between elements. The <dstart> element's connection to the other elements is determined by following the enclosing elements upward through the nested statements:

<dstart> belongs to
<dna> belongs to
<gene>.

Similarly, the following statements are logically correct:

<dend> belongs to
<dna> belongs to
<gene>.

However, the statement "<dstart> belongs to <dend>" is not true, because the <dend> tags do not enclose the <dstart> tags. Both of these elements belong to the <dna> element, but they are separate from each other.
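
The "belongs to" relationship is simply the chain of enclosing elements, which can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree module on the same example, with the white space removed; the helper name belongs_to is hypothetical.

```python
import xml.etree.ElementTree as ET

BIOML_TEXT = (
    '<bioml><gene><dna start="1" end="41">'
    'GCAGCGACGACC<dstart at="13">ATG</dstart>'
    'TCCGGCGCCACCGAG<dend at="30">TAG</dend>TCGGGCTC'
    '</dna></gene></bioml>'
)

root = ET.fromstring(BIOML_TEXT)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def belongs_to(element):
    """List the names of the enclosing elements, innermost first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain

dstart_chain = belongs_to(root.find(".//dstart"))
dend_chain = belongs_to(root.find(".//dend"))
print(dstart_chain)
print(dend_chain)
```

Both chains pass through <dna> and <gene>, and neither contains the other codon element, which matches the statements above.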

This example is not complete, however. In practice, one would want to add many details; that is the whole point of BIOML. These details would include the identity of the organism, literature references for this gene, notes on the function of the gene, its placement in a chromosome, et cetera. Section 3.4 explains how this can be done.

A further note on the <gene> element

Genes are fundamental to describing how chromosomes work. The current method for locating a gene on a chromosome is to first identify a region of DNA on a chromosome that contains the gene. This large section of DNA is called a "locus". The locus will contain all of the DNA necessary to code for a gene, but it also contains regulatory DNA and other regions of DNA that are of unknown function. The logic that has been chosen to describe a gene within a locus is as follows:

<locus> contains
<gene> contains
<dna> contains
<exon> and <intron> and <ddomain>.

We recommend that the DNA domain elements (<exon/>, <intron/> and <ddomain/>) be used as empty elements. Rather than attempting to enclose the appropriate sections of a complicated locus with begin and end tags, placing the domains at the beginning of a <gene> element makes editing and reading the file much easier. If you need to attach annotation to one of these domains, then enclose the annotation with the appropriate domain begin and end tags. Take a look at the insulin example to see how this type of scheme works.
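
One way to see why empty domain elements are convenient is that the domains become plain coordinate ranges that a program can apply to the sequence. The following Python sketch rests on several assumptions that this section does not settle: the gene, its coordinates and its sequence are invented for illustration, the empty elements are placed at the beginning of the <dna> element inside the gene, and start and end are taken to be inclusive positions in the same numbering as the start attribute of the enclosing <dna> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical gene: coordinates and sequence are invented for illustration.
BIOML_TEXT = """
<gene>
  <dna start="1" end="20">
    <exon start="1" end="6"/>
    <intron start="7" end="14"/>
    <exon start="15" end="20"/>
    ATGGCC GTAAGTAG GGCTAA
  </dna>
</gene>
"""

def residues(element):
    """Return the nucleotide letters inside an element, upper-cased."""
    text = "".join(element.itertext())
    return "".join(ch.upper() for ch in text if ch in "ACGTacgt")

dna = ET.fromstring(BIOML_TEXT).find("dna")
sequence = residues(dna)
offset = int(dna.get("start"))  # position of the first residue (assumed)

# Slice the sequence for every empty domain element, in document order.
domains = []
for child in dna:
    if child.tag in ("exon", "intron", "ddomain"):
        start, end = int(child.get("start")), int(child.get("end"))
        domains.append((child.tag, sequence[start - offset:end - offset + 1]))

# Joining the exons gives the sequence with the intron removed.
spliced = "".join(seq for tag, seq in domains if tag == "exon")
print(domains)
print(spliced)
```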
