BIOML Proposal, 19990220
The Biopolymer Markup Language — BIOML
Working Draft Proposal
3. Elements and tags
TOC
3.1 Introduction
3.2 Gene-specific elements
3.2.1 Summary
3.2.2 A simple <gene> example
3.3 Protein-specific elements
3.3.1 Summary
3.3.2 Simple <protein> examples
3.4 General elements and tags
3.4.1 General purpose elements
3.4.2 Organism-identifying elements
3.4.3 Location elements
3.4.4 Literature references
3.4.5 Database reference
3.4.6 URL-based resources
3.4.7 Binary data
3.4.8 Forms
3.4.9 Global attributes and entities

3.3 Protein-specific elements
3.3.1 Summary
In the same way that "gene" is a concept, "protein" is also a concept that existed long before there was any knowledge of its physical structure. A protein was originally conjectured to be the fundamental particle that performed some specific function, such as catalysis in the case of enzymes. Since that original conjecture, the complete structures of a large number of proteins have become known. Proteins are composed of one or more subunits. Each subunit is composed of one or more linear polypeptide molecules, which are polymers of twenty different amino acids (called residues). Within one subunit or polypeptide chain there may be additional bonds between individual cysteine residues, leading to a more complex, non-linear bonding structure. Many amino acids can be modified once they have been incorporated into a polypeptide (a post-translational modification), and the presence of these modifications may have a strong influence on the functionality of the final protein molecule.
The following elements are proposed to express the idea of a protein and its composition of subunits with their component, highly ordered polypeptide chains. The concepts of modification and annotation of peptide-specific structures are also included.
Protein elements
The highest-level elements for a protein are the protein itself and the subunits that make up the protein. Subunits are the polypeptide chains that are assembled (usually non-covalently) to make up the protein. For the purposes of BIOML, a subunit is the set of peptide components of a protein that result from the translation of a single mRNA. More than one peptide component is possible because the peptide translated from an mRNA may be edited by enzymes within the cell. The various elements that result from this editing are described below as domains.

Element     Attributes       Functions
protein     comp             Encloses a protein
subunit     comp             Encloses a subunit
homolog     (none listed)    Another organism with this peptide sequence
Peptide elements
Peptides are the long chains of amino acid residues that make up a protein subunit. Peptide chains can be divided up into regions, called domains, which may have particular structural or functional attributes. Many gene products are made in a slightly longer form than is present in the mature protein. If a domain is removed to turn the gene product into a functional protein, it is called a "propeptide" (signified as <domain type="propeptide">). Many gene products are made with a long (20–30 residue), hydrophobic domain at the N-terminus of the chain, which is removed almost immediately following translation. This type of domain is called a "signal" peptide (<domain type="signal">). Peptide domains that remain in the mature protein are designated by <domain type="mature">. Other types of domains are "alpha-helix", "beta-strand" and "beta-turn", which name the corresponding secondary structures. A special type of domain is also used to designate regions of the peptide for which more than one sequence has been found, signified by <domain type="variable">. The precise locations and assignments of these domains have been determined for only a small selection of proteins: many more have been assigned by analogy to known domain structures.
It is very possible that some domains may overlap. Therefore it is not necessary to enclose the residues in a domain with a domain element's tags. A domain is an element of the enclosing peptide and can be specified using a single <domain/> tag anywhere within that peptide. It is important to note that domains always belong to a peptide: they are not peptides themselves.

Element     Attributes              Functions
peptide     start, end              Encloses a peptide
domain      start, end, id, type    Specifies a generalized peptide domain
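
Because domains are written as single tags inside a peptide rather than wrapped around residues, overlapping domains can be found by comparing their start and end attributes. The following Python fragment is a minimal sketch of that idea, not part of the proposal. It assumes that start and end are inclusive integer residue positions, and the markup, positions and id values in it are invented for illustration rather than taken from a real protein record.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical fragment: positions and ids are invented for illustration.
FRAGMENT = """
<peptide start="1" end="60">
  <domain id="1" type="signal" start="1" end="24"/>
  <domain id="2" type="mature" start="25" end="60"/>
  <domain id="3" type="alpha-helix" start="20" end="32"/>
</peptide>
"""

def overlapping_domains(peptide):
    """Return pairs of domain ids whose residue ranges share a position.

    Assumes start/end are inclusive integer residue positions.
    """
    spans = [(d.get("id"), int(d.get("start")), int(d.get("end")))
             for d in peptide.findall("domain")]
    return [(a[0], b[0]) for a, b in combinations(spans, 2)
            if a[1] <= b[2] and b[1] <= a[2]]

peptide = ET.fromstring(FRAGMENT)
pairs = overlapping_domains(peptide)
```

Here the helix domain overlaps both the signal and the mature domain, which could not be expressed if each domain had to enclose its own residues.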
Amino acid elements
Individual amino acids are the building blocks of proteins. These amino acids can be modified in a variety of ways. They can also be cross-linked to other amino acids, either by disulphide bridges between cysteine residues or by the presence of other cross-linking groups.

Element     Attributes            Functions
aa          type, at, to          An amino acid element
                                  (Note: "to" applies for type="C" only)
amod        at, type, occ         An amino acid modification
alink       at, to, type, occ     A generalized crosslink between two amino acids
avariant    at, type, occ         A possible amino acid variant at a particular site
3.3.2 Simple <protein> examples
A problem that was not addressed above is how to refer to individual subunits, peptides and amino acids, in cases such as representing the composition of a protein or clearly stating cross-links between different peptides in a single subunit. In BIOML this problem is addressed by systematically using the id attributes in subunit and peptide tags. Each new peptide in a subunit is given a sequential numerical id, starting with "1" for the first peptide written for a subunit. Similarly, subunits are given numerical id values, beginning with "1" for the first subunit written for a protein. These numbers are used to refer to those elements.
For example, if a <protein> consists of two copies each of two different subunits, <subunit id="1"> and <subunit id="2">, then the protein's tag can be written as

<protein comp="2xS[1]+2xS[2]">.
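
A comp value of this form can be read mechanically. The Python fragment below is a minimal sketch that infers the grammar only from the single example above (a copy count, "x", a subunit reference, with terms joined by "+"); whether the proposal allows other forms, such as an omitted count, is not stated here, so anything else is rejected. The function name parse_comp is hypothetical.

```python
import re

# One term of a comp value, e.g. "2xS[1]": copy count, "x", then a subunit id.
# This pattern is an assumption based on the single example in the text.
TERM = re.compile(r"(\d+)xS\[(\d+)\]")

def parse_comp(comp):
    """Map subunit id -> copy count for a value such as "2xS[1]+2xS[2]"."""
    counts = {}
    for term in comp.split("+"):
        match = TERM.fullmatch(term.strip())
        if match is None:
            raise ValueError(f"unrecognized comp term: {term!r}")
        copies, subunit_id = int(match.group(1)), int(match.group(2))
        counts[subunit_id] = counts.get(subunit_id, 0) + copies
    return counts

composition = parse_comp("2xS[1]+2xS[2]")
```

For the tag above, composition maps subunit 1 to two copies and subunit 2 to two copies.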

Similarly, a cysteine in <subunit id="1">, in <peptide id="1"> at position 5 that is cross-linked to a cysteine in <subunit id="2">, in <peptide id="2"> at position 20 can be completely described by the tag

<aa type="C" at="S[1]P[1]A[5]" to="S[2]P[2]A[20]"/>
or
<aa type="C" at="5" to="S[2]P[2]A[20]"/>.

Any redundant information can be left out of this type of description. In this example, the specification of at="S[1]P[1]A[5]" is not required: the second tag has the same information because the enclosing peptide and subunit tags will clearly indicate to which subunit and peptide that amino acid belongs. It is also possible to include domains (D[]) in this nomenclature, using the general notation S[...]P[...]D[...].
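
The way omitted parts are supplied by the enclosing tags can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name resolve and its result keys are hypothetical, the order S, P, D, A is assumed from the notation above, and whether D[...] and A[...] may appear together is not specified in the text.

```python
import re

# Optional S[..], P[..], D[..] and A[..] parts, assumed to appear in that order.
REF = re.compile(r"(?:S\[(\d+)\])?(?:P\[(\d+)\])?(?:D\[(\d+)\])?(?:A\[(\d+)\])?")

def resolve(ref, subunit=None, peptide=None):
    """Expand a reference, filling omitted parts from the enclosing tags.

    `subunit` and `peptide` are the ids of the enclosing elements.
    A bare number such as "5" is treated as a residue position.
    """
    if ref.isdigit():
        s = p = d = None
        a = int(ref)
    else:
        match = REF.fullmatch(ref)
        if not ref or match is None:
            raise ValueError(f"unrecognized reference: {ref!r}")
        s, p, d, a = (int(g) if g is not None else None
                      for g in match.groups())
    return {
        "subunit": s if s is not None else subunit,
        "peptide": p if p is not None else peptide,
        "domain": d,
        "residue": a,
    }

short_form = resolve("5", subunit=1, peptide=1)
full_form = resolve("S[1]P[1]A[5]")
partner = resolve("S[2]P[2]A[20]", subunit=1, peptide=1)
```

Inside <subunit id="1"> and <peptide id="1">, the short and full forms resolve to the same residue, while the explicit S[2]P[2] in the "to" value overrides the enclosing context.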
The best way to see how to apply these tags is through real examples. The following example demonstrates a simple BIOML file for human insulin. The file uses a few tags from the next section, but they should be self-explanatory. Try to work through the logic of the example, remembering that everything between the start and end tags for a particular element "belongs to" that element. Also remember that if an element's tags do not enclose anything, it is acceptable to write just the opening tag with "/>" at the end of the tag description.
Example A. Insulin example.
Example B. Insulin single gene product. This example is considerably more complicated, showing the use of multiple, overlapping domains and variant amino acids.
Example C. Insulin gene and gene product. This example includes the information contained in the previous two examples, but it also integrates the gene for insulin into the document.
