BIOML - Chapter 2. BIOML fundamentals

BIOML Proposal, 19990220

The Biopolymer Markup Language—BIOML
Working Draft Proposal

2. BIOML Fundamentals

2.1 Introduction
2.2 Logical layout–trees, branches and leaves
2.3 Logical layout using nested statements

2.1 Introduction

The BIOpolymer Markup Language is being designed to meet or exceed a number of goals that are critical for the development and acceptance of the language. BIOML must:

be extensible, i.e., it should conform to the XML format;
be a faithful representation of the concept being described (protein/gene);
have the potential to be easily read by humans;
logically connect every element in a clearly expressed statement nesting structure;
include data that is not ASCII and support compression as a basic data type; and
support the conversion of other data files to and from BIOML.

These goals are listed in order of importance. Where two goals conflict, the one higher on the list prevails. The ability to logically connect data to the individual parts of a physical object is the main driving force behind the development of BIOML.

2.2 Logical layout–trees, branches and leaves

The diagram below shows, in graphical form, the relationships among a simple set of objects that can be associated with a "protein" object.

The fundamental object (a protein) is connected to two branch objects (its component pieces, subunit 1 and subunit 2) and one leaf object (its name). The first of the branches (subunit 1) is connected to another branch object (a peptide), which has a number of leaf objects associated with it. The linear nature of peptide and oligonucleotide biopolymers, and the way that information about them has been gathered and organized, make it possible to draw such a graph for almost every conceivable attribute and annotation of the biopolymer. BIOML is being designed to take advantage of this fact.
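The tree described above can be sketched as a small data structure. This is only an illustration, not part of BIOML: the `Node` class and the `leaf_paths` helper are hypothetical names, and the leaf labels ("signal", "propeptide", and the subunit names) are assumptions borrowed from the nested example in section 2.3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One object in the graph: a branch if it has children, otherwise a leaf."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaf_paths(node: Node, prefix: str = "") -> List[str]:
    """Return the path from the root to every leaf, visiting children depth first."""
    path = f"{prefix}/{node.label}" if prefix else node.label
    if node.is_leaf():
        return [path]
    paths: List[str] = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Hypothetical tree: the protein has one leaf (its name) and two branches (its subunits).
protein = Node("protein", [
    Node("name"),
    Node("subunit 1", [
        Node("name"),
        Node("peptide", [Node("signal"), Node("propeptide")]),
    ]),
    Node("subunit 2", [Node("name")]),
])

for path in leaf_paths(protein):
    print(path)
```

Every leaf is reached by exactly one path from the root, which is the property that lets each piece of data be tied to a specific part of the biopolymer.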

2.3 Logical layout using nested statements

The problem of writing down branched structures has been dealt with by computer scientists in a number of ways. The method used in XML is a very straightforward one. Using the example above, the protein is represented by an opening tag, "<protein>", and a closing tag, "</protein>". Everything between those two tags is part of the tree illustrated above. Using this notation, the tree can be rewritten as follows:

<protein>
    <name> ... </name>
    <subunit id="1">
        <name> ... </name>
        <peptide>
            <signal> ... </signal>
            <propeptide> ... </propeptide>
            ...
        </peptide>
    </subunit>
    <subunit id="2">
        <name> ... </name>
        <peptide id="1">
            ...
        </peptide>
        <peptide id="2">
            ...
        </peptide>
    </subunit>
</protein>

All of the relationships between items are the same as in the tree, but this format is very easy to write out using ASCII characters. The ellipsis "..." symbols represent any text that might be enclosed by the start and end tags. In the language of XML, the ideas that are represented by "protein" or "name" are called elements, while the symbols that are used to represent the start and end of the pieces of information that make up the element are called "tags".
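As a rough check that nested tags really do carry the tree structure, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module and prints one line per element, indented by depth. The element text ("example protein", "placeholder") is invented purely for illustration, and `outline` and `is_well_formed` are hypothetical helper names, not part of BIOML.

```python
import xml.etree.ElementTree as ET

# Illustrative document only; the text inside each element is a placeholder.
DOCUMENT = """
<protein>
    <name>example protein</name>
    <subunit id="1">
        <name>example subunit</name>
        <peptide>
            <signal>placeholder</signal>
            <propeptide>placeholder</propeptide>
        </peptide>
    </subunit>
</protein>
"""


def outline(element, depth=0):
    """Return one line per element, indented four spaces per nesting level."""
    lines = ["    " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(text):
    """Report whether the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(DOCUMENT.strip())
print("\n".join(outline(root)))
```

Because every start tag must be matched by its own end tag, a parser rejects text in which the nesting is broken; for example, `is_well_formed("<protein><name>x</protein>")` returns `False`.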
