The GPM FAQ

The Global Proteome Machine Organization

Why doesn't the GPM use NCBI nr?
Why are some of the proteins labeled with names and others have only accession numbers?
What does "homologue" mean?
What types of data files can I use?
How are GPM computers arranged and where are they?
How do I save my results?
Why is there a delay between filling in the search form and the search starting?
Are there any limits on searches using the GPM?
Why do I need an SVG plug-in to view spectrum graphs?
Can I donate a computer to be part of the GPM?

See also: X! Tandem FAQ

1. Why doesn't the GPM use NCBI nr?

The GPM is a proteome-oriented tool: its aim is not simply to produce long lists of proteins, but to provide information organized around the concept of genomes interacting with transcriptomes interacting with proteomes.

NCBI's non-redundant list of protein sequences (nr) is not organized around the concept of genomes. It is simply a list of all of the sequences available for all organisms, with an effort made to remove entries that have EXACTLY the same sequence. A difference of as little as one residue results in a new entry. Therefore, there is no unambiguous way to link a proteomic identification made with nr sequences to the genome or transcriptome of the target organism.

Some NCBI sequences are included in the GPM release in a special file which is added to the list of proteins modeled. These sequences correspond to common experimental artifacts in samples, such as trypsin or bovine serum albumin.

2. Why are some of the proteins labeled with names and others have only accession numbers?

The GPM does not hold descriptive information about all of the proteins in a particular list, although it does have their accession numbers. When a protein is found that has not been found before, the user can click on the "protein" link to show all of the information about that particular protein. The GPM then goes out to a good source of information that uses the accession number as a key, and pulls back a set of relevant descriptions and further links about that protein. The machine caches that information for use the next time that protein is found. Therefore, if a particular protein does not have any information other than its accession number, it has never been identified before.

About once an hour the computers that make up the GPM exchange their cached information about proteins, so the knowledge that the protein exists will be shared throughout the system.
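The look-up, cache and exchange behaviour described above can be sketched as follows. This is only an illustration, not the GPM's actual code: the class name `AnnotationCache`, the `fetch` callback and the accession strings are all hypothetical, and the real annotation sources and exchange protocol are not described in this FAQ.

```python
from typing import Callable, Dict


class AnnotationCache:
    """Hypothetical per-machine cache keyed by accession number."""

    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` stands in for the slow trip to an external annotation source.
        self._fetch = fetch
        self._entries: Dict[str, str] = {}

    def describe(self, accession: str) -> str:
        # Only go out to the external source the first time a protein is seen.
        if accession not in self._entries:
            self._entries[accession] = self._fetch(accession)
        return self._entries[accession]

    def known(self, accession: str) -> bool:
        return accession in self._entries

    def merge_from(self, other: "AnnotationCache") -> None:
        # Periodic exchange: copy entries this machine does not have yet.
        for accession, description in other._entries.items():
            self._entries.setdefault(accession, description)


# Usage with a stand-in fetch function that records how often it is called.
calls = []

def fake_fetch(accession: str) -> str:
    calls.append(accession)
    return "description of " + accession

machine_a = AnnotationCache(fake_fetch)
machine_b = AnnotationCache(fake_fetch)
machine_a.describe("ACC0001")   # fetched from the (fake) source
machine_a.describe("ACC0001")   # served from the local cache
machine_b.merge_from(machine_a) # machine_b now knows ACC0001 without fetching
```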

Model and homologue pages have a button at the top of the page that retrieves all of the annotation information about the proteins listed on the page. If the information isn't in the local cache, it may take up to 5 seconds per accession number to retrieve it from the authoritative sources over the web.

3. What does "homologue" mean?

When a protein matches a set of MS/MS results, there are one or more spectra that serve as the evidence for that protein by matching to peptides contained in that protein. It is possible that some of those spectra will match other protein sequences as well, because the other sequences contain very similar (or identical) peptide sequences.

When you click the "homologues" link on the main model page, you are taken to a list of proteins that also use some subset of the spectra used to match the protein listed on the main model page. No effort is made to determine whether these proteins have truly homologous sequences: all that is implied is that each protein matches at least one of the spectra matched by the top scoring protein in the list.

If a protein on a homologue list matches additional spectra not matched to the top scoring protein, that sequence will be shown on the main modeling page as well as the homologues page.
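The rule above amounts to a set intersection on spectra. The following sketch shows that idea under stated assumptions: the protein labels, spectrum numbers and function names are made up for illustration, and the GPM's own implementation may differ.

```python
def homologues(top, matches):
    """Proteins that share at least one spectrum with the top scoring protein.

    `matches` maps a protein label to the set of spectrum ids it matched.
    """
    top_spectra = matches[top]
    return sorted(
        protein
        for protein, spectra in matches.items()
        if protein != top and spectra & top_spectra
    )


def also_on_main_page(top, matches):
    """Homologues that additionally match spectra the top protein does not."""
    top_spectra = matches[top]
    return sorted(
        protein
        for protein in homologues(top, matches)
        if matches[protein] - top_spectra
    )


# Hypothetical example: P1 is the top scoring protein.
example = {
    "P1": {1, 2, 3},
    "P2": {2, 3},   # shares spectra with P1, nothing extra
    "P3": {3, 4},   # shares spectrum 3 and also matches spectrum 4
    "P4": {5},      # shares nothing with P1
}
shared = homologues("P1", example)
extra = also_on_main_page("P1", example)
```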

4. What types of data files can I use?

The GPM is set up to use DTA, PKL or MGF files. These formats are ASCII files that are generated by a mass spectrometer's data handling system.

This is an example of a pkl file that contains the values from more than one spectrum.

415.4407 347.4898 3
52.8570 1.1043
57.8380 1.1043
64.9675 1.1043
70.0623 1.1043
....
....
....

401.7685 318.7188 3
49.9661 1.1043
55.6181 1.1043
73.7716 1.1043
76.2013 1.1043
98.4095 1.1043
....
....
....

The first line has 3 values, each separated by a space. The first value (415.4407 and 401.7685 in the two spectra above) is the parent ion mass. The next value (347.4898 and 318.7188) is the parent ion intensity. The last value (3 and 3) is the parent ion charge. Each line after this, until there is a blank line, contains 2 values, again separated by a space. The value pairs are the daughter ion masses and daughter ion intensities.
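A minimal sketch of a reader for the PKL layout just described is shown below. It assumes exactly the structure given in this FAQ (a three-value header, two-value peak lines, blank lines between spectra); the function name `parse_pkl` and the dictionary keys are illustrative choices, and the "...." elision lines from the example are left out of the sample because they are not real data.

```python
def parse_pkl(text):
    """Parse PKL text into a list of spectra, one dict per spectrum."""
    spectra = []
    current = None  # spectrum being filled, or None between spectra
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line ends the current spectrum.
            current = None
            continue
        if current is None:
            # Header line: parent ion mass, intensity and charge.
            mass, intensity, charge = fields
            current = {
                "parent_mass": float(mass),
                "parent_intensity": float(intensity),
                "parent_charge": int(charge),
                "peaks": [],
            }
            spectra.append(current)
        else:
            # Daughter ion line: mass and intensity.
            mass, intensity = fields
            current["peaks"].append((float(mass), float(intensity)))
    return spectra


pkl_sample = """415.4407 347.4898 3
52.8570 1.1043
57.8380 1.1043

401.7685 318.7188 3
49.9661 1.1043
55.6181 1.1043
"""
pkl_spectra = parse_pkl(pkl_sample)
```

A line with an unexpected number of values raises a `ValueError`, which is a deliberate choice here so that malformed files are noticed rather than silently skipped.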

This is an example of a dta file that contains the values from more than one spectrum.

929.278 2
104.997 2
114.036 2
133.052 2
151.593 2
....
....
....

1003.2 2
108.084 2
123.007 2
126.249 4
142.525 4
....
....
....

The first line has 2 values, separated by a space. The first value (929.278 and 1003.2 in the two spectra above) is the parent ion mass. The next value (2 and 2) is the parent ion charge. Each line after this, until there is a blank line, contains 2 values, again separated by a space. The value pairs are the daughter ion masses and daughter ion intensities.
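For the multi-spectrum DTA layout described above, one possible approach is to split the text on blank lines and treat the first line of each block as the header. This is a sketch under the assumptions stated in this FAQ only; `parse_dta` and the tuple layout it returns are hypothetical names chosen for the example.

```python
import re


def parse_dta(text):
    """Return a list of (parent_mass, parent_charge, peaks) tuples."""
    spectra = []
    # Spectra are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        if not rows:
            continue
        # Header line: parent ion mass and parent ion charge.
        parent_mass, parent_charge = rows[0]
        # Remaining lines: daughter ion mass and intensity.
        peaks = [(float(mass), float(intensity)) for mass, intensity in rows[1:]]
        spectra.append((float(parent_mass), int(parent_charge), peaks))
    return spectra


dta_sample = """929.278 2
104.997 2
114.036 2

1003.2 2
108.084 2
126.249 4
"""
dta_spectra = parse_dta(dta_sample)
```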

This is an example of an MGF file that contains the values from more than one spectrum.

BEGIN IONS
PEPMASS=820.998855732003
CHARGE=1+
TITLE=Elution from: 0.14 to 0.14   period: 0   experiment: 2 cycles:  1
200.9942 2.3857
354.9856 2.3857
370.9314 5.1571
388.9714 9.6857
390.9608 2.7429
END IONS

BEGIN IONS
PEPMASS=691.910270874147
CHARGE=2+
TITLE=Elution from: 0.03 to 0.03   period: 0   experiment: 1 cycles:  1
264.8982 30.0286
264.9944 8.9429
435.8989 3.2857
442.9097 4.2571
478.9086 3.6571
END IONS

Each spectrum is contained within a set of BEGIN IONS and END IONS tags. The value following PEPMASS= is the parent ion mass. The value following CHARGE= is the parent ion charge. The lines after the TITLE entry are pairs of daughter ion masses and intensities separated by a space.
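The MGF structure described above can be read with a small state machine that tracks whether the current line is inside a BEGIN IONS / END IONS pair. The sketch below handles only the elements shown in this FAQ (KEY=value lines and mass/intensity pairs); `parse_mgf` and its dictionary layout are illustrative assumptions, and parameter values are kept as the strings found in the file.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {"params": {...}, "peaks": [...]} dicts."""
    spectra = []
    current = None  # spectrum being filled, or None outside BEGIN/END IONS
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None or not line:
            # Ignore blank lines and anything outside a spectrum block.
            continue
        elif "=" in line and not line[0].isdigit():
            # Parameter line such as PEPMASS=..., CHARGE=... or TITLE=...
            key, value = line.split("=", 1)
            current["params"][key] = value
        else:
            # Daughter ion line: mass and intensity.
            mass, intensity = line.split()[:2]
            current["peaks"].append((float(mass), float(intensity)))
    return spectra


mgf_sample = """BEGIN IONS
PEPMASS=820.998855732003
CHARGE=1+
TITLE=Elution from: 0.14 to 0.14   period: 0   experiment: 2 cycles:  1
200.9942 2.3857
354.9856 2.3857
END IONS

BEGIN IONS
PEPMASS=691.910270874147
CHARGE=2+
TITLE=Elution from: 0.03 to 0.03   period: 0   experiment: 1 cycles:  1
264.8982 30.0286
264.9944 8.9429
END IONS
"""
mgf_spectra = parse_mgf(mgf_sample)
```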

5. How are GPM computers arranged and where are they?

The GPM is constructed from a set of computers that have the following properties:

they have the GPM installed on them;
they are all registered to have the same name (h.thegpm.org); and
they are each registered under their own unique name (e.g. h112.thegpm.org).

Using a technique called round-robin DNS (domain name system), when you go to h.thegpm.org, you are taken to one of the equivalent computers, pretty much at random. You will then stay at that computer until you close your browser. If you start your browser again, you may be taken back to the same site, or a different one, depending on how long your local domain name servers keep a record of your first trip.
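The effect of sharing one name between several machines can be illustrated with a toy resolver. This is a simulation only and makes several assumptions: the class `RoundRobinResolver` is hypothetical, the addresses come from the reserved documentation range 192.0.2.0/24 rather than any real GPM machine, and real DNS servers and caches may order or reuse answers differently.

```python
from collections import deque


class RoundRobinResolver:
    """Toy name server that rotates its address list on every query."""

    def __init__(self, records):
        # One rotating queue of addresses per shared name.
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)
        # Move the first address to the back so the next query sees a new order.
        addrs.rotate(-1)
        return answer


resolver = RoundRobinResolver(
    {"h.thegpm.org": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)

# A client typically connects to the first address in the answer it receives.
visits = [resolver.resolve("h.thegpm.org")[0] for _ in range(4)]
```

Successive queries in this sketch land on different machines in turn, while a client that keeps a cached answer (as a browser or local name server does) keeps returning to the same one.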

The computers themselves are scattered around in a variety of locations and are connected to the Internet in a variety of ways. There is no one particular type of computer used: anything that can be used as an HTTP server should have enough computational power to be used by the GPM.

6. How do I save my results?

Modeling results are stored for about a week on GPM computers and then discarded. To save your results for archiving or to give to collaborators, there is a link on the top of each report page that allows you to download the "XML data file". All of the data and interpretation for a spectrum modeling run are stored in this file. You can save this file to your own computer (use the "File->Save as" menu item on your browser).

Once you have the results file stored on your computer, you can upload it back to a GPM computer for viewing. There is a link on the main search page, "View saved XML data", which allows you to upload the results file.

7. Why is there a delay between filling in the search form and the search starting?

Once you press the button to start the search, the MS/MS data file that you have selected must be sent to the GPM. If the file is large, then it may take a while for that data transmission to occur. Because the GPM is much faster than conventional software, waiting for the data transmission to occur is often the longest part of the search.

8. Are there any limits on searches using the GPM?

There are no limits on the number of spectra that can be used, or on the number of times that an individual or institution may use the machine.

9. Why do I need an SVG plug-in to view spectrum graphs?

Scalable Vector Graphics (SVG) is a relatively new, open standard for drawing complicated objects on web pages. The free plug-in supplied by Adobe allows zooming and panning, making SVG much more capable than simple bitmapped graphics, such as GIF or JPEG pictures. Choosing SVG as the standard representation for graphics in the GPM will make it possible to produce much more sophisticated graphics in the future.

10. Can I donate a computer to be part of the GPM?

All of the computers in the GPM and their Internet connections have been donated. To donate a mirror site, simply contact us and we will take you through the registration and testing procedure necessary to bring up your own mirror computer on the network.