The Global Proteome Machine Organization

  The Quartz Projects

The Quartz Project is an effort to create collections of annotated MS/MS data files that bioinformatics groups can use to test and validate algorithms for peptide modelling and identification. It consists of two types of collections:

  1. GPMDB-based (e.g., jasper); and
  2. project-based (e.g., amethyst).

GPMDB-based

  1. jasper

These projects contain information gathered directly from the public data available in the GPMDB. The purpose of these data sets is to give bioinformatics researchers direct access to mass spectra that have been assigned to peptide sequences. The data are made available as XML files that contain the peptide assignment, the spectra, and the information required to tie a particular entry in a file to the corresponding entry in the GPMDB.

Project-based

  1. amethyst
  2. opal

Each of the project-based collections will be posted both in standard X! Tandem annotation format (XML) and as individual DTA-formatted spectra, one for each entry in the XML collection.
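For readers who want to load the individual DTA spectra, the following is a minimal sketch of a reader. It assumes the common DTA layout, which this page does not itself spell out: a first line holding the singly protonated precursor mass and the charge, followed by one "m/z intensity" pair per line. The names DtaSpectrum and parse_dta, and the example values, are hypothetical and used only for illustration.

```python
from typing import List, NamedTuple, Tuple


class DtaSpectrum(NamedTuple):
    precursor_mh: float               # (M+H)+ mass from the header line
    charge: int                       # precursor charge from the header line
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


def parse_dta(text: str) -> DtaSpectrum:
    """Parse the text of one DTA file (assumed layout, see lead-in)."""
    # Split every non-blank line into whitespace-separated fields.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 2:
        raise ValueError("header line must hold precursor mass and charge")
    precursor_mh = float(rows[0][0])
    charge = int(float(rows[0][1]))   # tolerate a charge written as "2.0"
    # Every remaining line is read as one fragment peak.
    peaks = [(float(r[0]), float(r[1])) for r in rows[1:] if len(r) >= 2]
    return DtaSpectrum(precursor_mh, charge, peaks)


# Made-up example content, not taken from any Quartz collection.
example = """1045.5642 2
175.1190 120.0
276.1666 85.5
"""
spectrum = parse_dta(example)
print(spectrum.charge, len(spectrum.peaks))
```

Parsing from a string rather than a path keeps the sketch independent of how the numbered DTA files are named on disk.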

Each of these collections contains four XML files and a directory of numbered DTA files. The XML file names are composed of the name of the collection and a suffix:

File suffix - Description
-gv - "Genome Valid" - This file contains the list of peptides found by searching all of the available spectra against the translation of the human genome (34c NCBI 34). All of the assignments have expectation values e < 0.001, as calculated by X! Tandem. In cases where more than one protein contained the matching peptide, only the best protein sequence assignment is included with each peptide.
-ngv - "Non-Genome Valid" - This file contains the list of peptides obtained by searching the "stochastic" spectra found in the "-gs" file against the Human Invitational Database (HIT) and the International Protein Index Database (IPI). All of the assignments have expectation values e < 0.001, as calculated by X! Tandem. If the peptide was found in both databases, the entry from HIT was recorded. The peptides were checked against the "-gs" file, and any peptide whose sequence was also assigned in that file with an expectation value near the cut-off value of 0.001 was not included in this file.
-c - "Complete" - This file contains all of the spectra in the collection, represented in GAML format. The DTA folder contains the same collection of spectra, with a single DTA file for each spectrum. The numbering system is the same for the "-c" file and the DTA files. The numbering is consistent throughout the data sets.
-gs - "Genome stochastic" - This file contains the list of peptides found by searching all of the available spectra against the translation of the human genome (34c NCBI 34). All of the assignments have expectation values e > 0.001, as calculated by X! Tandem. In cases where more than one protein contained the matching peptide, only the best protein sequence assignment is included with each peptide. It should be noted that a number of the assignments in this file may be correct, especially those with expectation values near the cut-off value of 0.001.
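The naming rule and the expectation-value cut-off described above can be sketched in a few lines. This is an illustration only: the ".xml" extension, the function names, and the handling of a value exactly equal to 0.001 are assumptions, since the page states only that a file name is the collection name plus a suffix and that the two genome files hold e < 0.001 and e > 0.001 respectively.

```python
# Suffixes and labels as listed in the table above.
SUFFIXES = {
    "gv": "Genome Valid",
    "ngv": "Non-Genome Valid",
    "c": "Complete",
    "gs": "Genome stochastic",
}

CUTOFF = 0.001  # expectation-value cut-off quoted in the table


def xml_file_names(collection: str) -> dict:
    """Map each suffix to a file name for the given collection.

    Assumption: names look like "<collection>-<suffix>.xml"; the
    extension is a guess and is not stated on the page.
    """
    return {suffix: f"{collection}-{suffix}.xml" for suffix in SUFFIXES}


def genome_file_for(expectation: float) -> str:
    """Return the suffix of the genome-search file an assignment falls in.

    Assumption: a value exactly equal to the cut-off is treated as
    stochastic; the page does not specify this case.
    """
    return "gv" if expectation < CUTOFF else "gs"


names = xml_file_names("amethyst")
for suffix, name in names.items():
    print(f"{name}: {SUFFIXES[suffix]}")
print(genome_file_for(1e-5), genome_file_for(0.05))
```

Note that the "-ngv" file cannot be derived from the expectation value alone, because it comes from a second search of the "-gs" spectra against HIT and IPI.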
Copyright © 2004-2011, The Global Proteome Machine Organization