The Quartz Project is an effort to create collections of
annotated MS/MS data files that bioinformatics groups can use to
test and validate algorithms for peptide modelling and
identification. It consists of two types of collections:
- GPMDB-based (e.g., jasper); and
- project-based (e.g., amethyst).
GPMDB-based
- jasper
These collections contain information gathered directly from the public data available
in the GPMDB. The purpose of these data sets is to give bioinformatics researchers
direct access to mass spectra that have been assigned to peptide sequences. The data
are made available as XML files that contain the peptide assignment, the spectra,
and the information required to tie a particular entry in a file to the
corresponding entry in the GPMDB.
Project-based
- amethyst
- opal
Each of the project-based collections will be posted in two forms:
in the standard X! Tandem annotation format (XML), and as
individual DTA-formatted spectra corresponding to each entry
in the XML collection.
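
For readers who want to load the individual spectra, the following is a minimal sketch of a DTA reader. It assumes the conventional DTA text layout (a first line holding the singly protonated precursor mass and the charge state, followed by one "m/z intensity" pair per line); the names `DtaSpectrum` and `parse_dta` are illustrative and are not part of the Quartz Project.

```python
from typing import List, NamedTuple, Tuple


class DtaSpectrum(NamedTuple):
    precursor_mh: float                # precursor mass as (M+H)+
    charge: int                        # precursor charge state
    peaks: List[Tuple[float, float]]   # (m/z, intensity) pairs


def parse_dta(text: str) -> DtaSpectrum:
    """Parse the text of one DTA file (assumed conventional layout)."""
    # Split every non-blank line into whitespace-separated fields.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 2:
        raise ValueError("DTA header line must hold a mass and a charge")
    precursor_mh = float(rows[0][0])
    charge = int(rows[0][1])
    # Every remaining line is one fragment peak.
    peaks = [(float(fields[0]), float(fields[1])) for fields in rows[1:]]
    return DtaSpectrum(precursor_mh, charge, peaks)


# Made-up values, for illustration only.
sample = "1234.56 2\n175.1 10\n300.2 20.5\n"
spectrum = parse_dta(sample)
```

Reading a file from the DTA directory would then amount to passing its contents to `parse_dta`, for example `parse_dta(open(path).read())`.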
Each of these collections contains four XML files and a directory of
numbered DTA files. Each XML file name is composed of the name
of the collection and a suffix:
File suffix | Description |
-gv | "Genome Valid" - This file contains the list of peptides found by searching all of the available spectra against the translation of the human genome (34c NCBI 34). All of the assignments have expectation values e < 0.001, as calculated by X! Tandem. In cases where more than one protein contained the matching peptide, only the best protein sequence assignment is included with each peptide. |
-ngv | "Non-Genome Valid" - This file contains the list of peptides obtained by searching the "stochastic" spectra found in the "-gs" file against the Human Invitational Database (HIT) and the International Protein Index Database (IPI). All of the assignments have expectation values e < 0.001, as calculated by X! Tandem. If the peptide was found in both databases, the entry from HIT was recorded. The peptides were checked against the "-gs" file: if the same sequence was assigned in that file with an expectation value near the cut-off value of 0.001, the peptide was not included in this file. |
-c | "Complete" - This file contains all of the spectra in the collection, represented in GAML format. The DTA folder contains the same collection of spectra, with a single DTA file for each spectrum. The numbering system is the same for the "-c" file and the DTA files. The numbering is consistent throughout the data sets. |
-gs | "Genome Stochastic" - This file contains the list of peptides found by searching all of the available spectra against the translation of the human genome (34c NCBI 34). All of the assignments have expectation values e > 0.001, as calculated by X! Tandem. In cases where more than one protein contained the matching peptide, only the best protein sequence assignment is included with each peptide. It should be noted that a number of the assignments in this file may be correct, especially those with expectation values near the cut-off of 0.001. |
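
The naming scheme and the expectation-value cut-off described in the table can be sketched in a few lines of Python. This is only an illustration: the ".xml" extension, the function names, and the treatment of a value exactly equal to 0.001 are assumptions, since the text above specifies only the suffixes and the e < 0.001 and e > 0.001 conditions.

```python
# Suffixes and their meanings, as listed in the table above.
SUFFIXES = {
    "gv": "Genome Valid",
    "ngv": "Non-Genome Valid",
    "c": "Complete",
    "gs": "Genome Stochastic",
}

# Expectation-value cut-off that separates "-gv" from "-gs" assignments.
EXPECT_CUTOFF = 0.001


def collection_files(name: str, extension: str = ".xml") -> dict:
    """Build the four XML file names for a collection.

    The extension is an assumption; the source gives only name + suffix.
    """
    return {suffix: f"{name}-{suffix}{extension}" for suffix in SUFFIXES}


def genome_file_for(expect: float) -> str:
    """Return the suffix of the genome-search file an assignment belongs in.

    Values below the cut-off go to "gv"; everything else goes to "gs".
    Placing e == 0.001 in "gs" is an arbitrary choice made for this sketch.
    """
    return "gv" if expect < EXPECT_CUTOFF else "gs"


files = collection_files("amethyst")
```

Here `files["gv"]` would be "amethyst-gv.xml" under the assumed extension, and `genome_file_for(0.0005)` would select the "gv" file.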