The X! search engine project

X! Search Engine Development

  X! P3 (Proteotypic Peptide Profiler) Project

The idea of using "proteotypic peptides" is a relatively new notion in protein/peptide identification. It recognizes that when a protein is cleaved into peptides, not all of the peptides are equally likely to be detected by current mass spectrometry-based techniques. Some peptides from a particular protein sequence are detected easily, while others are very difficult to find. The peptides generated from a sequence that are always detected are called proteotypic, i.e., those peptides alone are indicative of the presence of a particular protein.

This idea suggests that it should be possible to scan through a set of data, for example an LC/MS/MS run, looking only for the known proteotypic peptides for a particular organism. Finding those proteotypic peptides is enough to know that the protein was present in the original sample. Because there will only be a few proteotypic peptides for a protein, it should be possible to improve both the speed and accuracy of the resultant protein identifications.

The X! P3 (Proteotypic Peptide Profiler) project is the first publicly available search engine that takes advantage of this idea. Built using the X! TANDEM refinement idea and the open source X! TANDEM code, X! P3 takes the proteotypic peptide idea to its logical conclusion by adding a few simple steps. Rather than simply identifying the proteins, it uses a proteotypic approach to find protein sequences and then applies refinement to the full spectrum data set to find all of the peptides present, while also looking for post-translational modifications, point mutations and unanticipated peptide cleavages. It works this way:

  1. In the first round, the spectrum data set is examined for the presence of proteotypic peptides.
  2. The full protein sequences of the proteins identified in the first round are then pulled from a sequence library.
  3. Using this small set of full sequences, multiple rounds of refinement are performed to extract all of the non-proteotypic peptides from the full spectrum data set.
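
The three steps above can be sketched in Python as follows. This is only an illustration of the control flow, not the X! P3 implementation: it assumes the spectra have already been reduced to a set of observed peptide strings, it uses a simplified tryptic digest (cleave after K or R, no proline rule), and it replaces real refinement (which also searches for modifications, point mutations and unanticipated cleavages) with a plain lookup. The names `digest_tryptic` and `profile`, and the toy data, are hypothetical.

```python
from typing import Dict, List, Set


def digest_tryptic(sequence: str) -> List[str]:
    """Simplified digest: cleave after every K or R (assumption: no proline rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def profile(
    observed: Set[str],
    proteotypic: Dict[str, Set[str]],
    library: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return, per identified protein, the observed peptides from its full sequence."""
    # Step 1: keep proteins with at least one proteotypic peptide in the data.
    hits = [prot for prot, peps in proteotypic.items() if peps & observed]
    # Step 2: pull the full sequences of those proteins from the sequence library.
    sequences = {prot: library[prot] for prot in hits if prot in library}
    # Step 3: stand-in for refinement; match all peptides of each full sequence.
    return {
        prot: [pep for pep in digest_tryptic(seq) if pep in observed]
        for prot, seq in sequences.items()
    }


# Toy data (hypothetical): P2 has an observed peptide, but not a proteotypic one.
library = {"P1": "AAAKCCCRDDD", "P2": "EEEKFFFR"}
proteotypic = {"P1": {"CCCR"}, "P2": {"FFFR"}}
observed = {"CCCR", "AAAK", "EEEK"}
result = profile(observed, proteotypic, library)
```

In this toy run only P1 passes the first round, so the later steps work with a single full sequence rather than the whole library. That reduction in search space is where the expected speed gain comes from.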

A potential problem with this type of approach is the lack of a good set of proteotypic peptides to use. This has been solved through the GPMDB, which is the largest collection of proteomics data available to the public. By querying GPMDB to find the best peptides representative of a particular protein, it is now possible to produce very good quality libraries of these peptides for two model organisms, namely Homo sapiens and Saccharomyces cerevisiae, as well as several commonly observed experimental artifacts, such as BSA and trypsin. The sequence libraries are updated daily from GPMDB, so the system can learn about new proteotypic peptides as they are generated by the overall Global Proteome Machine.
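
As a rough illustration of how such a library could be derived from a collection of observations, the sketch below picks the peptides of one protein that were detected in at least a given fraction of the experiments in which that protein was identified. The criteria GPMDB actually uses are not described here, so the function name `select_proteotypic`, the `min_fraction` threshold and the input layout are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Set


def select_proteotypic(
    identifications: Iterable[Set[str]],
    min_fraction: float = 0.9,
) -> List[str]:
    """Pick peptides seen in at least `min_fraction` of a protein's identifications.

    `identifications` holds one set of detected peptides per experiment in which
    the protein was identified (hypothetical input layout).
    """
    runs = list(identifications)
    if not runs:
        return []
    # Count in how many experiments each peptide was detected.
    counts = Counter(pep for run in runs for pep in run)
    return sorted(pep for pep, n in counts.items() if n / len(runs) >= min_fraction)


# Toy observations for a single protein across four experiments (hypothetical).
runs = [{"AAAK", "CCCR"}, {"AAAK"}, {"AAAK", "DDD"}, {"AAAK", "CCCR"}]
strict = select_proteotypic(runs)                    # default threshold of 0.9
relaxed = select_proteotypic(runs, min_fraction=0.5)
```

With the default threshold only the peptide found in every experiment is kept. Lowering the threshold admits peptides that are detected often but not always, which trades specificity for coverage.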

An X! P3 server has been established for these model species. Please give it a try. We will be releasing the X! P3 code through our code repository and the sequence libraries by FTP at ftp://ftp.thegpm.org/proteotypic_peptide_profiles.

Copyright © 2004-2011, The Global Proteome Machine Organization