The Global Proteome Machine Organization
   GPM Blog
Search engine basics 101 #4, by Ron Beavis (2014/09/25)
Chromatographic influences on peptide identification rate calculations.
It is very common to analyze the set of spectra generated by an HPLC MS/MS experiment as a group, thereby obtaining an ordered set of peptide-to-spectrum matches (PSMs). The matches are then examined and statistical QA/QC measures applied, resulting in a reduced set of PSMs assumed to be true positive assignments. The efficiency of this process is often characterized by calculating the ratio of the number of true positive PSMs to the total number of MS/MS spectra acquired. This ratio (R) is also frequently used to characterize the performance of one identification algorithm versus others more ...
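The ratio R described in this teaser can be written out as a short calculation. The following is only a minimal sketch: the function name and the counts in the usage comment are hypothetical, and "accepted PSMs" stands in for the reduced set of PSMs assumed to be true positives.

```python
def identification_rate(accepted_psms: int, total_spectra: int) -> float:
    """Return R, the fraction of acquired MS/MS spectra that yielded an accepted PSM."""
    if total_spectra <= 0:
        raise ValueError("total_spectra must be positive")
    if not 0 <= accepted_psms <= total_spectra:
        raise ValueError("accepted_psms must lie between 0 and total_spectra")
    return accepted_psms / total_spectra


# Hypothetical run: 10,000 spectra acquired, 2,500 PSMs kept after QA/QC.
# identification_rate(2500, 10000) gives R = 0.25
```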
Data set of the week: (2014/9/15)
Uncovering global SUMOylation signaling networks in a site-specific manner.
Overall rating: excellent data (leading the field)
This data set consisted of 37 results, using affinity pulldown followed by reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD001061. It has been published by Hendriks IA, D'Souza RC, Yang B, Verlaan-de Vries M, Mann M and Vertegaal AC, Nat Struct Mol Biol. 2014 Sep 14 (PubMed).
Through a clever, well thought-out experimental scheme, this group has made it possible to study human protein SUMOylation on a large scale, with the same level of sensitivity and precision as is available for ubiquitylation. Their methods and results will set the standard for devising studies to investigate the biological function of this intriguing, reversible post-translational modification. They show quite convincingly that there is a wider range of lysine acceptor sites available for SUMOylation than only those predicted by the canonical motif φ-K-X-[DE], where φ is a hydrophobic residue and K is the acceptor site. Some proteins are shown to have many modifiable lysines. The results also demonstrate a considerable overlap between previously observed ubiquitylation sites and this set of SUMOylation sites. The sample preparation, chromatography and mass spectrometry were all first rate and their SUMO-site sequence tag generated easy-to-identify peptides.
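For readers who want to see what the canonical motif φ-K-X-[DE] predicts for a given sequence, a minimal sketch follows. The set of residues treated as hydrophobic (φ) is an assumption made for illustration, as are the function name and the example sequence; none of them come from the study.

```python
import re

# Assumption: residues counted as "hydrophobic" (φ). The study may use a different set.
HYDROPHOBIC = "AVILMFPW"

# A lookahead is used so that overlapping motif instances are all reported.
MOTIF = re.compile(rf"(?=[{HYDROPHOBIC}]K[A-Z][DE])")


def canonical_sumo_sites(sequence: str) -> list[int]:
    """Return 1-based positions of lysines that sit in a φ-K-X-[DE] motif."""
    seq = sequence.upper()
    # m.start() is the 0-based index of φ; the lysine follows it directly.
    return [m.start() + 2 for m in MOTIF.finditer(seq)]


# Hypothetical sequence with two motif instances (VKQE and IKTD).
print(canonical_sumo_sites("MAVKQEAAIKTDG"))
```

A scan like this only finds motif-conforming lysines; the point made above is that the observed acceptor sites extend beyond this set.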
Homo sapiens and Mus musculus sequence and variant updates (2014/8/28)
The protein sequences used by GPM's public search sites have been updated to the most recent human and mouse ENSEMBL releases, H. sapiens v.76 (Genome Reference Consortium Human Build 38, GENCODE 20) and M. musculus v.76 (Genome Reference Consortium Mouse Reference 38 p.2, GENCODE M3). The matched single amino acid variants listings for each species have also been updated to the most recent set of non-synonymous single nucleotide variants available from ENSEMBL BioMart. The proteotypic profile files for X! P3 and annotated spectrum library files for X! Hunter have also been updated to correspond to these new protein sequence sets.
Search engine basics 101 #3, by Ron Beavis (2014/08/27)
Using parent ion mass accuracy histograms to understand deamidation.
Intact, folded proteins are relatively stable to environmental conditions, but they are slowly degraded by non-enzymatic chemical reactions, such as the oxidation of methionine or tryptophan residues or the hydrolysis of Asp-Pro bonds. However, once a protein has been cleaved into peptides in preparation for proteomics analysis, all of these degradation processes speed up due to the removal of the steric constraints provided by a folded protein's 2° and 3° structure. One of the degradation reactions that becomes particularly prominent in digested peptides is the conversion of Gln to Glu and Asn to Asp by the deamidation reaction (Yang et al.). This reaction results in a peptide mass change of 0.98402 Da, which is easy to detect using high-resolution mass spectrometry more ...
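The 0.98402 Da figure quoted above follows from the elemental change in deamidation: a side-chain amide (-NH2) becomes a hydroxyl (-OH), so the peptide gains one O and loses one N and one H. The sketch below reproduces that arithmetic from standard monoisotopic atomic masses; the helper function and its name are illustrative assumptions.

```python
# Monoisotopic atomic masses in Da.
MONO = {
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}

# Deamidation: gain O, lose N and H.
DEAMIDATION_SHIFT = MONO["O"] - MONO["N"] - MONO["H"]


def deamidated_mass(peptide_mass: float, n_sites: int = 1) -> float:
    """Return the neutral peptide mass after n_sites deamidation events."""
    return peptide_mass + n_sites * DEAMIDATION_SHIFT


print(f"{DEAMIDATION_SHIFT:.5f} Da per deamidated Asn or Gln")
```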
Search engine basics 101 #2, by Ron Beavis (2014/08/17)
Using parent ion mass accuracy to evaluate subpopulations found in peptide MS/MS data.
When developing peptide identification algorithms one of the biggest problems is trying to evaluate the effects of changing some feature of the algorithm or its input parameters. You want to maximize the number of true positive identifications without adding false positives, but often it is difficult to be sure that you have achieved that end. The target-decoy simulation method is commonly used to assist in making this type of decision. However, this single simulation can be difficult to interpret when a data set is composed of multiple distributions more ...
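As a rough illustration of the target-decoy idea mentioned above, the sketch below estimates a false discovery rate at a score threshold by counting decoy hits relative to target hits. This is one common, simple estimator and not necessarily the one used in the full article; the function name, the (score, is_decoy) tuple layout and the scores are all hypothetical.

```python
def estimate_fdr(psms: list[tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR as (decoy hits) / (target hits) among PSMs scoring >= threshold.

    Each PSM is a (score, is_decoy) pair; higher scores are assumed to be better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    return decoys / targets


# Hypothetical PSM list: (score, is_decoy)
psms = [(50, False), (45, False), (40, True), (35, False), (30, True), (20, False)]
print(estimate_fdr(psms, threshold=35))
```

Because the estimate is a single number for the whole data set, it can hide differences between subpopulations of PSMs, which is the interpretive difficulty the teaser raises.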
Search engine basics 101 #1, by Ron Beavis (2014/08/11)
An example of why simple ion counting is not commonly used as a score in proteomics search engines.
Recently, the idea has been put forward that simple "ion counting" can be used as a practical, reliable scoring system for proteomics software tasked with identifying peptides based on tandem mass spectra (Wenger CD, et al., and Zhang B, et al.). The algorithm associated with this scoring system is simpler than those employed by the existing search engines (e.g., Mascot, X! Tandem, OMSSA, etc.), making practical implementation easier and the results more immediately comprehensible more ...
Data set of the week: (2014/7/28)
Extracellular matrix signatures of human primary metastatic colon cancers and their metastases to liver.
Overall rating: very good data (general interest)
This data set consisted of 176 results, using multidimensional chromatography (off-gel electrophoresis followed by HPLC) prior to mass spectrometry. The data files were made available through MassIVE, MSV000078555. It has been published by Naba A, Clauser KR, Whittaker CA, Carr SA, Tanabe KK and Hynes RO, BMC Cancer. 2014 Jul 18;14(1):518 (PubMed).
This data comes from a well-done study of the differences in extracellular matrix associated with colon cancer, comparing the extracellular matrix found in normal colon tissue, cancerous tissue and metastatic colon cancer tissue found in the liver. All of the data was collected from clinical samples and it nicely demonstrates the type of data and protein identifications that can be obtained from this type of very heterogeneous, largely non-cellular tissue extract. The methods used seemed to work quite well, although the data shows some difficulties with maintaining the parent ion mass calibration, a common problem with the last few generations of Orbitrap-based instruments.
Data set of the week: (2014/7/19)
Development and performance evaluation of an ultra-low flow nano liquid chromatography-tandem mass spectrometry set-up.
Overall rating: excellent data (leading the field)
This data set consisted of 39 results, exploring the role of HPLC flow rate and analysis duration in LC/MS/MS measurements. The data files were made available through ProteomeXchange, PXD000396. It has been published by Köcher T, Pichler P, Pra MD, Rieux L, Swart R and Mechtler K, Proteomics. 2014 Jun 11 (PubMed).
This data set represents a tour de force exploring the relationship between chromatographic methods and proteomics results. This group has achieved a degree of reproducibility and quality control using their nanoLC system that clearly leads the field. From a technical point of view, many of the LC/MS/MS runs (e.g., 120312QEx2_RS1_20nl-min_0k1HeLa_14h_01.msf) were simply the best we've ever seen in 10 years of operation. Anyone interested in studying the relationship between quality parameters — dynamic range, LOD or LOQ — and the number of spectra acquired should examine this data set carefully.
GO annotation listing build complete for three model species (2014/7/9)
Following the usual quarterly update of the human and mouse proteome guides (v. 15), the listings for human, mouse and yeast GO code annotations were also rebuilt. The resulting files contain the proteins associated with a specific GO code, the number of times the protein has been observed and the usual GPMDB evidence code for these proteins. The GO codes available are indexed and the individual listings are available from the main GPMDB site in three text formats:
  1. H. sapiens, 11,958 GO categories;
  2. M. musculus, 11,339 GO categories; and
  3. S. cerevisiae, 4,309 GO categories.
The text files and HTML indexes are also available for download via FTP.
Data set of the week: (2014/7/7)
Proteomic analysis of the multimeric nuclear egress complex of human cytomegalovirus
Overall rating: very good data (general interest)
This data set consisted of 24 results, using LC/MS/MS to probe the consequences of siRNA gene silencing experiments. The data files were made available through ProteomeXchange, PXD000536. It has been published by Milbradt J, Kraut A, Hutterer C, Sonntag E, Schmeiser C, Ferro M, Wagner S, Lenac T, Claus C, Pinkert S, Hamilton ST, Rawlinson WD, Sticht H, Coute Y and Marschall M, Mol Cell Proteomics. 2014 Jun 26 (PubMed).
Human cytomegalovirus (a.k.a. HCMV, CMV and Human herpesvirus 5) infections are extremely common (> 50% of the population). The virus does not produce clinical symptoms in most of the infected, but it remains dormant for long periods of time and can result in serious disease in immunocompromised individuals. It can also be passed from the mother to the fetus and give rise to developmental abnormalities. This study does a good job of demonstrating the utility of combining proteomics and siRNA techniques for the study of viral protein production. The sample preparation, chromatography and mass spectrometry are well done. Any group interested in studying viral dynamics in host cells using proteomics should take a look at the methods used to generate this data set and the results obtained from these studies.
Data set of the week: (2014/6/29)
Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives.
Overall rating: very good data (general interest)
This data set consisted of 249 results, using affinity pull-down sample preparation and LC/MS/MS analysis with SILAC quantitation. The data files were made available through ProteomeXchange, PXD000143. It has been published by Spruijt CG, Gnerlich F, Smits AH, Pfaffeneder T, Jansen PW, Bauer C, Münzel M, Wagner M, Müller M, Khan F, Eberl HC, Mensinga A, Brinkman AB, Lepikhov K, Müller U, Walter J, Boelens R, van Ingen H, Leonhardt H, Carell T and Vermeulen M, Cell. 2013 Feb 28;152(5):1146-59 (PubMed).
This study utilizes the same methods commonly used to determine protein-protein interactions to determine which proteins have a special affinity for DNA containing 5-methylcytosine, 5-(hydroxy)methylcytosine, 5-formylcytosine and 5-carboxylcytosine. The experiments were performed using mouse embryonic stem cells as the source of potential interactor proteins. The experiments were consistently done and the analysis was of very good quality. The proteins selected showed considerable enrichment of those known to be part of the nucleolus, nucleus, ribonucleoprotein complex, ribosome and spliceosome. This method of sample preparation produced many of the best observations of comparatively rare gene products, such as Pcgf1:p, Aurkc:p and Gm5590:p.
Data set of the week: (2014/6/23)
The Global Phosphoproteome of Chlamydomonas reinhardtii Reveals Complex Organellar Phosphorylation in the Flagella and Thylakoid Membrane.
Overall rating: excellent data (worth study)
This data set consisted of 6 results, using a multi-step phosphopeptide enrichment strategy followed by a multidimensional chromatography separation using a HILIC initial separation and subsequent reversed-phase HPLC. The data files were made available through ProteomeXchange, PXD000783. It has been published by Wang H, Gau B, Slade WO, Juergens M, Li P and Hicks LM, Mol Cell Proteomics. 2014 Jun 10 (PubMed).
Chlamydomonas reinhardtii is a widely used model algal species. It is unicellular with two flagella and it is capable of photosynthesis. The organism is commonly found in the environment and it can be grown under very minimal conditions compared to most eukaryotes. This study used global phosphoproteomics methods to determine how the organism utilizes protein phosphorylation in its metabolic processes. The results showed good enrichment of phosphopeptides (> 70% of identified spectra). The ratio of S:T phosphorylation was a little lower than in many other eukaryotes (about 4:1), and the degree of proline-directed phosphorylation detected was noticeably less than normally found in mammalian studies. The data quality was excellent and these spectra would be suitable for developing algorithms or testing computational biology methods for phosphoprotein biology.
Data set of the week: (2014/6/15)
A Candida albicans PeptideAtlas.
Overall rating: excellent data (leading the field)
This data set consisted of 148 results, from 16 distinct experiments. The data files were made available through PeptideAtlas, PASS00402, PASS00408, PASS00476 and PASS00447. It has been published by Vialas V, Sun Z, Loureiro y Penha CV, Carrascal M, Abián J, Monteoliva L, Deutsch EW, Aebersold R, Moritz RL and Gil C, J Proteomics. 2014 Jan 31;97:62-8 (PubMed).
Candida albicans is a fungus that can exist either as single cells or filaments. It is a commensal organism in H. sapiens, occupying the oral cavity and gastrointestinal tract in most of the population. C. albicans can also cause a variety of infections, particularly in oral and genital tissues, in immunocompromised individuals. It belongs to a large group of fungi, the mitosporic Saccharomycetales, that contains many human pathogenic organisms. Unfortunately, these fungi have not had much attention from the proteomics community. This dataset starts to correct this problem, defining the observable peptides and proteins from C. albicans samples under a variety of experimental conditions. The sample preparation and separations were very well done and the mass spectrometry was state-of-the-art.
Data set of the week: (2014/6/7)
Functional annotation of proteome encoded by human chromosome 22; and
A draft map of the human proteome.
Overall rating: excellent data (general interest)
This data set consisted of 84 results, each one a summary of individual LC/MS/MS runs associated with multidimensional chromatography analyses of individual tissue samples. The data files were made available through ProteomeXchange, PXD000561. It has been published by Pinto SM, Manda SS, Kim MS, Taylor K, Selvan LD, Balakrishnan L, Subbannayya T, Yan F, Prasad TS, Gowda H, Lee C, Hancock WS, and Pandey A, J Proteome Res. 2014 Jun 6;13(6):2749-60 (PubMed).
This set of data was one of the first attempts to broadly sample human tissues using similar experimental methods for each sample. It contained some of the first publicly available data for several tissues, in some cases from both fetal and adult samples. Analysis of the data produced numbers of protein identifications typical for the methods used, although the results for some tissues (e.g., liver, heart) were surprisingly variable. Overall, the chromatography and mass spectrometry were well done and consistent between samples. There was considerable variability between the samples with respect to the presence of detectable experimental artifacts caused by the modification of free peptide amines: both N-terminal and lysine side chain amines were either carbamylated or carboxyamidomethylated to a significant extent. These artifacts made the data of limited use for detecting some modifications (particularly acetylation or ubiquitination) or amino acid polymorphisms. Other modifications that were not easily confused with these artifacts were present and available for interpretation. For example, differences in the hydroxyproline distributions on many collagen subunits could be readily observed in different tissues. The phosphorylation states of some common proteins could also be readily observed across multiple tissues.
The Contest: testing large-scale proteomics information systems
In addition to the publication listed above, this data was also the basis for "A draft map of the human proteome", describing the web site www.humanproteomemap.org. The purpose of the web site was to allow researchers to enter a list of gene symbols and then display the relative amount of the associated protein that was detected in each of the tissues examined. The data was analyzed using methods commonly used for small, single LC/MS/MS runs applied to these much larger data sets.
One of the best ways to evaluate this type of informatics system is to perform "sanity" tests to see how well the output of the system corresponds to known patterns of protein expression. Since evaluating this type of system is an important skill for anyone who wants to be involved in large-scale proteomics, we thought it would be an excellent subject for a contest. Two lists of genes were selected to probe the quality and utility of the proteomics information available and the results of querying www.humanproteomemap.org with these lists were downloaded as PDF files:
Everyone with an interest in the subject is invited to take a look at these two results and write a 250-word essay on their implications for the biological, technical and biomedical utility of the web site's information. The best essay will be published on this blog (anonymously if you prefer) and the author will receive a beautiful GPMDB T-shirt. Submit your entries by email to contact@thegpm.org, using the subject line "GPMDB T-Shirt contest". Please stick to the facts as much as possible: sarcasm, irony or ad hominem comments will count against any entry. Entries may be submitted until midnight July 1, 2014. Multiple entries from the same individual are allowed, but the author must clearly identify themselves in the email. The winner will be announced July 7, 2014.
Data set of the week: (2014/5/25)
Virion proteome of Cafeteria roenbergensis virus strain BV-PW1.
Overall rating: excellent data (general interest)
This data set consisted of 10 results: 9 gel bands and a summary set of identifications. The data files were made available through ProteomeXchange, PXD000993. It has not yet been published, but was submitted by Matthias Fischer (Max Planck Institute for Medical Research) and Leonard Foster (University of British Columbia).
This elegant dataset neatly wraps up the preliminary work on the proteome of a recently discovered nucleocytoplasmic large DNA virus, the Cafeteria roenbergensis virus. The host species, Cafeteria roenbergensis, is a marine flagellate that consumes bacteria in coastal water. The virus has a very large genome of about 730,000 base pairs of dsDNA and 1,096 predicted proteins. The virus is also large enough that it can be infected by a virophage, the Mavirus. Not only is the virus biologically interesting, but the data is one of the best we've run across for testing peptide identification algorithms and the theory behind them. The chromatography and mass spectrometry were both very well done and the spectra are ideal for detecting common artifactual modifications that can be masked by dodgy experimental technique, such as deamidation and peptide N-terminus cyclization. It is also useful for trying to understand how to think about the problem of balancing sensitivity versus selectivity and false positive versus false negative assignments.
Copyright © 2013, The Global Proteome Machine Organization.