The Global Proteome Machine Organization
   News Archive
Data set of the week: (2014/12/20)
Different binding motifs of the celiac disease-associated HLA molecules DQ2.5, DQ2.2, and DQ7.5 revealed by relative quantitative proteomics of endogenous peptide repertoires.
Overall rating: excellent data (leading the field)
This data set consisted of 51 results, analyzed by reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD001205. It has been published by Bergseng E, Dorum S, Arntzen MO, Nielsen M, Nygard S, Buus S, de Souza GA, and Sollid LM, Immunogenetics. 2014 Dec 12 (PubMed).
The combination of an interesting question, experimental protocol, sample preparation and excellent experimental technique makes these results truly remarkable. This study focusses on the very specific peptides bound to selected types of MHC II-type antigen presentation complexes, demonstrating that it is possible to selectively (and sensitively) observe these biologically and clinically important peptides. The peptide signals themselves are strong and unambiguous, making this data set probably the most interesting collection of endogenous peptides to be made publicly available to date. This data set should appeal to immunologists (who may be puzzled by the proteins represented), computational biologists (who should want to understand why these peptides were chosen), bioinformaticians (who want to understand the observation of non-tryptic peptides) and clinicians (who want to understand the immunological basis of celiac disease).
Data set of the week: (2014/12/10)
Adenovirus composition, proteolysis, and disassembly studied by in-depth qualitative and quantitative proteomics.
Overall rating: very good data (specialist interest)
This data set consisted of 5 results, using three different protease digests analyzed by reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD000591. It has been published by Benevento M, Di Palma S, Snijder J, Moyer CL, Reddy VS, Nemerow GR, and Heck AJ, J Biol Chem. 2014 289:11421-30 (PubMed).
This study nicely demonstrates the level of detail regarding a virus' proteome that can be obtained in short order using modern methods. With only five experiments, the authors were able to almost fully characterize the proteins present in a type 5 human adenovirus (HAdV) vector. They could then use this information to create SRM assays for each of the proteins in the vector and use the assays to perform quantitative experiments. While not emphasized in the manuscript, the data also makes it possible to determine which human proteins co-purify with the viral particles, although it doesn't contain enough information to determine whether these proteins were incorporated into the virions themselves or simply adhered during purification.
Data set of the week: (2014/12/2)
Site-specific mapping and quantification of protein S-sulphenylation in cells.
Overall rating: very good data (specialist interest)
This data set consisted of 24 results, using a combination of affinity isolation followed by reversed phase HPLC MS/MS. The data files were made available through the CPTAC Portal. It has been published by Yang J, Gupta V, Carroll KS and Liebler DC, Nat Commun. 2014 Sep 1;5:4776 (PubMed).
This study makes use of an interesting click-chemistry reagent as part of an affinity purification scheme to isolate peptides that had the transitory PTM cysteine sulphenylation. The PTM (the oxidation of the sulphydryl side chain of cysteine, SH -> S-OH) had been previously detected at the protein level, but this study is the first to track it back to the specific cysteine acceptor sites that carry the modification. In addition to identifying the acceptor residues, the reagent had a +6 Da "heavy" version that allowed for relative quantitation studies. The experiments were well done and the reagent appears to work very well for the purpose, resulting in LC/MS/MS runs with more than 15% of identified peptides corresponding to the desired modification. The DYn-2-triazohexanoic acid modification produced a significant shift in the chromatographic retention to later in the gradient for the labelled peptides, making validation of the identifications very straightforward.
GPMDB REST API, version 2 (2014/11/27)
We have begun to roll out the version 2 features of the GPMDB REST API. The first set of the new version 2 methods (listed here) was designed to make it easy to determine which bases on the human genome are associated with specific post-translational modifications (PTMs) that have been observed in the proteome and recorded in GPMDB. The PTM acceptor sites have been curated to the ENSEMBL v. 70 human proteome and the GRCh37 version of the human genome. The methods can be used for interpreting the results of any genome or transcriptome study that discovered missense nucleotide variants in terms of the effect of those variants on the PTM status of the associated protein splice variants in ENSEMBL v. 70.
Version 2 is a stand-alone set of new methods. The methods associated with version 1 (listed here) will remain the same and will be accessible from the same URLs as before. No changes to the version 1 interface are contemplated at this time.
Data set of the week: (2014/11/25)
Characterization of native protein complexes and protein isoform variation using size-fractionation-based quantitative proteomics.
Overall rating: excellent data (worth study)
This data set consisted of 120 results, using a combination of native size-exclusion chromatographic (SEC) separation followed by reversed phase HPLC MS/MS of the SEC fractions. The data files were made available through ProteomeXchange, PXD001220. It has been published by Kirkwood KJ, Ahmad Y, Larance M, and Lamond AI, Mol Cell Proteomics. 2013 12:3851-73 (PubMed).
This rather innovative study examines the use of native size exclusion chromatography to isolate protein complexes and conventional LC MS/MS to assess their protein composition. The SEC method the authors employed worked very well and proved to be an excellent way to produce functionally-related protein fractions. The work was technically first rate and hopefully it will popularize the use of modern SEC methods in proteomics sample preparation.
Data set of the week: (2014/11/17)
Rapid and Deep Proteomes by Faster Sequencing on a Benchtop Quadrupole Ultra-High-Field Orbitrap Mass Spectrometer.
Overall rating: very good data (specialist interest)
This data set consisted of 36 results, using reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD001305. It has been published by Kelstrup CD, Jersie-Christensen RR, Batth TS, Arrey TN, Kuehn A, Kellmann M and Olsen JV, J Proteome Res. 2014 Nov 10 (PubMed).
The goals and results of this study are remarkably similar to those in last week's featured data set (Scheltema, et al.). Many of the details are different, caused by significant differences in the chromatographic methods used in the two studies. This study utilized a gradient that rapidly rose to ~20% organic and then slowly increased to ~45%, followed by an isocratic hold for ~8000 scans. The Scheltema results used a linear gradient from ~5% to ~40% with a short additional gradient to ~55% organic at the end. The method used in the Kelstrup study resulted in a nearly constant peptide identification rate of ~50% throughout the main gradient, falling off sharply during the isocratic portion. The Scheltema study showed a more variable identification rate, starting at ~20% and rising to as much as 70% by the end. Overall the two studies show about the same "depth" in terms of identifications, although the Scheltema study had significantly better identifications for the human Alphapapillomavirus 7 proteins in their HeLa cell line. The overall efficiency of peptide identification was slightly better in last week's study: 4.9 unique residues per spectrum (Scheltema) compared to 4.3 unique residues per spectrum (Kelstrup).
Data set of the week: (2014/11/11)
The Q Exactive HF, a Benchtop Mass Spectrometer with a Pre-filter, High Performance Quadrupole and an Ultra-High Field Orbitrap Analyzer.
Overall rating: very good data (specialist interest)
This data set consisted of 99 results, using reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD001203. It has been published by Scheltema RA, Hauschild JP, Lange O, Hornburg D, Denisov E, Damoc E, Kuehn A, Makarov A and Mann M, Mol Cell Proteomics. 2014 Oct 30 (PubMed).
This study provides a set of rather straightforward analyses of HeLa cell lysates using LC/MS/MS performed using the new Q Exactive HF instrument. The results should be of interest to anyone interested in obtaining or using this MS/MS platform. The LC/MS/MS system performed very well, generating reproducible results with good sensitivity. The results show good identifications of the E7:p protein (human Alphapapillomavirus 7) and EIF5A:pm.K50+hypusine, which are reliable indicators of "depth" in HeLa cell studies. This instrument configuration generates significantly more identifiable peptide fragment ions than the previous generation instruments, making the accurate assignment of PTMs more reliable.
October 2014 Editions of the Mouse and Human Proteome Guides (2014/10/6)
The latest editions (v. 16) of the Guide to the Human Proteome and the Guide to the Mouse Proteome have been released and are available for download and use. They are both available in HTML, CSV (comma-separated value) or XLS (Excel spreadsheet) formats. This release will be the last one to use ENSEMBL 70 for human and ENSEMBL 69 for mouse proteomes: the January 2015 release will use ENSEMBL 76 for both human and mouse sequences.
Search engine basics 101 #4, by Ron Beavis (2014/09/25)
Chromatographic influences on peptide identification rate calculations.
It is very common to analyze the set of spectra generated by an HPLC MS/MS experiment as a group, thereby obtaining an ordered set of peptide-to-spectrum matches (PSMs). The matches are then examined and statistical QA/QC measures applied, resulting in a reduced set of PSMs assumed to be true positive assignments. The efficiency of this process is often characterized by calculating the ratio of the number of true positive PSMs to the total number of MS/MS spectra acquired. This ratio (R) is also frequently used to characterize the performance of one identification algorithm versus others more ...
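The ratio R described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 0-based scan indexing and the fixed window size are assumptions made for the example and are not taken from the article. The second function shows why chromatography matters, since R computed over consecutive blocks of scans can differ considerably from the whole-run value.

```python
def identification_rate(accepted_psms, total_spectra):
    """Return R = accepted (assumed true positive) PSMs / total MS/MS spectra."""
    if total_spectra <= 0:
        raise ValueError("total_spectra must be positive")
    return accepted_psms / total_spectra


def rate_by_scan_window(accepted_scans, total_scans, window):
    """Return R for consecutive windows of `window` MS/MS scans.

    accepted_scans: iterable of 0-based MS/MS scan indices with an accepted PSM
                    (hypothetical input format).
    total_scans:    number of MS/MS scans acquired in the run.
    """
    n_windows = (total_scans + window - 1) // window
    counts = [0] * n_windows
    for scan in accepted_scans:
        counts[scan // window] += 1
    rates = []
    for i, n in enumerate(counts):
        # The last window may contain fewer than `window` scans.
        size = min(window, total_scans - i * window)
        rates.append(n / size)
    return rates
```

Plotting the windowed rates against scan number gives a rough picture of how the identification rate changes over the course of a gradient.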
Data set of the week: (2014/9/15)
Uncovering global SUMOylation signaling networks in a site-specific manner.
Overall rating: excellent data (leading the field)
This data set consisted of 37 results, using affinity pulldown followed by reversed phase HPLC MS/MS. The data files were made available through ProteomeXchange, PXD001061. It has been published by Hendriks IA, D'Souza RC, Yang B, Verlaan-de Vries M, Mann M and Vertegaal AC, Nat Struct Mol Biol. 2014 Sep 14 (PubMed).
Through a clever, well thought-out experimental scheme, this group has made it possible to study human protein SUMOylation on a large scale, with the same level of sensitivity and precision as is available for ubiquitylation. Their methods and results will set the standard for devising studies to investigate the biological function of this intriguing, reversible post-translational modification. They show quite convincingly that there is a wider range of lysine acceptor sites available for SUMOylation than only those predicted by the canonical motif φ-K-X-[DE], where φ is a hydrophobic residue and K is the acceptor site. Some proteins are shown to have many modifiable lysines. The results also demonstrate a considerable overlap between previously observed ubiquitylation sites and this set of SUMOylation sites. The sample preparation, chromatography and mass spectrometry were all first rate and their SUMO-site sequence tag generated easy to identify peptides.
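For readers who want to compare observed sites against the canonical motif φ-K-X-[DE], a minimal Python sketch follows. The set of residues treated as "hydrophobic" and the function name are assumptions made for illustration; the study itself may define φ differently.

```python
import re

# Assumption: residues counted as hydrophobic (phi) for this illustration.
HYDROPHOBIC = "AILMFV"

# A lookahead is used so that overlapping motif instances are all reported.
MOTIF = re.compile(rf"(?=[{HYDROPHOBIC}]K[A-Z][DE])")


def canonical_sumo_sites(sequence):
    """Return 1-based positions of lysines that sit in a phi-K-X-[DE] motif."""
    # m.start() is the 0-based index of phi; the lysine is the next residue.
    return [m.start() + 2 for m in MOTIF.finditer(sequence.upper())]
```

Any experimentally observed SUMOylated lysine that is not in the returned list lies outside the canonical motif, which is the kind of site the study reports in large numbers.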
Homo sapiens and Mus musculus sequence and variant updates (2014/8/28)
The protein sequences used by GPM's public search sites have been updated to the most recent human and mouse ENSEMBL releases, H. sapiens v.76 (Genome Reference Consortium Human Build 38, GENCODE 20) and M. musculus v.76 (Genome Reference Consortium Mouse Reference 38 p.2, GENCODE M3). The matched single amino acid variants listings for each species have also been updated to the most recent set of non-synonymous single nucleotide variants available from ENSEMBL BioMart. The proteotypic profile files for X! P3 and annotated spectrum library files for X! Hunter have also been updated to correspond to these new protein sequence sets.
Search engine basics 101 #3, by Ron Beavis (2014/08/27)
Using parent ion mass accuracy histograms to understand deamidation.
Intact, folded proteins are relatively stable to environmental conditions, but they are slowly degraded by non-enzymatic chemical reactions, such as the oxidation of methionine or tryptophan residues or the hydrolysis of Asp-Pro bonds. However, once a protein has been cleaved into peptides in preparation for proteomics analysis, all of these degradation processes speed up due to the removal of the steric constraints provided by a folded protein's 2° and 3° structure. One of the degradation reactions that becomes particularly prominent in digested peptides is the conversion of Gln to Glu and Asn to Asp by the deamidation reaction (Yang et al.). This reaction results in a peptide mass change of 0.98402 Da, which is easy to detect using high resolution mass spectrometry more ...
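The 0.98402 Da figure can be checked with simple arithmetic: deamidation replaces a side-chain -NH2 group with -OH, so the peptide gains one oxygen and loses one nitrogen and one hydrogen. The sketch below uses standard monoisotopic atomic masses; the helper function and its name are only illustrative.

```python
# Monoisotopic atomic masses in Da (standard values, rounded to 6 decimals).
MASS_H = 1.007825
MASS_N = 14.003074
MASS_O = 15.994915

# Deamidation: -NH2 becomes -OH, i.e. gain O, lose N and H.
DEAMIDATION_SHIFT = MASS_O - MASS_N - MASS_H  # about 0.984016 Da


def shift_in_ppm(peptide_mass, shift=DEAMIDATION_SHIFT):
    """Express a mass shift as parts per million of the peptide mass."""
    return shift / peptide_mass * 1e6
```

For a 1000 Da peptide the shift corresponds to roughly 984 ppm, far outside the parent ion mass tolerance normally used with high resolution instruments.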
Search engine basics 101 #2, by Ron Beavis (2014/08/17)
Using parent ion mass accuracy to evaluate subpopulations found in peptide MS/MS data.
When developing peptide identification algorithms one of the biggest problems is trying to evaluate the effects of changing some feature of the algorithm or its input parameters. You want to maximize the number of true positive identifications without adding false positives, but often it is difficult to be sure that you have achieved that end. The target-decoy simulation method is commonly used to assist making this type of decision. However, this single simulation can be difficult to interpret when a data set is composed of multiple distributions more ...
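As a rough illustration of the target-decoy idea mentioned above, the following Python sketch estimates a false discovery rate at a score threshold. It assumes a combined search against target and decoy sequences of similar size, PSMs supplied as (score, is_decoy) pairs and scores where higher is better; these conventions and the function name are assumptions for the example, and this is one common estimator rather than necessarily the one discussed in the article.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate FDR as decoy hits / target hits at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs (hypothetical input format).
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score < threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        return 0.0  # nothing accepted, so no false discoveries to estimate
    return decoys / targets
```

Because the estimate is a single number for the whole data set, it cannot show whether the accepted PSMs come from one score distribution or from several subpopulations, which is the difficulty the article goes on to examine.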
Search engine basics 101 #1, by Ron Beavis (2014/08/11)
An example of why simple ion counting is not commonly used as a score in proteomics search engines.
Recently, the idea has been put forward that simple "ion counting" can be used as a practical, reliable scoring system for proteomics software tasked with identifying peptides based on tandem mass spectra (Wenger CD, et al., and Zhang B, et al.). The algorithm associated with this scoring system is simpler than those employed by the existing search engines (e.g., Mascot, X! Tandem, OMSSA, etc.), making practical implementation easier and the results more immediately comprehensible more ...
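To make "ion counting" concrete, here is a minimal Python sketch of the general idea: generate the singly charged b and y fragment m/z values for a candidate peptide and count how many have a matching peak in the spectrum. The function names, the fixed 0.02 Da tolerance and the restriction to unmodified peptides and 1+ fragments are simplifying assumptions for illustration; this is not the implementation used in the cited papers.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mzs(peptide):
    """Return singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    mzs = []
    for i in range(1, len(masses)):
        mzs.append(sum(masses[:i]) + PROTON)          # b ion with i residues
        mzs.append(sum(masses[i:]) + WATER + PROTON)  # complementary y ion
    return mzs


def ion_count(peptide, peaks, tol=0.02):
    """Count theoretical fragments with an observed peak within `tol` Da."""
    return sum(
        1 for mz in fragment_mzs(peptide)
        if any(abs(mz - peak) <= tol for peak in peaks)
    )
```

The score is simply an integer count, with no weighting for peak intensity, peptide length or the number of peaks in the spectrum.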
Data set of the week: (2014/7/28)
Extracellular matrix signatures of human primary metastatic colon cancers and their metastases to liver.
Overall rating: very good data (general interest)
This data set consisted of 176 results, using multidimensional chromatography (off-gel electrophoresis followed by HPLC) prior to mass spectrometry. The data files were made available through MassIVE, MSV000078555. It has been published by Naba A, Clauser KR, Whittaker CA, Carr SA, Tanabe KK and Hynes RO, BMC Cancer. 2014 Jul 18;14(1):518 (PubMed).
This data comes from a well done study of the differences in extracellular matrix associated with colon cancer, comparing the extracellular matrix found in normal colon tissue, cancerous tissue and metastatic colon cancer tissue found in the liver. All of the data was collected from clinical samples and it nicely demonstrates the type of data and protein identifications that can be obtained from this type of very heterogeneous, largely non-cellular tissue extract. The methods used seemed to work quite well, although the data shows some difficulties with maintaining the parent ion mass calibration, a common problem with the last few generations of Orbitrap-based instruments.
Data set of the week: (2014/7/19)
Development and performance evaluation of an ultra-low flow nano liquid chromatography-tandem mass spectrometry set-up.
Overall rating: excellent data (leading the field)
This data set consisted of 39 results, exploring the role of HPLC flow rate and analysis duration in LC/MS/MS measurements. The data files were made available through ProteomeXchange, PXD000396. It has been published by Köcher T, Pichler P, Pra MD, Rieux L, Swart R, and Mechtler K, Proteomics. 2014 Jun 11 (PubMed).
This data set represents a tour de force exploring the relationship between chromatographic methods and proteomics results. This group has achieved a degree of reproducibility and quality control using their nanoLC system that clearly leads the field. From a technical point of view, many of the LC/MS/MS runs (e.g., 120312QEx2_RS1_20nl-min_0k1HeLa_14h_01.msf) were simply the best we've ever seen in 10 years of operation. Anyone interested in studying the relationship between quality parameters — dynamic range, LOD or LOQ — and the number of spectra acquired should examine this data set carefully.
GO annotation listing build complete for three model species (2014/7/9)
Following the usual quarterly update of the human and mouse proteome guides (v. 15), the listings for human, mouse and yeast GO code annotations were also rebuilt. The resulting files contain the proteins associated with a specific GO code, the number of times the protein has been observed and the usual GPMDB evidence code for these proteins. The GO codes available are indexed and the individual listings are available from the main GPMDB site in three text formats:
  1. H. sapiens, 11,958 GO categories;
  2. M. musculus, 11,339 GO categories; and
  3. S. cerevisiae, 4,309 GO categories.
The text files and HTML indexes are also available for download via FTP.
Data set of the week: (2014/7/7)
Proteomic analysis of the multimeric nuclear egress complex of human cytomegalovirus
Overall rating: very good data (general interest)
This data set consisted of 24 results, using LC/MS/MS to probe the consequences of siRNA gene silencing experiments. The data files were made available through ProteomeXchange, PXD000536. It has been published by Milbradt J, Kraut A, Hutterer C, Sonntag E, Schmeiser C, Ferro M, Wagner S, Lenac T, Claus C, Pinkert S, Hamilton ST, Rawlinson WD, Sticht H, Coute Y and Marschall M, Mol Cell Proteomics. 2014 Jun 26 (PubMed).
Human cytomegalovirus (a.k.a. HCMV, CMV and Human herpesvirus 5) infections are extremely common (> 50% of the population). The virus does not produce clinical symptoms in most of those infected, but it remains dormant for long periods of time and can result in serious disease in immuno-compromised individuals. It can also be passed from mother to fetus and give rise to developmental abnormalities. This study does a good job of demonstrating the utility of combining proteomics and siRNA techniques for the study of viral protein production. The sample preparation, chromatography and mass spectrometry are well done. Any group interested in studying viral dynamics in host cells using proteomics should take a look at the methods used to generate this data set and the results obtained from these studies.
Data set of the week: (2014/6/29)
Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives.
Overall rating: very good data (general interest)
This data set consisted of 249 results, using affinity pull-down sample preparation and LC/MS/MS analysis with SILAC quantitation. The data files were made available through ProteomeXchange, PXD000143. It has been published by Spruijt CG, Gnerlich F, Smits AH, Pfaffeneder T, Jansen PW, Bauer C, Münzel M, Wagner M, Müller M, Khan F, Eberl HC, Mensinga A, Brinkman AB, Lephikov K, Müller U, Walter J, Boelens R, van Ingen H, Leonhardt H, Carell T and Vermeulen M, Cell. 2013 Feb 28;152(5):1146-59 (PubMed).
This study utilizes the same methods commonly used to determine protein-protein interactions to determine which proteins have a special affinity for DNA containing 5-methylcytosine, 5-(hydroxy)methylcytosine, 5-formylcytosine and 5-carboxylcytosine. The experiments were performed using mouse embryonic stem cells as the source of potential interactor proteins. The experiments were consistently done and the analysis was of very good quality. The proteins selected showed considerable enrichment of those known to be part of the nucleolus, nucleus, ribonucleoprotein complex, ribosome and spliceosome. This method of sample preparation produced many of the best observations of comparatively rare gene products, such as Pcgf1:p, Aurkc:p and Gm5590:p.
Data set of the week: (2014/6/23)
The Global Phosphoproteome of Chlamydomonas reinhardtii Reveals Complex Organellar Phosphorylation in the Flagella and Thylakoid Membrane.
Overall rating: excellent data (worth study)
This data set consisted of 6 results, using a multi-step phosphopeptide enrichment strategy followed by a multidimensional chromatography separation using a HILIC initial separation and subsequent reversed-phase HPLC. The data files were made available through ProteomeXchange, PXD000783. It has been published by Wang H, Gau B, Slade WO, Juergens M, Li P and Hicks LM, Mol Cell Proteomics. 2014 Jun 10 (PubMed).
Chlamydomonas reinhardtii is a widely used model algae species. It is unicellular with two flagella and it is capable of photosynthesis. The organism is commonly found in the environment and it can be grown under very minimal conditions compared to most eukaryotes. This study used global phosphoproteomics methods to determine how the organism utilizes protein phosphorylation in its metabolic processes. The results showed good enrichment of phosphopeptides (> 70% of identified spectra). The ratio of S:T phosphorylation was a little lower than many other eukaryotes (about 4:1), but the degree of proline-directed phosphorylation detected was noticeably less than normally found in mammalian studies. The data quality was excellent and these spectra would be suitable for developing algorithms or testing computational biology methods for phospho-protein biology.
Data set of the week: (2014/6/15)
A Candida albicans PeptideAtlas.
Overall rating: excellent data (leading the field)
This data set consisted of 148 results, from 16 distinct experiments. The data files were made available through PeptideAtlas, PASS00402, PASS00408, PASS00476, and PASS00447. It has been published by Vialas V, Sun Z, Loureiro y Penha CV, Carrascal M, Abián J, Monteoliva L, Deutsch EW, Aebersold R, Moritz RL, and Gil C, J Proteomics, 2014 Jan 31;97:62-8 (PubMed).
Candida albicans is a fungus that can exist either as single cells or filaments. It is a commensal organism in H. sapiens, occupying the oral cavity and gastrointestinal tract in most of the population. C. albicans can also cause a variety of infections, particularly in oral and genital tissues, in immunocompromised individuals. It belongs to a large group of fungi, the mitosporic Saccharomycetales, that contains many human pathogenic organisms. Unfortunately, these fungi have not had much attention from the proteomics community. This dataset starts to correct this problem, defining the observable peptides and proteins from C. albicans samples under a variety of experimental conditions. The sample preparation and separations were very well done and the mass spectrometry was state-of-the-art.
Data set of the week: (2014/6/7)
Functional annotation of proteome encoded by human chromosome 22; and
A draft map of the human proteome.
Overall rating: excellent data (general interest)
This data set consisted of 84 results, each one a summary of individual LC/MS/MS runs associated with multidimensional chromatography analyses of individual tissue samples. The data files were made available through ProteomeXchange, PXD000561. It has been published by Pinto SM, Manda SS, Kim MS, Taylor K, Selvan LD, Balakrishnan L, Subbannayya T, Yan F, Prasad TS, Gowda H, Lee C, Hancock WS, and Pandey A, J Proteome Res. 2014 Jun 6;13(6):2749-60 (PubMed).
This set of data was one of the first attempts to broadly sample human tissues using similar experimental methods for each sample. It contained some of the first publicly available data for several tissues, in some cases from both fetal and adult samples. Analysis of the data produced numbers of protein identifications typical for the methods used, although the results for some tissues (e.g., liver, heart) were surprisingly variable. Overall, the chromatography and mass spectrometry were well done and consistent between samples. There was considerable variability between the samples with respect to the presence of detectable experimental artifacts caused by the modification of free peptide amines: both N-terminal and lysine side chain amines were either carbamylated or carboxyamidomethylated to a significant extent. These artifacts made the data of limited use for detecting some modifications (particularly acetylation or ubiquitination) or amino acid polymorphisms. Other modifications that were not easily confused with these artifacts were present and available for interpretation. For example, differences in the hydroxyproline distributions on many collagen subunits could be readily observed in different tissues. The phosphorylation states of some common proteins could also be readily observed across multiple tissues.
The Contest: testing large-scale proteomics information systems
In addition to the publication listed above, this data was also the basis for "A draft map of the human proteome", describing the web site www.humanproteomemap.org. The purpose of the web site was to allow researchers to enter a list of gene symbols and then display the relative amount of the associated protein that was detected in each of the tissues examined. The data was analyzed using methods commonly used for small, single LC/MS/MS runs applied to these much larger data sets.
One of the best ways to evaluate this type of informatics system is to perform "sanity" tests to see how well the output of the system corresponds to known patterns of protein expression. Since evaluating this type of system is an important skill for anyone who wants to be involved in large-scale proteomics, we thought it would be an excellent subject for a contest. Two lists of genes were selected to probe the quality and utility of the proteomics information available and the results of querying www.humanproteomemap.org with these lists were downloaded as PDF files:
Everyone with an interest in the subject is invited to take a look at these two results and write a 250-word essay on their implications for the biological, technical and biomedical utility of the web site's information. The best essay will be published on this blog (anonymously if you prefer) and the author will receive a beautiful GPMDB T-shirt. Submit your entries by email to contact@thegpm.org, using the subject line "GPMDB T-Shirt contest". Please stick to the facts as much as possible: sarcasm, irony or ad hominem comments will count against any entry. Entries may be submitted until midnight July 1, 2014. Multiple entries from the same individual are allowed, but the author must clearly identify themselves in the email. The winner will be announced July 7, 2014.
Data set of the week: (2014/5/25)
Virion proteome of Cafeteria roenbergensis virus strain BV-PW1.
Overall rating: excellent data (general interest)
This data set consisted of 10 results, made up of 9 gel bands and a summary set of identifications. The data files were made available through ProteomeXchange, PXD000993. It has not yet been published, but was submitted by Matthias Fischer (Max Planck Institute for Medical Research) and Leonard Foster (University of British Columbia).
This elegant dataset neatly wraps up the preliminary work on the proteome of a recently discovered nucleocytoplasmic large DNA virus, the Cafeteria roenbergensis virus. The host species, Cafeteria roenbergensis, is a marine flagellate that consumes bacteria in coastal water. The virus has a very large genome of about 730,000 base pairs of dsDNA and 1,096 predicted proteins. The virus is also large enough that it can be infected by a virophage, the Mavirus. Not only is the virus biologically interesting, but the data is one of the best we've run across for testing peptide identification algorithms and the theory behind them. The chromatography and mass spectrometry were both very well done and the spectra are ideal for detecting common artifactual modifications that can be masked by dodgy experimental technique, such as deamidation and peptide N-terminus cyclization. It is also useful for trying to understand how to think about the problem of balancing sensitivity versus selectivity and false positive versus false negative assignments.
Peptides galore (2014/5/25)
Yesterday GPMDB registered its 1,500,000,001st peptide identification. We would like to take the achievement of this self-imposed, completely arbitrary milestone as an opportunity to announce the availability of a new service that allows users to obtain the complete complement of identified peptides for individual species at http://peptides.thegpm.org. These peptide lists are currently available for 16 eukaryote, 17 virus and 18 eu-/archae-bacterial species. The peptide sequences are provided as a tab-separated value spreadsheet, with information about the number of observations, charge states and minimum observed E-values for the peptides. The peptides were assigned to the species in question: there has been no attempt to map peptides identified in one species onto another species' proteins. The lists are obtained directly from GPMDB, so the peptide distributions may change as new information is registered during the daily incremental build of the database.
Data set of the week: (2014/5/16)
High-Confidence Glycosome Proteome for Procyclic Form Trypanosoma brucei by Epitope-Tag Organelle Enrichment and SILAC Proteomics.
Overall rating: very good data (specialist interest)
This data set consisted of 154 results, obtained from LC/MS/MS analyses of procyclic trypanosomes using SILAC for quantitation. The data files were made available through ProteomeXchange, PXD000663. It has been published by Güther ML, Urbaniak MD, Tavendale A, Prescott A and Ferguson MA, J Proteome Res. 2014 May 13 (PubMed).
The trypanosome Trypanosoma brucei is a human pathogen that causes African trypanosomiasis (sleeping sickness). It is a eukaryote with a complex life cycle involving insect and mammalian hosts. The procyclic form that is the focus of this study is found in the midgut of the insect host. The data here gives excellent insight into the procyclic glycosome — a subcellular organelle that contains glycolytic enzymes and glycogen. The analysis of the data showed significant and variable degrees of peptide carbamylation, an increasingly common problem as more groups adopt the FASP sample preparation method.
Network outage (2014/5/15)
The internet provider for one of our data centres changed the range of IP addresses associated with many of our services last night, without prior notification. The necessary updates to our DNS servers have been made, but it can take up to 24 hours for these changes to fully penetrate the global DNS system. This long update period means that some of our servers may appear to be off-line until tomorrow for some users.
Data set of the week: (2014/5/10)
TCGA Breast Cancer Characterization.
Overall rating: excellent data (leading the field)
This data set consisted of 1265 results, obtained from multidimensional chromatography LC/MS/MS analyses of breast tissue samples, collected for the parallel The Cancer Genome Atlas (TCGA) study. The data files were made available at the CPTAC data portal. The results have not yet been published, but they were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).
The Clinical Proteomic Tumor Analysis Consortium, sponsored by the US National Cancer Institute, has been developing the methodology for the reproducible proteomics analysis of tumor tissue for about 8 years. The studies that they have released have demonstrated an increasingly nuanced approach to the problems associated with technologies involved in this type of analysis. This new data set is the best one to date, showing a degree of depth and reproducibility that no other group has been able to achieve in proteomics. Given the questionable state of the associated genome mapping project, these proteomics results (and the previous colorectal tumor data set) are probably the most valuable output from the TCGA sample collection process to date.
Data set of the week: (2014/4/14)
Comparison of two phenotypically distinct lattice corneal dystrophies caused by mutations in the transforming growth factor beta induced (TGFBI) gene.
Overall rating: very good data (specialist interest)
This data set consisted of 10 results, obtained from LC/MS/MS analyses of tissue samples. The data files were made available through ProteomeXchange, PXD000307. It has been published by Poulsen ET, Runager K, Risør MW, Dyrlund TF, Scavenius C, Karring H, Praetorius J, Vorum H, Otzen DE, Klintworth GK and Enghild JJ, Proteomics Clin Appl. 2013 Dec 2 (PubMed).
This data provides some of the best insight to date on the major components of human corneal tissue. The tissue sampling and experimental workflow produced very good reproducibility in the lists of detected peptides and proteins. Any group interested in the major proteins present in the cornea or their post-translational modifications should study this data in depth prior to performing their own experiments.
April 2014 Editions of the Mouse and Human Proteome Guides (2014/4/2)
The latest editions of the Guide to the Human Proteome and the Guide to the Mouse Proteome have been released and are available for download and use. Both are available in HTML, CSV (comma-separated value) or XLS (Excel spreadsheet) formats.
The following chart shows the status of Homo sapiens protein-coding splice variants in the current Guide to the Human Proteome:
The histogram bars are stacked plots of the fraction of protein-coding splice variants observed on each chromosome and the colors represent the splice variant sequences classified by evidence code. Black (EC 1) indicates the fraction of splice variants for which there have been no peptides observed for a splice variant sequence with an E-value ≤ 0.01. Red (EC 2) is the fraction of variants with at least one peptide observed with an E-value ≤ 0.01. Yellow (EC 3) is the fraction with at least one peptide that has been observed multiple times and those observations pass one of two tests for deterministic behavior. Green (EC 4) is the fraction where at least one peptide has been observed multiple times and passes both tests for deterministic behavior.
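The evidence code rules described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the code used to build the Guides: the class and function names are hypothetical, the two tests for deterministic behavior are not specified in this note and are treated as pre-computed true/false inputs, and "observed multiple times" is assumed to mean two or more observations with an E-value ≤ 0.01.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PeptideEvidence:
    """Summary of one peptide assigned to a splice variant (hypothetical structure)."""
    observations: int    # assumed: number of observations with E-value <= 0.01
    passes_test_a: bool  # assumed: result of the first deterministic-behavior test
    passes_test_b: bool  # assumed: result of the second deterministic-behavior test


def peptide_code(peptide: PeptideEvidence) -> int:
    """Return the highest evidence code that a single peptide can support."""
    if peptide.observations < 1:
        return 1  # no qualifying observation
    if peptide.observations < 2:
        return 2  # observed, but not multiple times
    tests_passed = int(peptide.passes_test_a) + int(peptide.passes_test_b)
    if tests_passed == 2:
        return 4  # observed multiple times, passes both tests
    if tests_passed == 1:
        return 3  # observed multiple times, passes one of the two tests
    return 2      # observed multiple times, passes neither test


def evidence_code(peptides: Iterable[PeptideEvidence]) -> int:
    """Evidence code for a splice variant: the best code among its peptides."""
    return max((peptide_code(p) for p in peptides), default=1)


# Example: one well-behaved peptide is enough to reach EC 4.
variant = [PeptideEvidence(1, False, False), PeptideEvidence(12, True, True)]
print(evidence_code(variant))
```

Taking the maximum over peptides reflects the "at least one peptide" wording used for each of the colored categories.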
The same plot from the Guide to the Mouse Proteome shows that while the overall number of splice variant assignments for mouse is lower, the same general trends are present:
Data set of the week: (2014/4/2)
In vivo SILAC-based proteomics reveals phosphoproteome changes during mouse skin carcinogenesis.
Overall rating: very good data (specialist interest)
This data set consisted of 315 results, obtained from SDS-PAGE gel bands, metal-oxide affinity fractionation and multi-dimensional LC/MS/MS analyses using SILAC quantitation. The data files were made available through ProteomeXchange, PXD000821. It has been published by Zanivan S, Meves A, Behrendt K, Schoof EM, Neilson LJ, Cox J, Tang HR, Kalna G, van Ree JH, van Deursen JM, Trempus CS, Machesky LM, Linding R, Wickström SA, Fässler R, and Mann M, Cell Rep. 2013 Feb 21;3(2):552-66 (PubMed).
The data associated with this study provides some of the best evidence about the proteins present in Mus musculus skin tissue. Skin is an under-studied tissue in proteomics, even though it is abundant, relatively easy to sample and clinically important. The data from this study showed good reproducibility and attention to detail in both the sample preparation and chromatography. The analysis in the manuscript was significantly flawed because of a failure to consider the modifications present in collagen (the most abundant protein in skin), but that does not take away from the value of the data itself as a good example of what can be observed from skin tissue.
Data set of the week: (2014/3/23)
Coordinated activation of PTA-ACS and TCA cycles strongly reduces overflow metabolism of acetate in Escherichia coli.
Overall rating: excellent data (leading the field)
This data set consisted of 10 results, obtained from LC/MS/MS analysis. The data files were made available through ProteomeXchange, PXD000556. It has been published by Peebo K, Valgepea K, Nahku R, Riis G, Oun M, Adamberg K and Vilu R, Appl Microbiol Biotechnol. 2014 Mar 15 (PubMed).
The proteomics group at the Competence Center of Food and Fermentation Technologies at the Tallinn University of Technology has been one of the top performers in terms of data quality for several years and they do not disappoint with this data set. This group has developed into one of the few labs that can genuinely produce results demonstrating high run-to-run reproducibility in the analysis of complex samples. This set of MS/MS analyses would be an excellent choice for any group interested in the practical limits associated with replicate analysis in proteomics.
Data set of the week: (2014/3/13)
Functional analysis of novel Rab GTPases identified in the proteome of purified Legionella-containing vacuoles from macrophages.
Overall rating: excellent data (leading the field)
This data set consisted of 120 results, obtained from LC/MS/MS analysis of excised SDS-PAGE gel bands. The data files were made available through ProteomeXchange, PXD000647. It has been published by Hoffmann C, Finsel I, Otto A, Pfaffinger G, Rothmeier E, Hecker M, Becher D and Hilbi H, Cell Microbiol. 2013 Dec 26 (PubMed).
This well-planned and well-executed study examines the host effects associated with the opportunistic pathogen Legionella pneumophila, which causes the life-threatening pneumonia commonly referred to as Legionnaires' disease. These experiments focus on understanding the "Legionella-containing vacuole", a structure formed by the organism in the host cell that is used to facilitate replication. By isolating these vacuoles in two very different eukaryotic systems (Mus musculus and the amoeboid form of Dictyostelium discoideum), the study was able to demonstrate which systems the pathogen uses to form and maintain this structure. The proteomics data is of excellent quality and would be ideal to use as a case study in multi-species data analysis and biological interpretation.
Data set of the week: (2014/3/5)
Thirty-thousand-year-old distant relative of giant icosahedral DNA viruses with a pandoravirus morphology.
Overall rating: very good data (specialist interest)
This data set consisted of 1 result, obtained from a single LC/MS/MS analysis. The data files were made available through ProteomeXchange, PXD000460. It has been published by Legendre M, Bartoli J, Shmakova L, Jeudy S, Labadie K, Adrait A, Lescot M, Poirot O, Bertaux L, Bruley C, Couté Y, Rivkina E, Abergel C and Claverie JM, Proc Natl Acad Sci U S A. 2014 Mar 3 (PubMed).
This study involves the characterization of the giant virus Pithovirus sibericum, a 1.5 micron long amphora-shaped virion. The virus was isolated from a 30,000-year-old sediment and grown in Acanthamoeba castellanii. The virus has a 600 kilobase genome, with open reading frames for approximately 2,500 proteins. The proteomics data made available was obtained from isolated virion particles, with 70 identified proteins from the host A. castellanii and 193 proteins from P. sibericum. While the peptides showed a considerable amount of non-tryptic cleavage, the mass spectrometry and chromatography were both very well done.
Changing the naming convention for amino acid polymorphisms (2014/2/4)
GPM and GPMDB have been acquiring information about amino acid polymorphisms since the system began operating in 2004. The process accelerated significantly with the introduction of dbSNP annotation information in 2006. Nucleotide polymorphism research has advanced tremendously during this period, to the point that the original term "polymorphism" no longer accurately describes the phenomena being studied. The term SNP has been largely replaced by SNV (Single Nucleotide Variant) to reflect the changes in the field. To keep up with these changes, GPM and GPMDB will be altering all references to SNPs and SNAPs (Single Nucleotide-induced Amino acid Polymorphisms) to SNVs and SAVs (Single Amino acid Variants). The name of the server used to provide the interface for GPMDB's collected SAV information will remain stable at snap.thegpm.org, but the alias sav.thegpm.org will be added for forward compatibility.
System updates for GPMDB's 10th anniversary (2014/2/4)
GPMDB had its tenth anniversary of operation on Jan. 1, 2014: the public interface was first made available on Jan. 1, 2004. The overall success of the project has made it necessary to invest in updating the hardware and software resources that run GPMDB on a daily basis. Today marks the end of this upgrade cycle, with the successful addition of a new, faster server for processing incoming data files into database entries. The following items have been added or upgraded during the process:
  1. a new server has been added, dedicated to processing REST information requests (rest.thegpm.org);
  2. 30 TB of disk storage has been added to the system, allowing for significantly greater data volume and backup capabilities;
  3. new solid-state drives have been added to the publicly available system, to increase capacity, speed up queries and reduce cost;
  4. the data file processing server has been replaced, with a tested capacity of > 6 billion new identifications per year;
  5. 30 GB of memory has been added to the pool available for user queries; and
  6. all software platforms (e.g., Perl, MySQL) have been updated to the latest stable versions available.
Data set of the week: (2014/2/2)
Glomerular Cell Cross-Talk Influences Composition and Assembly of Extracellular Matrix.
Overall rating: very good data (specialist interest)
This data set consisted of 180 results, each an LC/MS/MS analysis of an SDS-PAGE gel band. The data files were made available through ProteomeXchange, PXD000643. It has been published by Byron A, Randles MJ, Humphries JD, Mironov A, Hamidi H, Harris S, Mathieson PW, Saleem MA, Satchell SS, Zent R, Humphries MJ, and Lennon R, J Am Soc Nephrol. 2014 Jan 16 (PubMed).
This study focussed on an often ignored tissue compartment: the clumsily-named "extracellular matrix". While this essential component of tissue varies widely in protein composition and it is essential for organ function, the proteomics community has spent comparatively little effort characterizing the associated tissue-specific proteomes. The data reported in this manuscript uses standard methods to investigate the proteins of glomerular extracellular matrix, providing a good insight into its composition.
Data set of the week: (2014/1/26)
Proteomic analysis of purified protein derivative of Mycobacterium tuberculosis.
Overall rating: very good data (general interest)
This data set consisted of 1 result, a single injection LC/MS/MS experiment. The data file was made available through ProteomeXchange, PXD000377. It has been published by Prasad TS, Verma R, Kumar S, Nirujogi RS, Sathe GJ, Madugundu AK, Sharma J, Puttamallesh VN, Ganjiwale A, Myneedu VP, Chatterjee A, Pandey A, Harsha H, and Narayana J, Clin Proteomics. 2013 Jul 19;10(1):8 (PubMed).
While many published 'omics studies focus on the heroic collection of large volumes of data, this study is more of a haiku: a quiet reflection on an important clinical material. By limiting the study to simply looking at the real composition of "Purified Protein Derivative" (the antigenic material used for the tuberculosis skin test), the authors both clearly demonstrate the power of the now-routine techniques employed and raise the question of why this type of analysis is not available for every batch of this product used clinically.
Data set of the week: (2014/1/19)
Comparative Proteome Analysis Revealing an 11-Protein Signature for Aggressive Triple-Negative Breast Cancer.
Overall rating: excellent data (leading the field)
This data set consisted of 126 results, each one a 3-hour gradient LC/MS/MS experiment from laser microdissected samples. The data files were made available through PeptideAtlas, PASS00260. It has been published by Liu NQ, Stingl C, Look MP, Smid M, Braakman RB, De Marchi T, Sieuwerts AM, Span PN, Sweep FC, Linderholm BK, Mangia A, Paradiso A, Dirix LY, Van Laere SJ, Luider TM, Martens JW, Foekens JA and Umar A, J Natl Cancer Inst. 2014 Jan 7 (PubMed).
This study represents probably the best clinical proteomics data set obtained from laser microdissection samples. The starting material used in each analysis was approximately 4,000 human breast cancer epithelial cells removed from frozen tissue samples. The resulting sets of spectra and identifications were surprisingly consistent, producing very few experimental artifacts as well as excellent reproducibility of the HPLC-MS profiles and peptide identifications. Any group interested in trying to find rare post-translational modifications or single amino acid variants, or simply in understanding the limits of reproducibility in this type of experiment, should consider using this data set.
Data set of the week: (2014/1/10)
Interaction proteome of human Hippo signaling: modular control of the co-activator YAP1.
Overall rating: very good data (general interest)
This data set consisted of 96 results, corresponding to protein-protein interaction pull-down experiments. The data files were made available through PeptideAtlas, PASS00281. It has been published by Hauri S, Wepf A, van Drogen A, Varjosalo M, Tapon N, Aebersold R and Gstaiger M, Mol Syst Biol. 2013 Dec 22;9(1):713 (PubMed).
This study uses now-conventional methods for establishing the protein-protein interaction partners for proteins involved in the Hippo pathway. This pathway regulates the size of many internal organs/tissues by exerting control over the rates of apoptosis and cell proliferation. The experiments were well done and the results showed many very good identifications of both rare and common proteins. This data set would make an excellent choice for developing a workshop or course project on the analysis of protein-protein interaction results.
Copyright © 2014, The Global Proteome Machine Organization