The GPM News Archive, 2010

The Global Proteome Machine Organization

News Archive


Data sets of the year: (2010/12/26)
Technical, Biological and Clinical.

This week we are awarding the title "Data set of the year" to three outstanding examples of publicly available proteomics experimental data. These awards are in three categories:

Technical data: Nagaraj N, et al.
Feasibility of large scale phosphoproteomics with HCD fragmentation.
This data is the most convincing evidence yet that HCD is in the process of revolutionizing the experimental and instrumental requirements for top quality proteomics.
Biological data: Merrihew GE, et al.
Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations.
A clear demonstration that the interaction between protein-level and genome-level experimental data is valuable, even in very well studied model species.
Clinical data: Drake RR, et al.
In-depth proteomic analyses of direct expressed prostatic secretions.
The very model of a well-designed proteomics clinical study.

Update of the M. musculus proteome (2010/12/20)

The ENSEMBL proteome for M. musculus (mouse) has been updated to the latest version (ENSEMBL v.60, NCBIm37). The proteome is broken down into an individual file for each chromosome (1-19, X, Y, MT and other), each of which is also available by FTP. This division of protein sequences makes it possible to create individual proteomes for male and female, which can be searched individually on any of the GPM public servers.

The individual Annotated Spectrum Libraries for each human and mouse chromosome have been made available in the human_chromosomes and mouse_chromosomes directories of the GPM FTP site. The P3 algorithm peptide contigs are also available via FTP.

Data set of the week: (2010/12/19)
An Expanded Oct4 Interaction Network: Implications for Stem Cell Biology, Development, and Disease.

This study contains 7 LC/MS/MS runs, from pull-down experiments. The manuscript describing this work was published by Pardo M, Lang B, Yu L, Prosser H, Bradley A, Babu MM, and Choudhary J, Cell Stem Cell. 2010 6:382-95 (PubMed).

This study contains very high-quality pull-down results that represent rarely observed Mus musculus proteins and peptides. Unfortunately, the original data was not made publicly available: only spectra that resulted in identifications were stored in PRIDE. Hopefully the authors will make the original data available at some point so that a more thorough analysis can be performed.

Nota bene: In looking through these results, some may notice that there was no observation of a protein named "Oct4". This seemingly odd fact is due to the confusing nature of protein naming: "Oct4" is not a currently accepted name for any mouse protein. The current name for that gene product is "Pou5f1" (POU domain, class 5, transcription factor 1), corresponding to ENSMUSP00000025271. Inspection of the current observations shows clearly that this protein has been over-represented in samples coming from mouse embryonic stem cells.

Update of the H. sapiens proteome (2010/12/16)

The ENSEMBL proteome for H. sapiens (human) has been updated to the latest version (ENSEMBL v.60, GRCh37.p2). For the first time, the proteome is broken down into an individual file for each chromosome (1-22, X, Y, MT and other), each of which is also available by FTP. This division of protein sequences makes it possible to create individual proteomes for male and female, which can be searched individually on any of the GPM public servers.
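
As a rough illustration of how the per-chromosome files could be combined into sex-specific proteomes, the sketch below lists which chromosome files to include. It assumes that a "female" proteome simply omits the Y chromosome file; the file name pattern `chr_{name}.fasta` is a hypothetical placeholder, not the actual naming used on the GPM FTP site.

```python
def chromosome_names(autosome_count, sex):
    """Return chromosome labels for a sex-specific proteome.

    sex is "female" (Y omitted, an assumption) or "male" (all files kept).
    """
    if sex not in ("female", "male"):
        raise ValueError("sex must be 'female' or 'male'")
    names = [str(i) for i in range(1, autosome_count + 1)]
    names += ["X"] if sex == "female" else ["X", "Y"]
    names += ["MT", "other"]
    return names


def proteome_files(autosome_count, sex, pattern="chr_{name}.fasta"):
    # The file name pattern is a placeholder; substitute the real file names.
    return [pattern.format(name=n) for n in chromosome_names(autosome_count, sex)]
```

Calling `proteome_files(22, "female")` would give the human file list and `proteome_files(19, "male")` the mouse one; the selected files can then be concatenated into a single FASTA sequence list.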

Increased search capacity for "human" (2010/12/14)

Because of an increase in utilization of the "human" search server (human.thegpm.org), we have tripled the number of CPUs available for searches using the FASTA sequence lists available from that server.

Data set of the week: (2010/12/12)
Nucleosome-interacting proteins regulated by DNA and histone methylation.

This study contains 160 LC/MS/MS runs, grouped into sets of SDS-PAGE bands. The manuscript describing this work was published by Bartke T, Vermeulen M, Xhemalce B, Robson SC, Mann M, and Kouzarides T, Cell 2010 143:470-84 (PubMed).

This work demonstrates the extent to which SILAC quantitation has become a mainstream technique in molecular biology. The study addresses a biologically important question, uses an excellent lab to perform the proteomics instrumental analysis and applies straightforward, established informatics methods to interpret the proteomics data in the context of the biological question.

Data set of the week: (2010/12/05)
Comparative shotgun proteomics using spectral count data and quasi-likelihood modeling.

This study contains 153 LC/MS/MS runs, grouped into sets of MudPit experiments. The analysis for each individual LC/MS/MS and summaries of the MudPit runs were recorded. The manuscript describing this work was published by Li M, Gray W, Zhang H, Chung CH, Billheimer D, Yarbrough WG, Liebler DC, Shyr Y, and Slebos RJ, J Proteome Res. 2010 9:4295-305 (PubMed).

While this set of data was generated for a specific statistical study, it also represents a very good resource for anyone interested in the study of signal analysis, bioinformatics or statistics as they relate to proteomics experimental analysis. The tissues selected were of clinical interest (head and neck carcinomas), the equipment was state-of-the-art and the experimental groups involved were first rate. Many data sets generated for bioinformatics analysis are not really representative of current best laboratory practices, but this one genuinely exceeds expectations.

Full database dumps available on TRANCHE (2010/12/03)

Periodically, we make the complete set of databases that make up GPMDB available on TRANCHE. The version now available was this week's backup, dated 2010/11/28, which you can download from this link. The FTP version of these files is also available from the GPM FTP site here. The files were generated using the mysqldump utility and they can be used to generate the three databases used to create GPMDB displays:

GPMDB - the main record of the proteomics data;
ENSPMAPDB - protein accession numbers and descriptions; and
PEAKDB - the database used to generate MRM and spectrum library information.

New GPMDB Wiki server (2010/11/30)

We are in the process of moving the GPMDB Wiki from its old server to a new one. The new server should be considerably faster than the old one and it has some features that make it easier for us to work with. It may take a day or two for the Internet DNS system to catch up with this change, so if you are adding material to the Wiki, please wait until Thursday to make your changes.

Data set of the week: (2010/11/28)
Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics.

This study contains 28 tissue sample data sets. The manuscript describing this work was published by Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, and Baginsky S, Science 2008 320:938-41 (PubMed).

This work is still probably the most comprehensive proteomics study of Arabidopsis thaliana tissues available. Each of the individual samples corresponds to > 9,000 peptide identifications and > 1,000 non-redundant protein identifications. It can be used as a reliable catalogue of observable peptides and proteins for the corresponding A. thaliana tissues and cell-culture samples.

Data set of the week: (2010/11/21)
Prioritization of candidate protein biomarkers from an in vitro model system of breast tumor progression toward clinical verification.

This study contains 5 individual LC/MS/MS runs. The manuscript describing this work was published by Lau TY, Power KA, Dijon S, de Gardelle I, McDonnell S, Duffy MJ, Pennington SR, and Gallagher WM, J Proteome Res. 9(3):1450-9 (PubMed).

The data is a good example of what can be achieved using a QTOF-style instrument for analyzing gel bands. The relatively good resolution obtained on the fragment ions makes peptide identifications more positive (FDR ≈ 0.1%) and generally improves the confidence of the resulting protein identifications. The approach used in the paper has some merit for determining the suitability of proteins as biomarkers, although much of the comparative work could have been done using existing databases of observable plasma and serum proteins.

GPMDB Guide to the Mouse Proteome (2010/11/17)

In response to the positive reception that the GPMDB Guide to the Human Proteome has received, we have constructed a similar set of spreadsheets for the Mus musculus proteome. You can access the GPMDB Guide to the Mouse Proteome as either an Excel spreadsheet or using a set of web pages that contain the same information, with some additional hyperlinks to the original data used to construct the Guide.

Data set of the week: (2010/11/14)
Proteomic Analysis of Human Nail Plate.

This study contains 40 individual LC/MS/MS runs. The manuscript describing this work was published by Rice RH, Xia Y, Alvarado RJ, and Phinney BS, J Proteome Res. 2010 Nov 1 (Epub ahead of print, PubMed).

The data investigates the proteins present in two common but sparsely investigated human tissues: hair and nail plate. These non-cellular tissues are composed mainly of high-sulphur (hard) keratins and keratin-associated proteins in different proportions. These proteins are unusually abundant on Chromosome 17, with more than 60 genes clustered between chromosome coordinates 38,810,917-39,780,829 (see the Human Proteome Guide for the gene names, positions and frequency of observation).

System problems because of power failure (2010/11/07)

A power failure at one of our main sites may lead to problems accessing some features of GPM until tomorrow morning (Pacific Time), Nov. 8.

Data set of the week: (2010/11/07)
Proteomic screen defines the Polo-box domain interactome and identifies Rock2 as a Plk1 substrate.

This study contains 24 individual result sets derived from SDS-PAGE gel bands. The manuscript describing this work was published by Lowery DM, Clauser KR, Hjerrild M, Lim D, Alexander J, Kishi K, Ong SE, Gammeltoft S, Carr SA, and Yaffe MB in EMBO J. 2007 26:2262-73 (PubMed).

This study demonstrates the power of protein affinity methods for enriching relatively rare, but biologically important proteins. The result sets contain many of the best identifications observed for proteins such as GRIPAP1, ROCK2, ANLN, EPB41L3, CLIP2 and the minichromosome maintenance complex. The methodology used here was relatively simple, but it revealed an interesting, high quality interactome that will take years of biological research to thoroughly investigate and understand.

Data set of the week: (2010/10/31)
Genome analysis and genome-wide proteomics of Thermococcus gammatolerans, the most radioresistant organism known amongst the Archaea.

This study contains 7 individual result sets; each set is the union of all spectra collected from a single SDS-PAGE gel. The manuscript describing this work was published by Zivanovic Y, Armengaud J, Lagorce A, Leplat C, Guérin P, Dutertre M, Anthouard V, Forterre P, Wincker P, and Confalonieri F. in Genome Biol. 2009;10(6):R70 (PubMed).

This study was a straightforward analysis of the proteome of a previously unexamined archaeon, T. gammatolerans. What set this study apart was the level of competence displayed by the research team in obtaining this data. The methodology used was straightforward, but they were able to consistently generate spectra good enough so that ~50% of the spectra resulted in high quality identifications. Generally, this type of strategy results in high levels of human keratins 1, 2, 9 and 10 identified, but not in this case. The data corresponded to >1000 T. gammatolerans proteins, with the largest of the individual gel sets having >60,000 identified peptides.

Data set of the week: (2010/10/24)
Feasibility of large scale phosphoproteomics with HCD fragmentation.

This study contains 25 individual samples, contrasting two methods for phosphopeptide detection. The manuscript describing this work was published by Nagaraj N, D'Souza RC, Cox J, Olsen JV, and Mann M in J. Proteome Res. 2010 (Epub ahead of print, PubMed).

This data set is a major game-changer for any group interested in high-throughput phosphopeptide detection. The combination of HCD fragmentation with high accuracy parent and fragment ion mass measurement described in the associated publication results in a level of sequence and PTM assignment accuracy that simply cannot be matched by the conventional CID approach using a low accuracy LTQ for fragment ion analysis. It is also clearly superior to ETD for high throughput phosphoproteomics: the physical chemistry of ETD makes it much better suited to the detailed characterization of difficult cases rather than broad surveys of large mixtures.

Data set of the week: (2010/10/17)
Coupled global and targeted proteomics of human embryonic stem cells during induced differentiation.

This data set contains 18 sample analyses. The manuscript describing this work was published by Yocum AK, Gratsch TE, Leff N, Strahler JR, Hunter CL, Walker AK, Michailidis G, Omenn GS, O'Shea KS, and Andrews PC in Mol Cell Proteomics 2008 7:750-67 (PubMed).

This study utilizes MALDI TOF-TOF technology to provide an excellent survey of proteins in embryonic stem cells. While MALDI has become a secondary ionization method compared with electrospray, it is still a robust method for protein identification and it provides the most reliable source for library spectra of singly charged peptide ions.

Data set of the week: (2010/10/10)
Glycosylation signatures in Drosophila: fishing with lectins.

This data set contains 1 LC/MS/MS result. The manuscript describing this work was published by Vandenborre G, Van Damme EJ, Ghesquière B, Menschaert G, Hamshou M, Rao RN, Gevaert K, and Smagghe G. in J Proteome Res. 2010 9:3235-42 (PubMed).

A carefully selected set of lectins was used to purify glycoproteins by affinity capture from Drosophila melanogaster samples. The results show that this method was able to obtain an unusually high quality set of identifications for proteins of this species, as demonstrated by the very large fraction of "best ever" identifications for the proteins reported. The peptides identified also show significantly more chymotryptic peptide cleavage than would be typical for such a study.

Maintenance outage at UK ENSEMBL (2010/10/05)

Because of a maintenance issue at the UK ENSEMBL site, GPM has switched to using the USWest ENSEMBL site until the problem in England has been resolved. Operationally, the USWest site is a complete mirror of the UK site, so no changes in annotation or functionality should be caused by this switch-over.

Data set of the week: (2010/10/03)
Global analysis of lysine ubiquitination by ubiquitin remnant immunoaffinity profiling.

This data set contains 1 LC/MS/MS result. The manuscript describing this work was published by Xu G, Paige JS, and Jaffrey SR in Nat Biotechnol. 2010 28:868-73 (PubMed).

This data was obtained from a very interesting study that describes the utility of an immunoaffinity method for purifying the peptides generated by the trypsin digest of proteins that have N-lysyl-ubiquitination. Trypsin cleaves away most of the ubiquitin bound to the lysine sidechain, leaving a Gly-Gly sequence attached. By generating an antibody that was specific for this type of modified lysine sidechain, they were able to isolate peptides from ubiquitinated proteins. This purification allowed them to overcome the large concentration ratio between the modified and unmodified proteins that has made identifying this type of modification difficult in the past. The availability of this antibody should make many interesting studies of the ubiquitin-mediated protein degradation pathway possible.
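
For readers who search for this modification, the mass added to a lysine residue by the Gly-Gly remnant can be worked out from elemental composition. The sketch below assumes the remnant is attached through an isopeptide bond, so that the net addition equals two glycine residues; the monoisotopic element masses are standard reference values, not figures taken from the study.

```python
# Monoisotopic masses of the elements (Da); standard reference values.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Elemental composition of one glycine residue (glycine minus water).
GLYCINE_RESIDUE = {"C": 2, "H": 3, "N": 1, "O": 1}


def composition_mass(composition):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONOISOTOPIC[element] * n for element, n in composition.items())


def glygly_remnant_mass():
    """Mass added to a lysine side chain by a Gly-Gly remnant (assumed: two Gly residues)."""
    return 2 * composition_mass(GLYCINE_RESIDUE)
```

Under these assumptions the remnant corresponds to a modification of roughly 114.043 Da on lysine, which is the value a search engine would need as a potential modification.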

Data set of the week: (2010/09/26)
The Asia Oceania Human Proteome Organisation Membrane Proteomics Initiative. Preparation and characterisation of the carbonate-washed membrane standard.

This data set contains 2 LC/MS/MS results. The manuscript describing this work was published by Peng L, Kapp EA, Fenyö D, Kwon MS, Jiang P, Wu S, Jiang Y, Aguilar MI, Ahmed N, Baker MS, Cai Z, Chen YJ, Van Chi P, Chung MC, He F, Len AC, Liao PC, Nakamura K, Ngai SM, Paik YK, Pan TL, Poon TC, Hosseini Salekdeh G, Simpson RJ, Sirdeshmukh R, Srisomsap C, Svasti J, Tyan YC, Dreyer FS, McLauchlan D, Rawson P, and Jordan TW. in Proteomics. 2010 May 18 (PubMed).

This study, the result of a HUPO-affiliated AOHUPO project, demonstrates the effectiveness of a standardized, relatively simple protocol for the enrichment of membrane proteins. A quick inspection of the GO displays for unwashed and carbonate washed samples proves this point very nicely. Many groups still seem to believe that membrane proteins are difficult to observe using proteomics methods, so a straightforward study such as this one demonstrating the contrary is a welcome addition to the field and an excellent subject for a HUPO study.

Data set of the week: (2010/09/19)
Plasticity and robustness of protein patterns during reversible development in the honey bee (Apis mellifera).

This data set contains 23 LC/MS/MS results. The original data was obtained from Peptidome (Study PSE129). The manuscript describing this work was published by Wolschin F, and Amdam GV in Anal Bioanal Chem. 2007 389:1095-100 (PubMed).

This study explores the protein levels associated with behavioral differences in honeybees. Apis mellifera is an economically important species with a complete genome, but it has received only limited attention from the proteomics community. Fortunately, bee proteomics scientists have been very active in contributing their data to public repositories. Inspection of the list of all A. mellifera proteins in GPMDB shows that more than 2450 proteins have been observed and a surprising number of them have been observed more than 500 times.

Data set of the week: (2010/09/12)
Identification of pathways associated with invasive behavior by ovarian cancer cells using multidimensional protein identification technology (MudPIT).

This data set contains 252 LC/MS/MS results. The original data was obtained from TRANCHE. The manuscript describing this work was published by Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, Ringuette MJ, Jurisica I, and Kislinger T. in Mol Biosyst. 2008 4:762-73 (PubMed).

This study contains probably the best information set for the detailed exploration of proteomics as a reproducible technology. Six different ovarian cancer cell lines were examined, each of which is analyzed in six replicates, each replicate containing six SCX fractions. While this study was designed to explore the differences between these cell lines, it also affords a truly useful collection of data for anyone interested in proteomics sample preparation reproducibility, measurement undersampling, search engine effectiveness, peak finding efficacy or any other aspect of proteomics data generation and handling.

The GPM results are grouped according to cell line replicates, with each replicate having six entries corresponding to the individual SCX fractions, followed by a summary result generated from those six analyses. A description containing a statement like "Data directory: SKOV_5" indicates that the result was obtained from replicate "5" of cell line "SKOV".
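
The grouping described above can be handled programmatically. The following is a minimal sketch: the regular expression is a guess at the description format based only on the single "Data directory: SKOV_5" example, and the function names are illustrative rather than part of any GPM interface.

```python
import re

# Assumed format: "Data directory: <cell line>_<replicate number>".
_DATA_DIR = re.compile(r"Data directory:\s*([A-Za-z0-9]+)_(\d+)")


def parse_data_directory(description):
    """Return (cell_line, replicate) from a result description, or None if absent."""
    match = _DATA_DIR.search(description)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def expected_result_count(cell_lines=6, replicates=6, fractions=6):
    # Each replicate contributes its fraction results plus one summary result.
    return cell_lines * replicates * (fractions + 1)
```

With the defaults, `expected_result_count()` gives 6 x 6 x (6 + 1) = 252, which matches the number of results listed for this data set.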

1,500,000,000 spectra analyzed (2010/09/07)

Some time yesterday (Sept. 6, 2010) the Global Proteome Machine processed its 1,500,000,000th spectrum. We would like to thank all of the direct contributors to this project, as well as all of the investigators who have made their data available through TRANCHE, PRIDE and Peptidome.

Data set of the week: (2010/09/05)
A quantitative proteomics design for systematic identification of protease cleavage events.

This data set contains three (3) COFRADIC analyses (COmbined FRActional DIagonal Chromatography). The original manuscript describing this work was published by Impens F, Colaert N, Helsens K, Ghesquiere B, Timmerman E, De Bock PJ, Chain BM, Vandekerckhove J, and Gevaert K in Mol Cell Proteomics. 2010 Jul 13 (PubMed).

The study demonstrates a relatively straightforward method for determining the cleavage specificity of proteolytic enzymes. The data analysis technique used in the original paper is somewhat complex, but the more flexible modes of analysis available in the GPM simplified the process considerably. Simple inspection of the AAA display allows the assignment of the appropriate cleavage specificities for the enzymes.

xtandem-parser available (2010/09/01)

A novel parser to display X! series data output has been developed by Thilo Muth, Marc Vaudel, Harald Barsnes, Lennart Martens, and Albert Sickmann and published in Proteomics 2010, 10:1522-4 (PubMed). This parser allows the user to browse through results files and generate annotated diagrams of the individual spectra that support particular peptide sequence assignments. It is a nice, cross-platform implementation and doesn't require any additional software (other than Java) to run.

mzML 1.1.0 compatibility (2010/08/31)

Thanks to a code contribution from Fredrik Levander from the Department of Immunotechnology - Protein Technology at Lund University, X! Tandem, P3, Hunter and Common have been updated to deal with the latest version of mzML. The code repository has the appropriately altered version of the source code files:

saxhandler.cpp;
saxmzmlhandler.cpp; and
saxmzmlhandler.h.

The code in the release versions will be changed in the next full release of the software. All of the public GPM search sites have been updated and people with GPM-XE can obtain the mzML compatible executables using the normal software updating link.

Request For Comment 2010.08.30: changes to cRAP

The common Repository of Adventitious Proteins project (cRAP) would like to request user comments on a proposed revision of the protein sequences currently included in its list of proteins. If you would like to comment, please email your suggestions to rfc@thegpm.org. This RFC will remain active until October 1, 2010, at which time the resulting changes will be posted on the new RFC page.

The proposed changes are as follows:

The removal of the "category 4" proteins from the list (see the cRAP page for a complete list of these proteins). The affected sequences (the Sigma/Aldrich Universal Protein Standard) were originally placed in cRAP in anticipation of broad use of this mixture resulting in significant levels of contamination of non-human samples. Our analysis of the data shows that this has not happened. The continued presence of these alternate protein sequences in cRAP has given rise to added complexity in human data samples, which we feel it would be best to eliminate.
The addition of two new viral protein sequences. These sequences, the heavy and light chain of the human adenovirus C E1B control protein, are observed in proteomics data obtained from HEK293 cells. These proteins are not the result of an infection: they are caused by the DNA insertion that immortalized this cell line. These proteins have no significant homology to any non-viral proteins and their inclusion should not increase the complexity of analyzed results.
The addition of one new bacterial protein, the DNA K chaperonin from the common cell culture infectious agent Mycoplasma hominis. This bacterium can be present at some level in cell culture and this protein is meant to serve as an indicator of the potential infection of any sample derived from these cells. M. hominis ATP synthase subunit beta is another potential candidate; however, its sequence has sufficient homology to mammalian ATP synthase that the danger of false positive assignments makes it a less attractive candidate marker.

Data set of the week: (2010/08/29)
Human Ccr4-Not complexes contain variable deadenylase subunits.

This data set contains nine (9) LC/MS/MS analyses. The original manuscript describing this work was published by Lau NC, Kolkman A, van Schaik FM, Mulder KW, Pijnappel WW, Heck AJ, and Timmers HT. in Biochem J. 2009 422:443-53 (PubMed).

The study contained eight (8) pull-down experiments and one (1) control. Each pull-down is annotated with the bait protein. The experiment uses the combination of Lys-C and bovine trypsin characteristic of the Heck group, which generates a rather complete set of tryptic peptides, although there were a significant number of non-tryptic peptides generated. The sample preparation method used urea, so there was also a significant number of carbamylated peptides detected. Neither of these artifacts affects the conclusions of the study.

The study also contains a surprising number of protein identifications that are the best so far obtained in GPMDB, e.g., TNKS1BP1, RAVER1, FHL2, RQCD1, RNF219, UBAP2L, BAG3 as well as the bait CNOT proteins. Pull-down experiments, with their ability to purify an unusual fraction of proteins, seem to be very effective at obtaining the best observations of rare proteins, compared to large MudPit-style survey experiments.

Data set of the week: (2010/08/22)
Low abundance proteome of human red blood cells captured by combinatorial peptide libraries. Behavior of mono- to hexapeptides.

This data set contains 19 LC/MS/MS analyses. The original manuscript describing this work was published by Sim C, Bachi A, Cattaneo A, Guerrier L, Fortis F, Boschetti E, Podtelejnikov A, and Righetti PG. in Anal Chem 2008 80:3547-56 (PubMed).

This study is an excellent example of a very important class of study: attempting to use novel separation strategies to increase the dynamic range of tissue proteomics. The particular strategy used in this case appears to work quite well at obtaining distributions of proteins with limited specificity, while at the same time producing fractions depleted in high abundance proteins. Technically, the data is also of very high quality and it contains an unusual number of high confidence identifications of relatively small peptides (< 1000 Da).

ASLs for A. mellifera, A. carolinensis, X. tropicalis and X. laevis available (2010/08/18)

We are happy to be able to make available the first Annotated Spectrum Libraries for the honeybee (Apis mellifera), a lizard (Anolis carolinensis) and two amphibians (Xenopus tropicalis and Xenopus laevis). These libraries are now available for download (FTP) or for on-line searches at the X! Hunter search server.

Data set of the week: (2010/08/15)
Quantitative analysis of kinase-proximal signaling in lipopolysaccharide-induced innate immune response.

This data set contains 73 LC/MS/MS data sets obtained from mouse RAW 264.7 cells (macrophage cell line) that have been treated with lipopolysaccharide to simulate infection with Gram-negative bacteria. This data was published by Sharma K, Kumar C, Kéri G, Breitkopf SB, Oppermann FS, and Daub H in J Proteome Res. 2010, 9:2539-49 (PubMed).

The goal of the paper was to follow Toll-like receptor phospho-signaling during this sort of simulated infection using SILAC: a combination of unlabelled and labelled samples with two different isotopic tag pairs (K(4),R(6) and K(8),R(10)) was used to detect differential protein and phosphopeptide concentrations.

In addition to the biological conclusions, this data contains some excellent examples of a common analytical artifact associated with the use of titanium dioxide phosphopeptide enrichment. Metal oxide columns work by binding peptides with low pIs (i.e., acidic peptides). While phosphopeptides certainly fill the bill as being acidic relative to most peptides, normal peptide sequences with multiple acidic sidechains are also strongly enriched by these columns. This effect can be clearly seen by using the pI vs. RT and the amino acid analysis displays. In the example used here, most of the peptides detected have a pI < 5. Aspartic acid (D) and glutamic acid (E) residues in the detected peptides are enriched to 250% and 215% of their expected composition, based on the composition of the associated proteins.
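
One plausible way to compute this kind of residue enrichment is sketched below. How the GPM amino acid analysis display actually derives the expected composition is not described here, so the definition used (residue frequency in the detected peptides divided by residue frequency in the parent protein sequences) is an assumption, and the function names are illustrative.

```python
from collections import Counter


def residue_fractions(sequences):
    """Fraction of each residue across a collection of sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()} if total else {}


def enrichment_percent(peptides, proteins, residue):
    """Observed residue frequency in the peptides, as a percentage of the
    frequency expected from the parent protein sequences."""
    observed = residue_fractions(peptides).get(residue, 0.0)
    expected = residue_fractions(proteins).get(residue, 0.0)
    if expected == 0.0:
        raise ValueError("residue absent from the protein sequences")
    return 100.0 * observed / expected
```

A value of 100 means no enrichment; values such as the 250% (D) and 215% (E) quoted above would indicate that acidic residues are strongly over-represented in the detected peptides.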

Release of the GPMDB Guide to the Human Proteome (2010/08/12)

The Human Proteome Organization (HUPO) is in the process of developing a Human Proteome Project (HPP) that will be officially announced at the upcoming HUPO meeting in Sydney this September. As our contribution to the planning process for this important initiative, the information in GPMDB has been summarized into a collection of spreadsheets that we are calling the GPMDB Guide to the Human Proteome. This guide has the information organized into separate spreadsheets for each chromosome, as well as three transposons and mitochondrial DNA. The protein accession numbers, HGNC names and chromosomal coordinates were taken from ENSEMBL v. 58. This first version of the Guide is available in the following formats:

The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/

Release 5 of NCTA tissue proteomics information available (2010/08/10)

The fifth release of the Normal Clinical Tissue Alliance list of proteins found in normal clinical tissue samples is now available. The information can be accessed through the NCTA HTML interface, which also allows the download of spreadsheets for all available tissues. This release is the first to include lung (BTO:0000763), made possible by new data from Dan Liebler's group at Vanderbilt University. The BTO interface in GPMDB has also been updated with this new information.

Probing Proteins: the Scientist (2010/08/09)

This month's issue of "The Scientist" has a good general article on the use of online proteomics data called Probing Proteins, written by Jeffrey M. Perkel. It is written for the general biomedical science community with some explanations of how the different data and information sources can be used.

Data set of the week: (2010/08/08)
Comparative proteome profiling of Mycobacterium tuberculosis: the response of drug-resistant and drug-sensitive strains.

This data set contains 6 (six) MudPit data sets of two different strains of M. tuberculosis, A12998 (daughter strain, drug-resistant) and A7494 (parent strain, drug-sensitive). This data was published via upload to Peptidome as Study PSE133 by Moo-Jin Suh, Rembert Pieper, and Shih-Ting Huang from the J. Craig Venter Institute.

From Peptidome: The study describes the analysis of proteins from Drug-resistant and -sensitive strains of Mycobacterium tuberculosis. LC-MS-based proteomics approach was combined with APEX to quantitatively measure relative proteins abundance and to compare the cellular protein composition of Mycobacterium tuberculosis strains A12998 (daughter strain, drug-resistant) and A7494 (parent strain, drug-sensitive).

The results are probably the most thorough analysis of proteins from this important pathogen and they make up a large fraction of the Annotated Spectrum Libraries available from M. tuberculosis strains.

An unexpected piece of information made available through this data set is a good initial measurement of the phosphoproteome of this prokaryote. M. tuberculosis is known to have a serine/threonine kinase and this data set has a number of very good phosphopeptides generated by this kinase. The kinase appears to prefer threonine phosphorylation, with an S:T ratio of about 1:3. This ratio is the reverse of typical eukaryote kinases, which seem to prefer serine by about 3:1. The phosphoproteome generated from this study is available in either Excel, html or tab-separated text formats, as projected on to the proteome of strain CDC1551. Note: the original analysis in Peptidome did not include phosphorylation, so these results are only present in the GPMDB re-analysis. It would be very useful to have an IMAC-type study done on these and other M. tuberculosis strains.
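
An S:T ratio of this kind can be tallied from a list of phosphorylation sites. The sketch below assumes each site is given as a peptide sequence plus the zero-based position of the modified residue; that input layout is hypothetical and does not correspond to the format of the downloadable phosphoproteome files.

```python
from collections import Counter


def phosphosite_counts(sites):
    """Count phosphorylated residues by amino acid.

    sites is an iterable of (peptide_sequence, zero_based_position) pairs.
    """
    counts = Counter()
    for sequence, position in sites:
        counts[sequence[position]] += 1
    return counts


def serine_to_threonine_ratio(sites):
    """Ratio of phosphoserine to phosphothreonine sites."""
    counts = phosphosite_counts(sites)
    if counts["T"] == 0:
        raise ValueError("no threonine sites observed")
    return counts["S"] / counts["T"]
```

A result near 0.33 would correspond to the 1:3 preference for threonine described above, while a value near 3 would match the serine preference typical of eukaryotic kinases.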

Data set of the week: (2010/08/01)
In-depth proteomic analyses of direct expressed prostatic secretions.

This data set contains 9 (nine) MudPit data sets, each measured from a different prostatic fluid sample from individuals with prostate cancer. The original raw data was obtained from TRANCHE. It was published by Drake RR, Elschenbroich S, Lopez-Perez O, Kim Y, Ignatchenko V, Ignatchenko A, Nyalwidhe JO, Basu G, Wilkins CE, Gjurich B, Lance RS, Semmes OJ, Medin JA, and Kislinger T. in J Proteome Res. 2010, 9:2109-16 (PubMed).

The results show the amount of variability that can be expected when analyzing biological replicates of clinically sampled material. The identifications were very high quality and are the best quality measurements of many rather rare proteins, such as KLK3 (prostate-specific antigen) and ACPP (Prostatic acid phosphatase). The data shows moderate levels of carbamylation from the urea solubilization method used. There were also significant concentrations of peptides generated by non-tryptic cleavage, probably from the presence of proteases in the sample itself as the cleavage sites were not chymotryptic. An examination of the AAA page (e.g., sample #2) showed that the "Pre" and "C-terminal" columns were broadly populated for most residues, not just the K and R residues normally expected in a trypsin cleavage experiment.

Interestingly for a sample obtained from prostate secretions, no proteins originating from genes on the Y chromosome were detected. This fact points out a general feature of proteomics: there does not seem to be any "common sense" association between tissue-specific protein concentrations and chromosomes.

Data set of the week: (2010/07/25)
Proteomic analysis of the secretome of human umbilical vein endothelial cells using a combination of free-flow electrophoresis and nanoflow LC-MS/MS.

This data set contains a single LC/MS/MS data set, using a combination of free-flow electrophoresis and nanoflow HPLC separations. The original raw data was made available as a Scaffold file from a web site maintained by the authors (www.vascular-proteomics.com). It was published by Tunica DG, Yin X, Sidibe A, Stegemann C, Nissum M, Zeng L, Brunet M, and Mayr M in Proteomics. 2009, 9:4991-6 (PubMed).

This study attempts to discover a difficult thing: the secretome of human umbilical vein endothelial cells in the face of the background proteins in a complex growth medium. The results provide a good basis for the examination of this important cell type, with a very good set of identifications that provides a broad survey of the proteins that can be readily obtained from these cells.

Additions to the S. cerevisiae proteome (2010/07/21).

As has been mentioned several times in the Data Set of the Week announcements, proteins from the two viruses Saccharomyces cerevisiae virus L-A (L1) and Saccharomyces cerevisiae virus L-BC (La) have been very commonly observed in proteomics data sets obtained from S. cerevisiae. The signals associated with peptides from these viruses can be quite strong, so it is our belief that not including the proteome of these viruses when searching S. cerevisiae data may lead to missed (or misinterpreted) identifications. Therefore, the proteome of S. cerevisiae has been altered on all of the GPM public search sites to include the proteomes of these viruses (NC_003745 and NC_001641) by default. These two viruses contain a total of five (5) proteins, so there will be little impact on the overall search speed caused by this proteome-level change.

One of these proteins (SwissProt: GAG_SCVLA) has also been added to the cRAP sequence collection.

New X! Hunter Annotated Spectrum Libraries for model species available (2010/07/20).

The most recent versions of the Annotated Spectrum Libraries for H. sapiens, M. musculus, R. norvegicus, S. cerevisiae, D. melanogaster and C. elegans are now available at the X! Hunter project ftp site. These libraries use the ENSEMBL v. 58 protein sequences. Specifics of the build are available at the ASL statistics page. The libraries are available for searches at the X! Hunter server.

Data set of the week: (2010/07/18)
Proteomics Analysis of the Causative Agent of Typhoid Fever.

This data set contains 313 LC/MS/MS runs using Thermo LTQ mass spectrometers. The original raw files were obtained from the Resource Center for Biodefense Proteomics Research, which has been superseded by the Pathogen Portal (raw data). It was published by Ansong C, Yoon H, Norbeck AD, Gustin JK, McDermott JE, Mottaz HM, Rue J, Adkins JN, Heffron F, and Smith RD in J Proteome Res. 2008, 7:546-57 (PubMed).

This very thorough data set is the primary large collection of information that has allowed for the creation of the rather comprehensive annotated spectrum libraries that are now available for S. enterica related species, including S. typhi and S. typhimurium. The Pacific Northwest National Laboratory group was an early proponent of making publicly-funded proteomics raw data widely available and their efforts legitimized the idea for many other groups.

Data set of the week: (2010/07/11)
Discovery of Anthrax Biomarkers Using Label-Free Quantitative Phosphoproteomics via Mass Spectrometry.

This data set contains 66 individual phosphopeptide enriched LC/MS/MS runs made using a Thermo Orbitrap hybrid mass spectrometer. The original raw files were transferred from TRANCHE. The data was credited to Nathan P. Manes, Li Dong, Weidong Zhou, Xiuxia Du, Nikitha Reghu, Arjan C. Kool, Dahan Choi, Charles L. Bailey, Emanuel F. Petricoin III, Lance A. Liotta, and Serguei G. Popov. It was made available prior to publication, although some part of the data was presented at the 2010 ASMS conference.

The analyzed results are simply the best, most consistent set of phosphopeptide results that we have ever seen. The combination of sample preparation, HPLC and mass spectrometry used by the authors has generated what can only be considered a milestone in the application of phosphoproteomics techniques to real tissue samples.

X! Hunter Annotated Spectrum Libraries for 115 prokaryote and 4 virus proteomes are now available

The X! Hunter ASL collection of bacterial proteome libraries has been expanded from the original 16 species to now include 115 prokaryote species and strains. The new species include important pathogenic organisms, such as Shigella dysenteriae, Yersinia pestis and Brucella abortus. The new release also includes 30 individual strains of Escherichia coli. This update of the ASL collection has, for the first time, libraries for four viruses:

Monkeypox virus Zaire 96 I 16,
Saccharomyces cerevisiae virus L A L1,
Saccharomyces cerevisiae virus L BC La, and
Vaccinia virus.

All of these new libraries are available to the public for download or for performing searches.

Data set of the week: (2010/07/04)
Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions.

This data set contains 61 individual experiments using both SILAC and label-free quantitation. The experimental protocols used either trypsin or endo-LysC to digest the proteins, depending on the type of protocol being used. The original raw files were transferred from TRANCHE. The data was published by Hubner NC, Bird AW, Cox J, Splettstoesser B, Bandilla P, Poser I, Hyman A, Mann M in J Cell Biol. 2010 189:739-54 (PubMed).

The data was generated to demonstrate the utility of a new technique for protein quantitation developed by the authors: "quantitative BAC-green fluorescent protein interactomics" (QUBIC). The technique is meant to be applied to the quantitative study of protein-protein interactions, several of which are demonstrated here. The technical quality of the MS/MS data is excellent, with many ids for individual proteins in the top 10% of all GPMDB observations.

Video tutorial (2010/07/03): Finding phosphorylation sites using GPMDB

We have started to make a set of tutorial videos to explain how to use GPMDB for common biomedical research tasks. The first video in this set describes the steps necessary to find the observed phosphorylation sites for a particular protein. The description is in the form of a casual conversation between a biomedical researcher (as played by HRM Queen Elizabeth II) and a GPMDB power user (Beavo the clown), during a chance meeting at the dépanneur. We will be releasing these videos as they are produced. Please see our new tutorials page to check for new videos.

Data set of the week: (2010/06/27)
mTAL Phosphoproteome Data.

This data set contains metal oxide enriched LC/MS/MS observations of phosphopeptides from R. rattus medullary Thick Ascending Limb (mTAL) cells. The raw files were transferred from TRANCHE. The original analysis was reported by Ruwan Gunaratne, Guozhong Ma, Trairak Pisitkun, and Mark A. Knepper as part of the mTAL-PD project. It appears to be closely related to the Collecting Duct Phosphoproteome Database.

The phosphorylated domains obtained are interesting because there is surprisingly little publicly available data from rat cell lines or tissue samples. The phosphopeptide enrichment here was somewhat less effective than in some other studies; however, overall it is quite typical of IMAC phosphopeptide enrichment studies. This study has significantly added to the known phosphorylated domains available for R. rattus through GPMDB's pSYT interface.

Data set of the week: (2010/06/20)
Proteomic analysis of mouse brain microsomes: identification and bioinformatic characterization of endoplasmic reticulum proteins in the mammalian central nervous system.

This data set contains 1 2DLC MS/MS and 3 1DLC MS/MS runs obtained from mouse brain microsomal preparations. The original data was transferred from TRANCHE. The original data analysis was reported by Stevens SM Jr, Duncan RS, Koulen P, Prokai L. in J Proteome Res. 2008 7:1046-54. (PubMed).

This data set is interesting in a number of ways. It shows the difference in the depth of analysis available using multi-dimensional chromatographic analysis versus simple, single separation HPLC. The three repetitions of the 1D LCMS approach give a good indication of the statistical variability that is to be expected because of the under-sampling inherent in this type of measurement. A Gene Ontology analysis of the data (e.g., GPM33080005862) shows the complexity of real microsomal samples, compared to simply believing that they contain only membrane and membrane-associated proteins. A similar study can be compared, showing some significant differences in microsome proteome composition, which are most likely due to variations in the sample preparation methods.

Data set of the week: (2010/06/13)
The minor salivary gland proteome in Sjögren's syndrome.

This data set contains 2 LC-MS-MS runs obtained from human salivary gland tissue. The original data was transferred from PRIDE entries 7962-3. The data was reported by Hjelmervik TO, Jonsson R, Bolstad AI. in Oral Dis. 2009 15:342-53. (PubMed).

The two sets of identifications are meant to show the differences in the protein complement of salivary glands caused by the autoimmune disease, Sjögren's syndrome. Technically, the data is a good example of the use of a high resolution MS/MS device (ESI-QTOF, Ultima Global) applied to tissue samples. The high accuracy fragment ion masses significantly improve the quality of the identifications.

Data set of the week: (2010/06/06)
Identification of Ricin and Concanavalin A-binding Trypanosoma brucei Glycoproteins.

This data set contains 1 data set obtained from T. brucei. The original data was transferred from PRIDE 9223. A portion of the data was reported by Izquierdo L, Schulz BL, Rodrigues JA, Güther ML, Procter JB, Barton GJ, Aebi M, Ferguson MA in EMBO J. 2009 28:2650-61 (PubMed).

The data was obtained by using the lectins concanavalin A and ricin to pull down glycoproteins from T. brucei (blood stream form) and then glycosidases were used to remove the N-linked glycosylation, leaving a deamidated asparagine residue behind. Any deamidated N residue that was associated with the N-{P}-[ST] glycosylation motif should be considered a potential N-linked glycosylation site. You can see just these peptides by clicking here.
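
The motif test described above can be sketched in a few lines of Python. This is only an illustration and not the method used by GPMDB: the function name find_sequon_sites and the example sequence are hypothetical, and the pattern simply encodes N-{P}-[ST] (asparagine, then any residue except proline, then serine or threonine).

```python
import re

# N-{P}-[ST]: asparagine, any residue except proline, then serine or threonine.
# A zero-width lookahead is used so that overlapping motifs are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequon_sites(sequence):
    """Return the 1-based positions of N residues that start an N-{P}-[ST] motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical sequence, used for illustration only.
example = "MKNGSANPTLNNTSK"
print(find_sequon_sites(example))
```

A deamidated asparagine reported at one of the returned positions would be a candidate N-linked glycosylation site; a deamidated N elsewhere would not meet the motif criterion.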

Mouse protein phosphorylation sites (2010/06/05)

As a companion to the list of known human phosphorylation sites, we have also compiled a similar list for the mouse proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 10,266 phosphorylation sites on 4,209 protein sequences, with the following composition:

serine: 5,406;
threonine: 1,617; and
tyrosine: 3,243

Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row, that looks like the following:

ENSMUSP00000028190	Abl1	Y[253]4	Y[393]9	T[394]6	Y[469]6

The columns have the following interpretation:

The ENSEMBL accession number for the protein splice variant;
The MGI gene name associated with that accession number: there may be many splice variants with the same gene name; and
The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
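
For anyone reading the tab-separated text version with a script, the row layout described above could be parsed roughly as follows. This is a sketch under stated assumptions: the function names parse_site and parse_row are hypothetical, fields are assumed to be separated by whitespace, and the confidence number is assumed to be a non-negative integer.

```python
import re

# "X[nnn]C": residue type, sequence position in brackets, relative confidence.
SITE = re.compile(r"^([STY])\[(\d+)\](\d+)$")


def parse_site(token):
    """Split one site token such as 'Y[253]4' into (residue, position, confidence)."""
    match = SITE.match(token.strip())
    if match is None:
        raise ValueError(f"not a site in X[nnn]C notation: {token!r}")
    residue, position, confidence = match.groups()
    return residue, int(position), int(confidence)


def parse_row(row):
    """Split one row into (accession, gene name, list of parsed sites)."""
    fields = row.split()
    accession, gene = fields[0], fields[1]
    return accession, gene, [parse_site(field) for field in fields[2:]]


# The example row shown above.
row = "ENSMUSP00000028190 Abl1 Y[253]4 Y[393]9 T[394]6 Y[469]6"
accession, gene, sites = parse_row(row)
print(accession, gene, sites)
```

Keeping the confidence as a number makes it simple to filter for the better-supported assignments, for example by keeping only sites above a chosen threshold.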

We have to again thank all of the data contributors who have made these comprehensive lists possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.

Human protein phosphorylation sites (2010/06/04)

We have come up with a list of known human phosphorylation sites, based on the data in GPMDB, filtered through the same curation and quality control process that is used to create the Annotated Spectrum Library collection. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 28,089 phosphorylation sites on 10,670 protein sequences, with the following composition:

serine: 16,806;
threonine: 4,361; and
tyrosine: 6,922
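
As a quick consistency check, the per-residue counts can be summed and converted to fractions. The sketch below uses only the counts stated here and in the mouse list of 2010/06/05; the dictionary layout and the function name composition are illustrative choices.

```python
# Site counts as stated in the human (2010/06/04) and mouse (2010/06/05) lists.
counts = {
    "human": {"serine": 16806, "threonine": 4361, "tyrosine": 6922},
    "mouse": {"serine": 5406, "threonine": 1617, "tyrosine": 3243},
}
stated_totals = {"human": 28089, "mouse": 10266}


def composition(residue_counts):
    """Return the total number of sites and the fraction contributed by each residue."""
    total = sum(residue_counts.values())
    return total, {residue: n / total for residue, n in residue_counts.items()}


for species, residue_counts in counts.items():
    total, fractions = composition(residue_counts)
    # The per-residue counts should add up to the total given in the text.
    assert total == stated_totals[species]
    summary = ", ".join(f"{residue} {fraction:.1%}" for residue, fraction in fractions.items())
    print(f"{species}: {total} sites ({summary})")
```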

Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row, that looks like the following:

ENSP00000344789	ACACA	S[66]6	S[117]7	S[350]6	Y[1190]7

The columns have the following interpretation:

The ENSEMBL accession number for the protein splice variant;
The HGNC gene name associated with that accession number: there may be many splice variants with the same gene name; and
The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).

We have to thank all of the data contributors who have made this type of comprehensive list possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.

Data set of the week: (2010/05/30)
Use of fluorescence-activated vesicle sorting for isolation of naked2-associated, basolaterally-targeted exocytic vesicles for proteomic analysis.

This data set contains 6 experiments obtained from C. familiaris and it is probably the best single data set we have in GPMDB from the domestic dog proteome. This work was transferred from TRANCHE and it was published by Cao Z, Li C, Higginbotham JN, Franklin JL, Tabb DL, Graves-Deal R, Hill S, Cheek K, Jerome WG, Lapierre LA, Goldenring JR, Ham AJ, Coffey RJ. in Mol. Cell. Proteomics 2008, 7:1651-67 (PubMed).

The individual experiments show how well fairly straightforward proteomics techniques can perform on vesicular membrane proteins. They also demonstrate the type of comprehensive results that can be obtained using a proteome sequence that is almost completely the result of genome annotation.

Homo sapiens microbiome associated proteome available for searches (2010/05/29)

In honour of the Human Microbiome Project publication in Science, we have compiled all of the proteomes translated for the Human Microbiome Project and assembled them into a searchable FASTA file. You can add all of these proteomes to your searches using either the normal or human search pages (it is the first selection in the "prokaryotes" box).

Email should be OK (& a rendering change) (2010/05/26)

Our Email changes are complete, so all email should be OK as of today.

In order to maintain compatibility with the latest version of the web browser Chrome, we've had to disable the 3D rendering of the protein coverage displays. Once we've figured out how to deal with the changes in Chrome (or a new release of Chrome fixes the problem), we will reinstate the 3D rendering.

Email changes (2010/05/25)

We are changing our email system, so for the next few days emails sent to "thegpm.org" addresses might not be received. We are sorry for any inconvenience.

Data set of the week: (2010/05/23)
A Global Protein Kinase and Phosphatase Interaction Network in Yeast.

This data set contains 450 pull-down experiments obtained from S. cerevisiae. This work was transferred from TRANCHE and it was published by Ashton Breitkreutz, Hyungwon Choi, Jeffrey R. Sharom, Lorrie Boucher, Victor Neduva, Brett Larsen, Zhen-Yuan Lin, Bobby-Joe Breitkreutz, Chris Stark, Guomin Liu, Jessica Ahn, Danielle Dewar-Darch, Teresa Reguly, Xiaojing Tang, Ricardo Almeida, Zhaohui Steve Qin, Tony Pawson, Anne-Claude Gingras, Alexey I. Nesvizhskii, and Mike Tyers in Science 2010 328:1043-6.

Each of the individual results is annotated with the identity of the bait used in the pull-down experiment. L-A and L-BC virus proteins are present in some of the pull-downs. The group did a remarkable job at detecting phosphopeptides for a study that did not do any specific enrichment for these peptides.

The iPhone wins (2010/05/20)

There are some people (probably with very good eye-sight) that use GPMDB on their mobile phones. The chart below gives the breakdown of system usage by telephone operating system, showing that the three most used mobile operating systems are

iPhone (Apple),
Symbian (mostly Nokia), and
a tie between Android (Google) and Blackberry (RIM).

Data set of the week: (2010/05/16)
Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast.

This data set contains 505 LC/MS/MS runs obtained from S. cerevisiae diploid and haploid populations. This work was transferred from TRANCHE and it was published in de Godoy LM, Olsen JV, Cox J, Nielsen ML, Hubner NC, Fröhlich F, Walther TC, Mann M. Nature. 2008 455:1251-4. (PubMed).

The results give a good indication of the relative abundance and observability of yeast proteins in both haploid and diploid cells using either trypsin or endopeptidase LysC to generate peptides and SILAC labels to provide relative quantitation. The data also shows very good examples of the major proteins observable from the double-stranded RNA viruses L-A and L-BC that are almost ubiquitously present in yeast cell cultures. In some cases, these proteins are very strongly observed (e.g. protein #3 in GPM77711001229) and the SILAC labelling can be used to estimate the relative amounts of virus present in the two cell types. To locate the virus and virus-related proteins in any of the individual runs, type "virus" into the Find box at the top of any model page (click here for an example).

Data set of the week: (2010/05/09)
Phosphoproteome analysis of Drosophila melanogaster embryo.

This data set contains 24 LC/MS/MS runs obtained from D. melanogaster embryos. This work was transferred from TRANCHE and it was published in Zhai B, Villén J, Beausoleil SA, Mintseris J, Gygi SP, J Proteome Res. 2008 7:1675-82 (PubMed).

The assignments in this data set give a good overview of phosphorylation in D. melanogaster and they are good examples of phosphopeptides identified using an Orbitrap-LTQ hybrid instrument with CID. The mapped phosphorylation sites from this data set were a major contribution to the pSYT annotation now available for the fruit fly. The predominance of yolk proteins and other larvae-specific proteins in the identified peptides gives a good view of the phosphorylation patterns on proteins that may be under-represented or absent from studies that use mature flies or cells from tissue culture.

System change: discontinuation of IPI sequences (2010/05/03)

As mentioned in an earlier post, protein sequences using the International Protein Index accession number scheme were discontinued in GPM search servers as of May 1, 2010. The removal of this accession number system was made necessary because the European Bioinformatics Institute (EBI), which originated IPI, has discontinued their support for IPI sequences. All searches that have been performed using IPI accessions will still be available and annotation for those searches will be maintained as long as possible. The ability to convert ENSEMBL to IPI protein accession numbers will be maintained until ENSEMBL discontinues its support for this type of conversion.

Data set of the week: (2010/05/02)
Activated Macrophage Proteomics

This data set contains 9 merged results obtained from human macrophages under various conditions. This work was transferred from a TRANCHE project of the same name, created and maintained by Maureen M. Goodenow, Dept. of Pathology, Immunology and Laboratory Medicine University of Florida.

The experiments reported by Dr. Goodenow are proteomics survey studies of macrophages, in which the proteomes of treated cells are separated by SDS-PAGE and the resulting gel is sliced into 15 pieces. The proteins are then digested, the peptides extracted and run using LC/MS/MS. Each one of the entries in GPMDB corresponds to the merged results of the 15 bands. They are good examples of what can be done using gel-slicing experiments to obtain proteomics information about a cell type. It is also an admirable example of valuable data being made available to the general community by an individual investigator.

Data set of the week: (2010/04/25)
Large-scale quantitative LC-MS/MS analysis of detergent-resistant membrane proteins from rat renal collecting duct.

This data set contains 78 LC/MS/MS runs obtained from membrane enriched fractions of tissue samples from rat renal ducts. It was originally published by Yu MJ, Pisitkun T, Wang G, Aranda JF, Gonzales PA, Tchapyjnikov D, Shen RF, Alonso MA, Knepper MA. in Am J Physiol Cell Physiol. 2008 295:C661-78 (PubMed). The data was transferred to GPMDB from TRANCHE.

This study demonstrates that it is possible to generate very good results from membrane proteins isolated from tissue, even those that do not readily dissolve in detergent solutions, such as lipid raft proteins. GO analysis of the resulting protein identifications shows very significant enrichments in proteins known to be either integral membrane, membrane associated or part of the extracellular matrix.

Data set of the week: (2010/04/18)
Targeted tandem affinity purification of PSD-95 recovers core postsynaptic complexes and schizophrenia susceptibility proteins.

This data set contains 70 LC/MS/MS runs obtained using TAP-tag protein isolation, SDS-PAGE separation followed by tandem mass spectrometry. It was originally published by Fernández E, Collins MO, Uren RT, Kopanitsa MV, Komiyama NH, Croning MD, Zografos L, Armstrong JD, Choudhary JS, Grant SG. Mol Syst Biol. 2009;5:269 (PubMed). The data corresponds to the PeptideAtlas accession PAe001454 and was transferred to GPMDB.

The results are a good demonstration of the depth and detail of a particular molecular system that can be obtained by coupling TAP-tagging with protein and subsequent peptide separations. The use of multiple gel slices allows a depth of proteome coverage that would be difficult to obtain using other techniques.

Updated NCBI bacterial proteomes (2010/04/15)

The public GPM search servers have been updated with the most recent set of bacterial proteomes available from the US National Center for Biotechnology Information. This new set of sequences adds approximately 200 new species to the list of proteomes available. Multiple new species of human pathogens have been added, as well as additional strains of species previously available.

ENSEMBL changes affecting GPM species (2010/04/13)

ENSEMBL has been widening its offerings of proteome sequences and annotation over the last few months. Of the most utility to GPM has been the addition of server systems to specifically deal with non-vertebrate species, such as fungi, non-vertebrate metazoa, plants and protists. We are in the process of converting some of the references and annotation sources in GPM to take advantage of these new resources. To date, the following species have been switched to using ENSEMBL metazoa:

A. gambiae,
C. elegans, and
D. melanogaster.

The following species have been switched to using ENSEMBL plant:

O. sativa,
A. thaliana, and
B. distachyon - this new proteome has been added to plant.thegpm.org.

The following species have been switched to using ENSEMBL fungi:

S. pombe.

Data set of the week: (2010/04/11)
Proteomics of mouse liver microsomes

This data set contains 9 LC/MS/MS runs obtained using SDS-PAGE separation followed by tandem mass spectrometry. It was originally published by Zgoda VG, Moshkovskii SA, Ponomarenko EA, Andreewski TV, Kopylov AT, Tikhonova OV, Melnik SA, Lisitsa AV, and Archakov AI in Proteomics, 2009,9:4102-5 (PubMed). The data corresponds to the PRIDE accessions 8848-8856 and was transferred to GPMDB.

This data set is an example of the isolation of a specific experimental fraction (mouse liver microsome from the endoplasmic reticulum) that provides a good representation of proteins not commonly observed, in this case the cytochrome P450 family of metabolic oxidases. The quality of the isolation can be easily seen when viewed as either KEGG pathways or GO cellular components.

STRING-DB link for data models (2010/04/08)

STRING-DB is a long running project for the study of protein-protein interactions at a number of different levels. A new link has been added to GPM "Main model display" pages to make it easy for users to take advantage of the information in STRING-DB. The new link is shown below:

As an example of the use of this new feature, try this data set. Click on the "string-db" link of the display and then click the "Continue" button on the data selection page generated by STRING-DB. You will then get a protein-protein interaction display that indicates known interactions between the proteins that were discovered in the original data set. STRING-DB has a number of interesting features for increasing or decreasing the stringency of the interactions displayed as well as several different views on the data.

This feature is available on all data models that were constructed using ENSEMBL sequences.

Data set of the week: (2010/04/04)
Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations

This data set contains 369 LC/MS/MS runs obtained using a Thermo Finnigan LTQ instrument. It was originally published by Merrihew GE, Davis C, Ewing B, Williams G, Käll L, Frewen BE, Noble WS, Green P, Thomas JH, MacCoss MJ. in Genome Res. 2008, 18:1660-9 (PubMed). The data was obtained directly from the authors' web site and it is not currently held in any of the other data sites.

The original analysis of this data set in the publication used the C. elegans WS150 proteome sequence and it was found to indicate the presence of additional coding sequences. The analysis in GPMDB was performed using the WS200 proteome (ENSEMBL v. 55), which has taken into account the original work. It serves as a good example of the proteins that can be seen using conventional proteomics techniques in C. elegans.

IPI closure by European Bioinformatics Institute (2010/03/28)

The International Protein Index (IPI) will be closing later this year. Because of this change, we will be discontinuing IPI protein sequences as an option effective May 1, 2010. All existing IPI sequence searches will be retained in GPMDB and we will attempt to keep the annotation information for these sequences available as long as is practical. We would strongly suggest that anyone who is currently using IPI should convert over to using ENSEMBL sequences as soon as possible.

Data set of the week: (2010/03/28)
Global proteomic profiling of Shigella dysenteriae Sd1617

This data corresponds to Peptidome Study PSE140, comprised of samples PSM1302, PSM1303 and PSM1304. The data was obtained by Rembert Pieper, Srilatha Kuntumalla, Shih-Ting Huang at the J. Craig Venter Institute and it was transferred from Peptidome.

Each of the samples is composed of 3 replicate multidimensional chromatography runs of soluble proteins obtained from S. dysenteriae. The tandem mass spectra are good quality, obtained using a Thermo LTQ instrument. The results give a good indication of the type of depth and reproducibility that can be expected in this type of straight-forward analysis of soluble proteins from an enterobacterial culture.

Data set of the week: (2010/03/21)
Global Impact of Oncogenic Src on a Phosphotyrosine Proteome

The data is composed of 31 separate runs. The data was obtained from a study published in J. Proteome Res., 2008, 7 (8), pp 3447–3460, by Weifeng Luo, Robbert J. Slebos, Salisha Hill, Ming Li, Jan Brábek, Ramars Amanchy, Raghothama Chaerkady, Akhilesh Pandey, Amy-Joan L. Ham and Steven K. Hanks (DOI: 10.1021/pr800187n). This information was transferred from TRANCHE.

The data investigates the impact of Src transformation of mouse cells by determining the tyrosine phosphorylation differences between control and transformed cells. The data also demonstrates the utility of using multiple peptidases to increase the coverage of peptides, compared to trypsin alone. The data is very high quality LTQ data and it is an excellent reference work for what is to be expected when looking for mouse tyrosine phosphorylation.

Data set of the week: (2010/03/14)
Quantitative phosphoproteomic analysis reveals vasopressin V2-receptor-dependent signaling pathways in renal collecting duct cells.

The data is composed of 2 separate sets, corresponding to the Peptidome accession numbers PSM1275 and PSM1276. The data was obtained from a study published in Proc Natl Acad Sci U S A. 2010 Feb 23;107(8):3882-7, by Rinschen MM, Yu MJ, Wang G, Boja ES, Hoffert JD, Pisitkun T, and Knepper MA (PubMed). This information was transferred from TRANCHE. The data is of high quality, containing good identifications of serine and threonine phosphorylation sites in M. musculus proteins and it is an excellent example of the use of SILAC to monitor the relative quantitation of protein phosphorylation.

Data set of the week: (2010/03/07)
Phosphorylation dynamics during early differentiation of human embryonic stem cells.

The data is composed of 12 individual LC/MS/MS runs obtained from a study published in Cell Stem Cell, Volume 5, Issue 2, 214-226, 7 August 2009 by Van Hoof D, Muñoz J, Braam SR, Pinkse MW, Linding R, Heck AJ, Mummery CL, and Krijgsveld J. (PubMed). This information was transferred from TRANCHE. Each of these data sets is large and contains significant numbers of phosphorylated peptides.

The experiments performed were to investigate how "pluripotent stem cells self-renew indefinitely and possess characteristic protein-protein networks that remodel during differentiation. How this occurs is poorly understood. Using quantitative mass spectrometry, the (phospho)proteome of human embryonic stem cells (hESCs) was analyzed during differentiation induced by bone morphogenetic protein (BMP) and removal of hESC growth factors."

Data set of the week: (2010/02/28)
A Lectin HPLC Method to Enrich Selectively-glycosylated Peptides from Complex Biological Samples.

The data is composed of 83 individual LC/MS/MS runs obtained from a study published in J Vis Exp. 2009 Oct 1;(32). pii: 1398 by Johansen E, Schilling B, Lerch M, Niles RK, Liu H, Li B, Allen S, Hall SC, Witkowska HE, Regnier FE, Gibson BW, Fisher SJ, and Drake PM (PubMed). This information was transferred from TRANCHE.

Briefly, plasma was depleted of the fourteen most abundant proteins using a multiple affinity removal system. Depleted plasma was trypsin-digested and separated into flow-through and bound fractions by SNA or AAL HPLC. The fractions were treated with PNGaseF to remove N-linked glycans, and analyzed by LC-MS/MS on a QStar Elite. There is an accompanying video explaining the methods used.

Data set of the week: (2010/02/21)
Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors.

The data is composed of 729 individual LC/MS/MS runs obtained from a study published in Nature Biotechnology by Bantscheff M, Eberhard D, Abraham Y, Bastuck S, Boesche M, Hobson S, Mathieson T, Perrin J, Raida M, Rau C, Reader V, Sweetman G, Bauer A, Bouwmeester T, Hopf C, Kruse U, Neubauer G, Ramsden N, Rick J, Kuster B, and Drewes G. (DOI: 10.1038/nbt1328). This information was transferred from PRIDE (PRIDE accession numbers 2445-3178).

iTRAQ labelling is used for quantitative profiling of the effects of the drugs imatinib (Gleevec), dasatinib (Sprycel) and bosutinib in K562 cells. The profiling confirms known targets, including ABL and SRC family kinases.

Data set of the week: (2010/02/14)
Cell-Specific Information Processing in Segregating Populations of Eph Receptor Ephrin-Expressing Cells.

This dataset was transferred to GPMDB via ProteoExchange from PRIDE. The data is composed of 2 large LC/MS/MS runs from a study published in Science by Jørgensen C, Sherman A, Chen GI, Pasculescu A, Poliakov A, Hsiung M, Larsen B, Wilkinson DG, Linding R, and Pawson T (DOI: 10.1126/science.1176615).

The data is from a set of quantitative mass spectrometric analyses of mixed populations of EphB2- and ephrin-B1–expressing cells that were labeled with different isotopes. These analyses revealed cell-specific tyrosine phosphorylation events. The data is of very high quality and it has a very rich set of tyrosine phosphorylated peptides.

Data set of the week: (2010/02/07)
The value of using multiple proteases for large-scale mass spectrometry-based proteomics.

This dataset was transferred to GPMDB via ProteoExchange from TRANCHE. The data is composed of 15 LC/MS/MS runs from a study published in J. Proteome Research by Danielle L. Swaney, Craig D. Wenger and Joshua J. Coon (DOI: 10.1021/pr900863u).

The data is from experiments in which an S. cerevisiae whole cell lysate was digested with one of five enzymes (trypsin, LysC, ArgC, AspN, and GluC), in triplicate. The results clearly show that any of these proteases can be used very effectively with standard proteomics equipment, giving very similar protein identifications.
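As a rough illustration of why the five enzymes give different peptide sets from the same protein, the following Python sketch performs a simplified in-silico digestion. The cleavage rules are textbook approximations (for example, GluC is treated as cutting only after E, and missed cleavages are ignored), the function name `digest` is hypothetical, and the test sequence is invented; none of this is taken from the study itself.

```python
import re

# Simplified cleavage rules (assumptions, not the study's search settings).
# Each pattern matches the zero-width position at which the chain is cut.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",  # after K or R, but not before P
    "LysC": r"(?<=K)",             # after K
    "ArgC": r"(?<=R)",             # after R
    "AspN": r"(?=D)",              # before D
    "GluC": r"(?<=E)",             # after E (simplified)
}


def digest(sequence, enzyme):
    """Return the fully cleaved peptides of `sequence` for one enzyme."""
    pattern = re.compile(CLEAVAGE_RULES[enzyme])
    # Keep only internal cut positions; the chain ends are added explicitly.
    cuts = [m.start() for m in pattern.finditer(sequence)
            if 0 < m.start() < len(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    example = "AKDERPKGRDLE"  # invented sequence, for illustration only
    for enzyme in CLEAVAGE_RULES:
        print(enzyme, digest(example, enzyme))
```

Each enzyme splits the same sequence at different residues, so the peptides presented to the mass spectrometer differ even when the resulting protein identifications are similar.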

New database server added at Rockefeller University (2010/02/05)

Starting today, a new database server has been added to the GPMDB system, based at Rockefeller University in New York City. This new server joins the other servers at the University of Manitoba, the University of British Columbia and Beavis Informatics, which make up the GPMDB cloud system.

Data set of the week: (2010/01/31)
Identifying blood biomarkers and physiological processes that distinguish humans with superior performance under psychological stress.

This dataset was transferred to GPMDB via ProteoExchange from PRIDE (PRIDE accessions 10075-10092). The data (GPM77710000113-GPM77710000130) is from a study published in PLoS One by Cooksey AM, Momen N, Stocker R, and Burgess SC (PLoS One. 2009 Dec 18;4(12):e8371 PubMed).
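The two accession ranges quoted above can be expanded and compared with a few lines of Python. This is only a consistency check on the counts. The helper name `expand_range` is hypothetical, and the sketch does not claim that the accessions pair up one-to-one in numerical order.

```python
def expand_range(first, last, prefix=""):
    """List every accession from `first` to `last` inclusive, with an optional prefix."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


# Ranges as quoted in the news item.
pride_ids = expand_range(10075, 10092)
gpm_ids = expand_range(77710000113, 77710000130, prefix="GPM")

# Both ranges should contain the same number of entries.
print(len(pride_ids), len(gpm_ids))
print(gpm_ids[0], gpm_ids[-1])
```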

The results show the plasma proteins that change in response to the Modular Egress Training psychological stress test, given to a group of naval aviation students. The data was obtained using an LCQ DECA XP Plus and analyzed using X! Hunter (annotated spectrum library searches).

GPM sites using the new X! Tandem (2010/01/27)

Starting today, the public GPM servers will be using the new release of X! Tandem and X! P3 (2010.01.01.1). Once live testing is complete, the release code for this new version will be made available.

Features new to 2010.01.01 are improved handling of protein N-termini and improved handling of phosphorylated peptides, through the detection of associated neutral losses. The new parameter set includes the following:

quick acetyl - protein N-terminal modification detection,
stP bias - interpretation of peptide phosphorylation models, and
quick pyrolidone - peptide N-terminus cyclization detection.
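A minimal sketch of how these three parameters might be written into an X! Tandem style input file is given below, using only the Python standard library. The "protein, " label prefix, the "yes" values and the surrounding bioml/note layout are assumptions based on the usual X! Tandem parameter file conventions and should be checked against the parameter documentation for the release in use. The function name `build_parameter_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Parameter names from the announcement; the "protein, " prefix and the
# "yes" values are assumptions, not confirmed by the announcement itself.
NEW_PARAMETERS = {
    "protein, quick acetyl": "yes",
    "protein, stP bias": "yes",
    "protein, quick pyrolidone": "yes",
}


def build_parameter_xml(parameters):
    """Serialize label/value pairs as <note type="input"> elements in a <bioml> root."""
    root = ET.Element("bioml")
    for label, value in parameters.items():
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_parameter_xml(NEW_PARAMETERS)
print(xml_text)
```

A real search would also need the spectrum path, taxonomy and output settings, which are omitted here.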

Data set of the week: (2010/01/24)
High quality catalog of proteotypic peptides from human heart

This dataset was transferred to GPMDB from the authors' web site, corresponding to the manuscript of the same name, Kline, KG, et al., J Proteome Res. 2008 Nov;7(11):5055-61. PubMed. This data is not currently available in other repositories.

The data consists of 96 LCMS runs analyzed with a ThermoFinnigan LTQ mass spectrometer. It is a good example of the type of data that can be obtained from cardiac muscle using multidimensional chromatography directly on tissue lysate.

Data set of the week: (2010/01/17)
A Mitochondrial Protein Compendium Elucidates Complex I Disease Biology

This dataset was transferred to GPMDB from TRANCHE, corresponding to the manuscript of the same name, Pagliarini, DJ, et al., Cell 134:112-123 doi:10.1016/j.cell.2008.06.016.

The data consists of 26 individual data sets, composed of replicates of mitochondrial proteins obtained from a variety of mouse tissues (cerebellum, cerebrum, brainstem, spinal cord, kidney, liver, heart, skeletal muscle, testis and placenta). It is a good example of high quality proteomics data, obtained using a Thermo-Finnigan Orbitrap hybrid mass spectrometer.

Data set of the week: (2010/01/10)
Comparative analysis of the human and mouse placental transcriptome and proteome

This dataset was transferred to GPMDB from Peptidome via ProteoExchange, from the Peptidome entries PSM1063 (mouse) and PSM1064 (human). The cells in the tissue were separated from extracellular proteins and various subcellular fractions were analyzed separately. The data was originally published in Cox B, et al., Mol Syst Biol 2009;5:279. PMID: 19536202.

Note: the Peptidome entry misidentifies the mass spectrometry platform as a "TRAP-FTMS", while it is actually a Thermo-Finnigan LTQ (with no additional hybrid component).

Data set of the week: (2010/01/03)
Large-scale phosphorylation analysis of mouse liver

This dataset was transferred to GPMDB from TRANCHE and it is not currently held in any other ProteoExchange database (see data). It is credited to Villén J, Beausoleil SA, Gerber SA, and Gygi SP, and it is described in Proc Natl Acad Sci U S A. 2007 Jan 30;104(5):1488-93.

This data set is a good example of the quality of phosphorylation data that can be obtained using SCX separation of a tissue extract, followed by IMAC phosphopeptide enrichment of each fraction, when using an LTQ-Orbitrap mass spectrometer. The data view that is obtained from the link above shows all of the detected phosphopeptides, with a peptide false positive rate of ~ 0.14%, i.e., about 10 times more stringent than the analysis in the original paper.