The Global Proteome Machine Organization
The home of proteomics crowd-sourced "Big Data"
   Index of GPMDB lists
Proteomics often requires the assembly of category-wide lists of things. These categories can be proteins associated with particular sequence or biological properties, post-translational modifications, or types of experiments. GPMDB can be used to generate these lists, and this page serves as an index to the lists announced for the system.
   Available lists of things
Post-translational modifications:
  1. C. elegans: phosphorylation
  2. D. melanogaster: acetylation, phosphorylation
  3. M. musculus: acetylation, phosphorylation
  4. Mycobacterium tuberculosis: phosphorylation
  5. H. sapiens: acetylation, phosphorylation
  6. S. cerevisiae: acetylation, phosphorylation
GPMDB Guide to the Human Proteome (2012/1/3)
The human protein identification information in GPMDB has been summarized into a collection of spreadsheets that we are calling the GPMDB Guide to the Human Proteome. This guide has the information organized into separate spreadsheets for each chromosome, as well as three transposons and mitochondrial DNA. The protein accession numbers, HGNC names and chromosomal coordinates were taken from ENSEMBL v. 65. This edition of the Guide (2012.01.01) is available in the following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/
GPMDB Guide to the Mouse Proteome (2012/1/3)
The mouse protein identification information in GPMDB has been summarized into a collection of spreadsheets that we are calling the GPMDB Guide to the Mouse Proteome. This guide has the information organized into separate spreadsheets for each chromosome, as well as NT transcripts and mitochondrial DNA. The protein accession numbers, MGI names and chromosomal coordinates were taken from ENSEMBL v. 65. This edition of the Guide (2012.01.01) is available in the following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/mouse_proteome_guide/
Fruit fly protein acetylation sites (2010/06/30)
We have also compiled a list of observed acetylation sites for the fruit fly proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
FBpp0076182 Hsp27 M[1]7 S[2]6
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The FlyBase gene name associated with that accession number; and
  3. The acetylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
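A row in this notation can be split into its parts with a short script. The following is a minimal Python sketch, assuming the columns are separated by whitespace or tabs as in the example row above. The names `SITE_RE`, `Site` and `parse_row` are illustrative only and are not part of GPMDB.

```python
import re
from typing import NamedTuple

# Matches one site in the "X[nnn]C" notation, e.g. "S[2]6".
SITE_RE = re.compile(r"^([A-Z])\[(\d+)\](\d+)$")


class Site(NamedTuple):
    residue: str     # one-letter residue type ("X")
    position: int    # sequence position of the residue ("nnn")
    confidence: int  # relative confidence number ("C"), higher is better


def parse_row(line):
    """Split one row into (accession, gene name, list of Site).

    Assumes the accession and gene name are single tokens, as in the
    ENSEMBL-based lists; rows with multi-word descriptions need other handling.
    """
    fields = line.split()
    accession, gene = fields[0], fields[1]
    sites = []
    for token in fields[2:]:
        match = SITE_RE.match(token)
        if match is None:
            raise ValueError(f"not a site token: {token!r}")
        residue, position, confidence = match.groups()
        sites.append(Site(residue, int(position), int(confidence)))
    return accession, gene, sites


accession, gene, sites = parse_row("FBpp0076182 Hsp27 M[1]7 S[2]6")
print(accession, gene)
for site in sites:
    print(site.residue, site.position, site.confidence)
```

The same function should apply to the other lists on this page that use ENSEMBL accessions and single-word gene names, since they share the "X[nnn]C" notation.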
C. elegans protein phosphorylation sites (2010/08/11)
We have compiled a list of observed phosphorylation sites for the C. elegans proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
C54C6.2 ben-1 Y[208]8 S[338]7 S[364]6
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The WikiGene gene name associated with that accession number; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Fruit fly protein phosphorylation sites (2010/06/30)
We have also compiled a list of observed phosphorylation sites for the fruit fly proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
FBpp0073454 Amun T[250]5 T[272]5 S[274]11 S[502]8
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The FlyBase gene name associated with that accession number; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Yeast protein acetylation sites (2010/06/30)
We have also compiled a list of observed acetylation sites for the yeast proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
YOL139C CDC33 S[2]8 K[181]6 K[183]6 K[187]6
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The SGD gene name associated with that accession number; and
  3. The acetylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Yeast protein phosphorylation sites (2010/06/30)
We have also compiled a list of observed phosphorylation sites for the yeast proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
YMR303C ADH2 S[177]4 S[230]7 T[236]5 S[278]7
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The SGD gene name associated with that accession number; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Mouse protein acetylation sites (2010/08/10)
We have also compiled a list of observed acetylation sites for the mouse proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
ENSMUSP00000037348 Acaa2 A[2]8 K[241]5 K[270]5 K[340]7
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The MGI gene name associated with that accession number: there may be many splice variants with the same gene name; and
  3. The acetylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Human protein acetylation sites (2010/07/27)
We have also compiled a list of observed acetylation sites for the human proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. The list is composed of protein N-terminal and lysine acetylations only.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
ENSP00000264649 ATP6V0A1 G[2]10 K[74]7 K[103]7 K[376]6
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The HGNC gene name associated with that accession number: there may be many splice variants with the same gene name; and
  3. The acetylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Mycobacterium tuberculosis protein phosphorylation sites (2010/08/10)
This list is a compilation of observed serine/threonine phosphorylation sites for the Mycobacterium tuberculosis proteome (strain CDC1551), based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 41 phosphorylation sites on 35 protein sequences, with the following composition:
  1. serine: 18; and
  2. threonine: 23.
Each NCBI gi protein accession number has a listing of all observed sites in a single row that looks like the following:
gi|15840936| aconitate hydratase S[716]4
The columns have the following interpretation:
  1. The NCBI gi accession number for the protein;
  2. The NCBI gene description associated with that accession number; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
We have to again thank all of the data contributors who have made these comprehensive lists possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Mouse protein phosphorylation sites (2010/08/10)
As a companion to the list of known human phosphorylation sites, we have also compiled a similar list for the mouse proteome, based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 22,855 phosphorylation sites on 8,190 protein sequences, with the following composition:
  1. serine: 15,758;
  2. threonine: 3,251; and
  3. tyrosine: 3,846.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
ENSMUSP00000028190 Abl1 Y[253]4 Y[393]9 T[394]6 Y[469]6
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The MGI gene name associated with that accession number: there may be many splice variants with the same gene name; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
We have to again thank all of the data contributors who have made these comprehensive lists possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Human protein phosphorylation sites (2010/07/27)
We have come up with a list of known human phosphorylation sites, based on the data in GPMDB, filtered through the same curation and quality control process that is used to create the Annotated Spectrum Library collection. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 47,613 phosphorylation sites on 16,511 protein splice variant sequences, with the following composition:
  1. serine: 33,644;
  2. threonine: 6,454; and
  3. tyrosine: 7,515.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
ENSP00000344789 ACACA S[66]6 S[117]7 S[350]6 Y[1190]7
The columns have the following interpretation:
  1. The ENSEMBL accession number for the protein splice variant;
  2. The HGNC gene name associated with that accession number: there may be many splice variants with the same gene name; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
We have to thank all of the data contributors who have made this type of comprehensive list possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
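A residue composition like the serine/threonine/tyrosine breakdown reported for this list could be recomputed from the rows themselves. The sketch below is one possible way to do that in Python, assuming each row is available as a line of text; the function name `residue_composition` and the `min_confidence` cut-off are hypothetical, and because the confidence number is only relative, any particular threshold is an arbitrary choice rather than a GPMDB recommendation.

```python
import re
from collections import Counter

# One site token in the "X[nnn]C" notation, e.g. "Y[1190]7".
SITE_RE = re.compile(r"([A-Z])\[(\d+)\](\d+)")


def residue_composition(rows, min_confidence=0):
    """Count sites per residue type, keeping sites at or above min_confidence."""
    counts = Counter()
    for row in rows:
        # Skip the first two columns (accession and gene name);
        # tokens that are not in site notation are ignored.
        for token in row.split()[2:]:
            match = SITE_RE.fullmatch(token)
            if match and int(match.group(3)) >= min_confidence:
                counts[match.group(1)] += 1
    return counts


# Two example rows taken from the human and mouse lists on this page.
rows = [
    "ENSP00000344789 ACACA S[66]6 S[117]7 S[350]6 Y[1190]7",
    "ENSMUSP00000028190 Abl1 Y[253]4 Y[393]9 T[394]6 Y[469]6",
]
print(residue_composition(rows))
print(residue_composition(rows, min_confidence=7))
```

Counts produced this way are per row, so a site shared by several splice variants of the same gene is counted once for each variant listed.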
Observed proteins categorized by Gene Ontology terms (2010/05/01)
The ENSEMBL protein accessions used in GPMDB can be readily assigned to specific Gene Ontology (GO) terms, using ENSEMBL's BioMart utility. These lists for all available GO terms have been constructed for three species:
The lists are divided into the three main GO categories: biological process, cellular component and molecular function. Each individual GO term has an entry like:
GO:0006189 [7/7] 'de novo' IMP biosynthetic process
The first column has a link to the list of proteins associated with the GO term accession number. The notation following the accession number, "[n/m]", indicates that "n" proteins have been observed in GPMDB out of the "m" proteins in the proteome assigned to this category. The second column is the controlled vocabulary description of each GO category.
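An entry of this form can be read programmatically to obtain the observed fraction n/m. The following Python sketch assumes the plain-text layout of the example entry (accession, bracketed counts, then the description); `GO_RE`, `GoEntry` and `parse_go_entry` are illustrative names, not part of GPMDB or BioMart.

```python
import re
from typing import NamedTuple

# GO accession, "[n/m]" counts, then the free-text description.
GO_RE = re.compile(r"^(GO:\d{7})\s+\[(\d+)/(\d+)\]\s+(.*)$")


class GoEntry(NamedTuple):
    accession: str    # GO term accession number
    observed: int     # "n": proteins observed in GPMDB
    total: int        # "m": proteins in the proteome assigned to the term
    description: str  # controlled vocabulary description


def parse_go_entry(line):
    """Parse one entry such as "GO:0006189 [7/7] 'de novo' IMP biosynthetic process"."""
    match = GO_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GO entry: {line!r}")
    accession, observed, total, description = match.groups()
    return GoEntry(accession, int(observed), int(total), description)


entry = parse_go_entry("GO:0006189 [7/7] 'de novo' IMP biosynthetic process")
# Guard against an empty category before dividing.
coverage = entry.observed / entry.total if entry.total else 0.0
print(entry.accession, f"{coverage:.0%}", entry.description)
```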
Observed human proteins by tissue type (2010/05/01)
The lists below were constructed from data supplied by the Normal Clinical Tissue Alliance. Proteomics data from selected studies of clinical tissue were analyzed and conservative lists of identified proteins were constructed. The lists are organized by the best available BRENDA ontology term for the tissue, with the exception of red blood cells, which are not currently in BRENDA.
The lists given below have the proteins in plasma removed (with the exception of the plasma list).
BRENDA ID Description
BTO:0000131 blood plasma
BTO:0000132 blood platelet
BTO:0000133 blood serum
BTO:0000140 bone
BTO:0000142 brain
BTO:0000155 bronchoalveolar lavage
BTO:0000237 cerebrospinal fluid
CL:0000232 erythrocyte
BTO:0000502 gastric fundus
BTO:0001501 hair
BTO:0000723 lens
BTO:0000759 liver
BTO:0000763 lung
BTO:0001202 saliva
BTO:0001419 urine
The 1,000 most observed human & mouse proteins (updated 2010/07/07)
These spreadsheets (top_1000_human_100707.xls and top_1000_mouse_100707.xls) list protein sequences that have been observed most often by GPM users who used the "human" or "mouse" ENSEMBL proteome sequences. The columns in the spreadsheet are as follows:
  1. Column A: ENSEMBL protein accession number for the sequences;
  2. Column B: HUGO Gene Nomenclature Committee symbol for the associated gene;
  3. Column C: NCBI gene number for the associated gene;
  4. Column D: International Protein Index accession number for the sequence;
  5. Column E: Swiss-Prot/UniProt accession for the sequence;
  6. Column F: the probability that a protein will be found in a dataset (%);
  7. Column G: the base-10 log of the minimum protein expectation value found; and
  8. Column H: a text description of the protein.
A "dataset" corresponds to a submitted set of MS/MS spectra, which results in a GPM result file, so it is roughly equivalent to the set of data from an LC/MS/MS run. A protein can only be observed once in a dataset. The value in Column F was calculated by taking the number of times (n_i) that the protein was observed in the approximately 24,000 (N) datasets examined and doing the simple calculation:
p_i = 100 × (n_i / N)
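As a worked illustration of the Column F calculation, the Python sketch below applies the formula to a made-up count. The function name `observation_probability` and the value of 6,000 observations are hypothetical; only the total of approximately 24,000 datasets comes from the text above.

```python
def observation_probability(n_i, n_datasets):
    """Return the percentage of datasets in which a protein was observed.

    n_i        -- number of datasets in which the protein was observed
    n_datasets -- total number of datasets examined (N)
    """
    if n_datasets <= 0:
        raise ValueError("n_datasets must be positive")
    if not 0 <= n_i <= n_datasets:
        # A protein can only be observed once per dataset.
        raise ValueError("n_i must be between 0 and n_datasets")
    return 100.0 * n_i / n_datasets


# Hypothetical example: a protein observed in 6,000 of 24,000 datasets.
print(observation_probability(6000, 24000))  # prints 25.0
```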
Copyright © 2010-2011, The Global Proteome Machine Organization