The Global Proteome Machine Organization
   Index of GPMDB lists
Proteomics often requires the assembly of category-wide lists of things. These categories can be proteins associated with particular sequence or biological properties, post-translational modifications, or types of experiments. GPMDB can be used to generate these lists, and this page serves as an index to the lists announced for the system.
   Available lists of things
Post-translational modifications:
  1. C. elegans: phosphorylation
  2. D. melanogaster: phosphorylation
  3. M. musculus: acetylation, phosphorylation
  4. Mycobacterium tuberculosis: phosphorylation
  5. H. sapiens: acetylation, phosphorylation, ubiquitination
  6. S. cerevisiae: acetylation, phosphorylation
Amino acid polymorphisms:
  1. List of all amino acid polymorphisms in GPMDB

GPMDB Guide to the Human Proteome v. 16b (2014/10/22)
The human protein identification information in GPMDB has been summarized into a collection of spreadsheets, the GPMDB Guide to the Human Proteome (GHP). This guide has the information organized into separate spreadsheets for each chromosome, as well as mitochondrial DNA. The protein accession numbers, HGNC names and chromosomal coordinates were taken from ENSEMBL v. 70. Protein sequences corresponding to transcripts labelled as non-stop or nonsense-mediated decay products have been removed. The new NBS v. 2 algorithm was used to determine the evidence codes for this edition. This 16th edition of the Guide (GHP 2014.10.01) is available in the following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/
GPMDB Guide to the Mouse Proteome v. 16b (2014/10/22)
The mouse protein identification information in GPMDB has been summarized into a collection of spreadsheets, the GPMDB Guide to the Mouse Proteome (GMP). This guide has the information organized into separate spreadsheets for each chromosome, as well as mitochondrial DNA. The protein accession numbers, MGI names and chromosomal coordinates were taken from ENSEMBL v. 69. Protein sequences corresponding to transcripts labelled as non-stop or nonsense-mediated decay products have been removed. The new NBS v. 2 algorithm was used to determine the evidence codes for this edition. This 16th edition of the Guide (GMP 2014.10.01) is available in the following formats:
The files are also available at the GPM FTP site:
ftp://ftp.thegpm.org/projects/annotation/mouse_proteome_guide/
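The guide directories above can be listed programmatically. The following is a minimal sketch using Python's standard ftplib module; it assumes the server is still reachable and accepts anonymous logins, neither of which is confirmed on this page. The function names split_ftp_url and list_ftp_directory are illustrative only.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url: str) -> tuple[str, str]:
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path


def list_ftp_directory(url: str, timeout: float = 30.0) -> list[str]:
    """Return the file names in an FTP directory.

    Assumption: the server allows anonymous login.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()        # anonymous login (assumed to be permitted)
        ftp.cwd(path)      # move to the guide directory
        return ftp.nlst()  # plain list of names in that directory


if __name__ == "__main__":
    guide = "ftp://ftp.thegpm.org/projects/annotation/human_proteome_guide/"
    for name in list_ftp_directory(guide):
        print(name)
```

The same call should work for the mouse guide by substituting the mouse_proteome_guide URL.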
C. elegans protein phosphorylation sites (2010/08/11)
These files represent a comprehensive list of all C. elegans protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for a merged list of all chromosomes are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 25,076
total genes 19,788
phospho-proteins 997
phospho-genes 609
phosphorylation sites 3,069
Fruit fly protein phosphorylation sites (2013/05/13)
These files represent a comprehensive list of all D. melanogaster protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for a merged list of all chromosomes are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 21,223
total genes 13,937
phospho-proteins 2,283
phospho-genes 1,565
phosphorylation sites 8,774
Yeast protein acetylation sites (2013/06/17)
These files represent a comprehensive list of all S. cerevisiae protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The files associated with the annotation and a merged list of all chromosomes are now available by FTP for lysine & N-terminal acetylation. A description of the format of these files is available in the associated "README.txt" file in the same directory. A short summary of the number of acetylated proteins, genes and sites of each type is given in the "stats/stats.txt" file.
Yeast protein phosphorylation sites (2013/05/13)
These files represent a comprehensive list of all S. cerevisiae protein phosphorylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for a merged list of all chromosomes are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 6,692
total genes 6,692
phospho-proteins 2,449
phospho-genes 2,449
phosphorylation sites 16,664
Mouse protein acetylation sites (2013/06/17)
These files represent a comprehensive list of all Mus musculus protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The files associated with the annotation and a merged list of all chromosomes are now available by FTP for lysine & N-terminal acetylation. A description of the format of these files is available in the associated "README.txt" file in the same directory. A short summary of the number of acetylated proteins, genes and sites of each type is given in the "stats/stats.txt" file.
Human protein acetylation sites (2013/06/17)
These files represent a comprehensive list of all Homo sapiens protein N-terminal and lysine acetylation sites represented by good quality data in GPMDB. All of the splice variants listed by ENSEMBL have been annotated.
The files associated with the annotation and a merged list of all chromosomes are now available by FTP for lysine & N-terminal acetylation. A description of the format of these files is available in the associated "README.txt" file in the same directory. A short summary of the number of acetylated proteins, genes and sites of each type is given in the "stats/stats.txt" file.
Mycobacterium tuberculosis protein phosphorylation sites (2010/08/10)
This list is a compilation of observed serine/threonine phosphorylation sites for the Mycobacterium tuberculosis proteome (strain CDC1551), based on the data in GPMDB. This list is available in Excel spreadsheet, tab-separated text and HTML formats. It contains 41 phosphorylation sites on 35 protein sequences, with the following composition:
  1. serine: 18; and
  2. threonine: 23.
Each ENSEMBL splice variant protein accession number has a listing of all observed sites in a single row that looks like the following:
gi|15840936| aconitate hydratase S[716]4
The columns have the following interpretation:
  1. The NCBI gi accession number for the protein splice variant;
  2. The NCBI gene description associated with that accession number; and
  3. The phosphorylated residue in the notation "X[nnn]C", where "X" is the residue type, "nnn" is the sequence position of the residue and "C" is a relative confidence number for the assignment (higher is better).
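The "X[nnn]C" notation described above can be decoded with a short regular expression. The sketch below is illustrative only: it assumes the tab-separated text version of the list, with the three columns separated by tab characters, and it assumes nothing about how multiple sites in one row are delimited. The names Site, parse_sites and parse_row are hypothetical.

```python
import re
from typing import NamedTuple


class Site(NamedTuple):
    residue: str      # "X": one-letter residue type, e.g. S or T
    position: int     # "nnn": sequence position of the residue
    confidence: int   # "C": relative confidence (higher is better)


# One site token in the notation X[nnn]C.
SITE_PATTERN = re.compile(r"([A-Z])\[(\d+)\](\d+)")


def parse_sites(text: str) -> list[Site]:
    """Extract every X[nnn]C token found in a string."""
    return [Site(r, int(p), int(c)) for r, p, c in SITE_PATTERN.findall(text)]


def parse_row(line: str) -> tuple[str, str, list[Site]]:
    """Split one row into (accession, description, sites).

    Assumption: columns are tab-separated, as in the text-format file.
    """
    accession, description, sites = line.rstrip("\n").split("\t", 2)
    return accession, description.strip(), parse_sites(sites)


if __name__ == "__main__":
    row = "gi|15840936|\taconitate hydratase\tS[716]4"
    print(parse_row(row))
```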
We again thank all of the data contributors who have made these comprehensive lists possible. When using this type of information, please use normal caution. Click here for our recommendations for using lists of site assignments.
Mouse protein phosphorylation sites (2013/05/13)
These files represent a comprehensive list of all mouse protein phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 71 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for each chromosome (and a merged list of all chromosomes) are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 45,557
total genes 22,796
phospho-proteins 10,134
phospho-genes 5,277
phosphorylation sites 49,416
Human protein phosphorylation sites (2013/05/12)
As part of our contribution to the Human Proteome Project, we have compiled a comprehensive list of all human protein phosphorylation sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 70 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for each chromosome (and a merged list of all chromosomes) are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of phospho-proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 87,222
total genes 23,287
phospho-proteins 22,621
phospho-genes 7,563
phosphorylation sites 142,832
Human protein ubiquitination sites (2013/09/01)
We have compiled a comprehensive list of all human protein ubiquitination sites represented by good quality data in GPMDB. This list has been subdivided on a chromosome-by-chromosome basis, using ENSEMBL v. 70 as the source of the protein and gene sequences. All of the splice variants listed by ENSEMBL have been annotated.
The annotation files for each chromosome (and a merged list of all chromosomes) are now available by FTP. A description of the format of these files (README.txt) is in the same directory. A short summary of the number of ubiquitin-modified proteins, genes and sites is given here. For unique protein sequences in the proteome, the overall totals are as follows:
unique proteins 87,222
total genes 23,287
ubiquitin-modified proteins 21,436
ubiquitin-modified genes 6,282
ubiquitin-modified sites 77,684
Amino acid polymorphisms in GPMDB (2013/1/2)
The GPM has been generating information about amino acid polymorphisms in model species for the last 5 years. This information has been recorded in GPMDB, which as of Jan. 1, 2013 had approximately 4.8 million observations of amino acid polymorphisms. These observations have been exported to files in either tab-separated value (.txt) or SQLite (.db) format, which are available via FTP. The specific entries in these files are as follows:
SNP id        GPMDB obs. id    HGVS id
rs34037627    199905983        ENSP00000333994:p.V55D
If available, the first column corresponds to an identifier for the associated single nucleotide polymorphism. In cases where there was no associated SNP information, the "HGVS id" information was repeated in this column. The "GPMDB obs. id" is the unique id for the specific peptide sequence identification that was the evidence for each polymorphism.
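One way to read an entry of this kind is sketched below. It assumes three whitespace-separated fields per line and handles only single-residue substitutions written like "p.V55D"; any other HGVS form raises an error rather than being guessed at. When the first column merely repeats the HGVS id, the SNP id is reported as missing. The names Polymorphism and parse_polymorphism are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Polymorphism(NamedTuple):
    snp_id: Optional[str]  # None when no SNP id was associated
    observation_id: int    # GPMDB obs. id of the supporting peptide
    protein: str           # ENSEMBL protein accession
    reference: str         # reference residue (one-letter code)
    position: int          # residue position in the protein
    variant: str           # observed residue (one-letter code)


# Assumption: only simple substitutions such as "ENSP...:p.V55D".
HGVS_PATTERN = re.compile(r"^(\S+):p\.([A-Z])(\d+)([A-Z])$")


def parse_polymorphism(line: str) -> Polymorphism:
    """Parse one 'SNP id / GPMDB obs. id / HGVS id' entry."""
    snp_id, obs_id, hgvs_id = line.split()
    match = HGVS_PATTERN.match(hgvs_id)
    if match is None:
        raise ValueError(f"unsupported HGVS id: {hgvs_id!r}")
    protein, reference, position, variant = match.groups()
    # The first column repeats the HGVS id when no SNP id is known.
    snp = None if snp_id == hgvs_id else snp_id
    return Polymorphism(snp, int(obs_id), protein, reference,
                        int(position), variant)


if __name__ == "__main__":
    print(parse_polymorphism("rs34037627 199905983 ENSP00000333994:p.V55D"))
```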
Observed proteins categorized by Gene Ontology terms (2010/05/01)
The ENSEMBL protein accessions used in GPMDB can be readily assigned to specific Gene Ontology (GO) terms, using ENSEMBL's BioMart utility. These lists for all available GO terms have been constructed for three species:
The lists are divided up into the three main GO categories: biological process; cellular component; and molecular function. Each individual GO term has an entry like:
GO:0006189 [7/7] 'de novo' IMP biosynthetic process
The first column has a link to the list of proteins associated with the GO term accession number. The notation following the accession number "[n/m]" indicates that "n" proteins have been observed in GPMDB out of the "m" proteins in the proteome assigned to this category. The second column is the controlled vocabulary description of each GO category.
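An entry in this layout can be split into its parts as follows. This is a sketch that assumes the plain-text form shown above (accession, "[n/m]", then the description); the names GoEntry and parse_go_entry are hypothetical.

```python
import re
from typing import NamedTuple


class GoEntry(NamedTuple):
    accession: str    # GO term accession, e.g. GO:0006189
    observed: int     # "n": proteins observed in GPMDB
    total: int        # "m": proteins in the proteome with this term
    description: str  # controlled vocabulary description

    @property
    def coverage(self) -> float:
        """Fraction of the term's proteins that have been observed."""
        return self.observed / self.total if self.total else 0.0


GO_PATTERN = re.compile(r"^(GO:\d{7})\s+\[(\d+)/(\d+)\]\s+(.*)$")


def parse_go_entry(line: str) -> GoEntry:
    """Parse a line such as "GO:0006189 [7/7] 'de novo' IMP ..."."""
    match = GO_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized GO entry: {line!r}")
    accession, observed, total, description = match.groups()
    return GoEntry(accession, int(observed), int(total), description)


if __name__ == "__main__":
    entry = parse_go_entry("GO:0006189 [7/7] 'de novo' IMP biosynthetic process")
    print(entry, entry.coverage)
```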
Observed human proteins by tissue type (2010/05/01)
The lists below were constructed from data supplied by the Normal Clinical Tissue Alliance. Proteomics data from selected studies of clinical tissue were analyzed and conservative lists of identified proteins were constructed. The lists are organized by the best available BRENDA ontology term for the tissue, with the exception of red blood cells, which are not currently in BRENDA.
The lists given below have the proteins in plasma removed (with the exception of the plasma list).
BRENDA ID Description
BTO:0000131 blood plasma
BTO:0000132 blood platelet
BTO:0000133 blood serum
BTO:0000140 bone
BTO:0000142 brain
BTO:0000155 bronchoalveolar lavage
BTO:0000237 cerebrospinal fluid
CL:0000232 erythrocyte
BTO:0000502 gastric fundus
BTO:0001501 hair
BTO:0000723 lens
BTO:0000759 liver
BTO:0000763 lung
BTO:0001202 saliva
BTO:0001419 urine
The 1,000 most observed human & mouse proteins (updated 2010/07/07)
These spreadsheets (top_1000_human_100707.xls and top_1000_mouse_100707.xls) list protein sequences that have been observed most often by GPM users who used the "human" or "mouse" ENSEMBL proteome sequences. The columns in the spreadsheet are as follows:
  1. Column A: ENSEMBL protein accession number for the sequences;
  2. Column B: HUGO Gene Nomenclature Committee symbol for the associated gene;
  3. Column C: NCBI gene number for the associated gene;
  4. Column D: International Protein Index accession number for the sequence;
  5. Column E: SwissProt/Uniprot accession for the sequence;
  6. Column F: the probability that a protein will be found in a dataset (%);
  7. Column G: the base-10 log of the minimum protein expectation value found; and
  8. Column H: a text description of the protein.
A "dataset" corresponds to a submitted set of MS/MS spectra, which results in a GPM result file, so it is roughly equivalent to the set of data from an LC/MS/MS run. A protein can only be observed once in a dataset. The value in Column F was calculated by taking the number of times (n_i) that the protein was observed in the approximately 24,000 (N) datasets examined and doing the simple calculation:
p_i = 100 × (n_i / N)
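The Column F calculation can be written as a small function. This is only an illustration of the formula above: the function name observation_percent is hypothetical, and the count of 12,000 in the usage lines is an invented example value, not a figure from the spreadsheets.

```python
def observation_percent(n_i: int, n_datasets: int) -> float:
    """Percentage of datasets in which a protein was observed.

    n_i        -- number of datasets in which the protein was seen
    n_datasets -- total number of datasets examined (N)
    """
    if n_datasets <= 0:
        raise ValueError("n_datasets must be positive")
    if not 0 <= n_i <= n_datasets:
        # A protein can be observed at most once per dataset.
        raise ValueError("n_i must lie between 0 and n_datasets")
    return 100.0 * n_i / n_datasets


if __name__ == "__main__":
    # Hypothetical protein seen in 12,000 of about 24,000 datasets.
    print(observation_percent(12_000, 24_000))
```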
Copyright 2010-2011, The Global Proteome Machine Organization