GPMDB Guide to the Human Proteome (GHP)

A chromosome-by-chromosome look at proteins observed, 2004-2016

Version 22: 2016.7.01

Editor: Ronald C. Beavis

Introduction

GPMDB began recording information about the human proteome on January 1, 2004. It was the first system to use the ENSEMBL sequence annotation system as a source of protein sequences, and as a result it now holds more than ten years of retrospective proteomics information.

Methodology

The spreadsheets were assembled using the ENSEMBL protein splice variant accession numbers for ENSEMBL build 76 (human genome assembly GRCh38). The accession numbers were separated into groups on a chromosome-by-chromosome basis and GPMDB was queried to determine which of those accession numbers had been observed. If an accession number had a GPMDB record, the following data was extracted and represented in the attached spreadsheets:

1. rank: the numerical rating of the best observation for that protein sequence, based on its log(e) value (see #5 below);

2. ENSEMBL splice: the ENSEMBL v. 76 protein accession number;

3. ENSEMBL gene: the ENSEMBL v. 76 gene accession number;

4. # obs.: the number of times the protein has been observed;

5. log(e): the lowest (best) expectation value observed for that protein;

6. HGNC: the HUGO Gene Nomenclature Committee symbol for the gene associated with the protein;

7. EC: evidence code (described at http://wiki.thegpm.org/wiki/GPMDB_evidence_codes);

8. Start: the position of the first nucleotide of the associated gene, in chromosome coordinates;

9. End: the position of the last nucleotide of the associated gene, in chromosome coordinates;

10. Strand: the direction for reading the gene from the chromosome;

11. Band: the chromosomal band that the gene occupies;

12. Description: a text description of the associated gene's function; and

13. TSL: the transcript support level (1 is best).
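The rank column can be illustrated with a short Python sketch. It assumes that rank 1 goes to the accession with the lowest log(e) within a spreadsheet and that ties are broken alphabetically by accession; neither detail is stated in this guide. The function name and the accession strings are placeholders, not GPMDB identifiers.

```python
def assign_ranks(best_log_e):
    """Map accession -> rank, where rank 1 is the lowest (best) log(e).

    Assumption: ties are broken by accession string so the result is
    deterministic; the guide does not say how ties are handled.
    """
    ordered = sorted(best_log_e.items(), key=lambda item: (item[1], item[0]))
    return {accession: rank
            for rank, (accession, _) in enumerate(ordered, start=1)}


# Placeholder accessions and log(e) values, for illustration only.
example = {
    "ENSP_EXAMPLE_A": -10.0,
    "ENSP_EXAMPLE_B": -250.5,
    "ENSP_EXAMPLE_C": -3.2,
}
ranks = assign_ranks(example)
```

Here ENSP_EXAMPLE_B, which has the lowest log(e), receives rank 1.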

In addition to the spreadsheets for the 22 autosomal chromosomes and the 2 sex chromosomes, separate spreadsheets were compiled for mitochondrial DNA (MT) and for genes present on haplotypes or patches (OTHER). The EC calculations were made with the NBS 2 algorithm.
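The division into spreadsheets can be sketched in Python as follows. The sketch assumes that each accession comes with an ENSEMBL-style chromosome name ("1" to "22", "X", "Y" or "MT") and that any other name denotes a haplotype or patch. The function names and the example name "CHR_EXAMPLE_PATCH" are hypothetical.

```python
from collections import defaultdict

# Names that get their own spreadsheet: 22 autosomes, X, Y and MT.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def spreadsheet_for(chromosome_name):
    """Return the spreadsheet label for a chromosome name.

    Assumption: anything that is not a primary chromosome or MT is a
    haplotype or patch and belongs in the OTHER spreadsheet.
    """
    name = chromosome_name.strip()
    return name if name in PRIMARY else "OTHER"


def group_accessions(accession_to_chromosome):
    """Group accession numbers by the spreadsheet they would appear in."""
    groups = defaultdict(list)
    for accession, chromosome in accession_to_chromosome.items():
        groups[spreadsheet_for(chromosome)].append(accession)
    return dict(groups)


# Placeholder accessions, for illustration only.
groups = group_accessions({
    "ENSP_EXAMPLE_A": "7",
    "ENSP_EXAMPLE_B": "MT",
    "ENSP_EXAMPLE_C": "CHR_EXAMPLE_PATCH",
    "ENSP_EXAMPLE_D": "7",
})
```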

Notes

1. No attempt has been made to use protein accession numbers not found in ENSEMBL v. 76. GPMDB has recorded information from many versions of ENSEMBL, some of which contained accession numbers no longer present in v. 76. No algorithms have been used to attempt to re-use that information by projecting protein sequences or chromosomal locations back onto the current ENSEMBL assembly.

2. The HGNC, Start, End, Strand, Band and Description information recorded here was taken directly from the ENSEMBL BioMart system and recorded without further editing or curation.

3. No numerical cutoffs have been used in assembling these tables, except that to be counted a protein identification must have had an expectation value less than or equal to 1, i.e., log(e) ≤ 0.

4. Only protein-coding accessions are included, excluding those predicted to be subject to nonsense-mediated decay.
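The cutoff in note 3 amounts to a one-line test. The Python sketch below assumes that log(e) is the base-10 logarithm of the expectation value; the condition log(e) ≤ 0 is equivalent to an expectation value ≤ 1 for any logarithm base greater than 1. The function names are illustrative only.

```python
import math


def log_e(expectation_value):
    """Return the base-10 logarithm of an expectation value (assumed base)."""
    if expectation_value <= 0:
        raise ValueError("expectation value must be positive")
    return math.log10(expectation_value)


def is_counted(expectation_value):
    """True if an identification meets the cutoff in note 3: log(e) <= 0."""
    return log_e(expectation_value) <= 0.0
```

Under these assumptions an identification with an expectation value of exactly 1 is counted, while one with a value of 2 is not.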