Data Integrator
HSEnsgProteinMapper - Ensembl gene ID to protein ID mapping

Maps between Ensembl gene IDs and a variety of protein IDs for Homo sapiens. The tool thus serves as an entry point into the world of protein identifiers.

Input

The tool's input is a column that holds either an Ensembl gene or transcript ID, a Consensus CDS ID, or a protein ID from either Ensembl or UniProt/SwissProt. Gene identifiers from other databases may be converted to Ensembl gene IDs using the tool described in HSGeneIdConverter - Homo sapiens gene ID converter. The input data types can be divided into three distinct blocks:

Gene              Ensembl protein         UniProt/SwissProt
Ensembl gene ID   Ensembl protein ID      Accession number
                  Ensembl transcript ID   Entry name
                  NCBI CCDS ID
                                          UniProt/TrEmbl
                                          Accession number
                                          Entry name

Usually, mapping is carried out via the Ensembl gene ID, as mentioned in HSGeneIdConverter - Homo sapiens gene ID converter. However, because of the unique relationships between Ensembl protein, transcript and CCDS identifiers, we have made an exception to this rule: if any of these three identifiers is mapped to one of the other two, the mapping is executed through their natural link, the Ensembl transcript ID. UniProt/SwissProt entries are curated forms of UniProt/TrEmbl entries, so the two are mutually exclusive and cannot be mapped to each other. Mapping is generally only possible between blocks. Mapping within TrEmbl or within SwissProt entries (accession number to entry name and vice versa) is possible, though.
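The routing rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the tool's actual implementation: the function name mapping_route and the data type labels are hypothetical and are not the values accepted by the tool's options. Every pair not covered by an explicit rule is assumed to use the default route via the Ensembl gene ID.

```python
from typing import Optional

# Hypothetical labels for illustration; they are NOT the tool's option values.
ENSEMBL_PROTEIN_BLOCK = {"ensembl_protein", "ensembl_transcript", "ccds"}
SWISSPROT = {"swissprot_accession", "swissprot_entry_name"}
TREMBL = {"trembl_accession", "trembl_entry_name"}
ALL_TYPES = {"ensembl_gene"} | ENSEMBL_PROTEIN_BLOCK | SWISSPROT | TREMBL


def mapping_route(source: str, target: str) -> Optional[str]:
    """Return the identifier type that links source and target,
    or None if the text describes the mapping as impossible."""
    for name in (source, target):
        if name not in ALL_TYPES:
            raise ValueError(f"unknown data type: {name}")

    # Exception: protein, transcript and CCDS IDs are linked
    # through the Ensembl transcript ID.
    if source in ENSEMBL_PROTEIN_BLOCK and target in ENSEMBL_PROTEIN_BLOCK:
        return "ensembl_transcript"

    # SwissProt and TrEmbl entries mutually exclude each other.
    if (source in SWISSPROT and target in TREMBL) or (
        source in TREMBL and target in SWISSPROT
    ):
        return None

    # Assumed default: everything else goes through the Ensembl gene ID.
    return "ensembl_gene"


print(mapping_route("ensembl_protein", "ccds"))
print(mapping_route("swissprot_accession", "trembl_entry_name"))
print(mapping_route("ensembl_gene", "swissprot_accession"))
```

The three calls cover the transcript exception, the excluded SwissProt/TrEmbl pair, and the default gene route, in that order.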

Options applicable to more than a single tool are summarized in common command line options.

Possible input (and output) formats of the available data types are summarized in the following table. For further information on UniProt identifiers, please visit the UniProt FAQ.

Data type                                  Output column header name      Example
Ensembl gene ID                            Human Ensembl Gene ID          ENSG00000144199
Ensembl transcript ID                      Human Ensembl Transcript ID    ENST00000361453
Ensembl protein ID                         Human Ensembl Protein ID       ENSP00000410470
NCBI Consensus Coding Sequence (CCDS) ID   Human CCDS ID                  CCDS41992
UniProt/SwissProt accession number         UniProt/SwissProt Accession    Q9NY61
UniProt/SwissProt entry name               UniProt/SwissProt Entry Name   AATF_HUMAN
UniProt/TrEmbl accession number            UniProt/TrEMBL Accession       Q12843
UniProt/TrEmbl entry name                  UniProt/TrEmbl Entry Name      Q12843_HUMAN
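Before choosing the input data type, it can help to check which format a column actually contains. The following Python sketch guesses the data type of a single identifier from its shape. It is not part of the tool: the function name guess_id_type and the labels are hypothetical, the Ensembl and CCDS patterns are modelled on the examples in the table (version suffixes are not handled), and the accession pattern follows the accession format published by UniProt. An accession number alone does not reveal whether an entry belongs to SwissProt or TrEmbl, so the sketch reports both simply as UniProt.

```python
import re
from typing import Optional

# Patterns are assumptions derived from the example identifiers in the table.
PATTERNS = [
    ("Ensembl gene ID", re.compile(r"ENSG\d{11}")),
    ("Ensembl transcript ID", re.compile(r"ENST\d{11}")),
    ("Ensembl protein ID", re.compile(r"ENSP\d{11}")),
    ("CCDS ID", re.compile(r"CCDS\d+")),
    # Human entry names end in _HUMAN (e.g. AATF_HUMAN, Q12843_HUMAN).
    ("UniProt entry name", re.compile(r"[A-Z0-9]{1,10}_HUMAN")),
    # Accession format as documented by UniProt (6 or 10 characters).
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
]


def guess_id_type(value: str) -> Optional[str]:
    """Return a coarse label for the identifier, or None if nothing matches."""
    value = value.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return label
    return None


for example in ("ENSG00000144199", "CCDS41992", "Q9NY61", "AATF_HUMAN"):
    print(example, "->", guess_id_type(example))
```

The patterns are checked in order, so the more specific Ensembl and CCDS prefixes are tested before the generic UniProt shapes.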

Output example

Output data types are the same as the input data types documented in the Input section. The following example starts from Ensembl gene identifiers and successively adds protein identifiers to the table. The input file /tmp/ensg2protein.tsv reads as follows:

Ensembl Gene ID
ENSG00000185800
ENSG00000234745
ENSG00000100241
ENSG00000154165
ENSG00000113946
ENSG00000000001

And with the following command

chris-cmd$ ./HSEnsgProteinMapper.py -H --in ensg --out sp -c 1 /tmp/ensg2protein.tsv | ./HSEnsgProteinMapper.py -H --in sp --out spid -c 2 -

we arrive at the output file

Ensembl Gene ID   UniProt/SwissProt Accession   UniProt/SwissProt Entry Name
ENSG00000185800   Q09019   DMWD_HUMAN
ENSG00000234745   P01889   1B07_HUMAN
ENSG00000234745   P01889   1B42_HUMAN
ENSG00000234745   P01889   1B48_HUMAN
ENSG00000234745   P01889   1B67_HUMAN
ENSG00000234745   P01889   1B73_HUMAN
ENSG00000234745   P01889   1B81_HUMAN
ENSG00000234745   P30480   1B07_HUMAN
ENSG00000234745   P30480   1B42_HUMAN
ENSG00000234745   P30480   1B48_HUMAN
ENSG00000234745   P30480   1B67_HUMAN
ENSG00000234745   P30480   1B73_HUMAN
ENSG00000234745   P30480   1B81_HUMAN
ENSG00000234745   P30486   1B07_HUMAN
ENSG00000234745   P30486   1B42_HUMAN
ENSG00000234745   P30486   1B48_HUMAN
ENSG00000234745   P30486   1B67_HUMAN
ENSG00000234745   P30486   1B73_HUMAN
ENSG00000234745   P30486   1B81_HUMAN
ENSG00000234745   Q29836   1B07_HUMAN
ENSG00000234745   Q29836   1B42_HUMAN
ENSG00000234745   Q29836   1B48_HUMAN
ENSG00000234745   Q29836   1B67_HUMAN
ENSG00000234745   Q29836   1B73_HUMAN
ENSG00000234745   Q29836   1B81_HUMAN
ENSG00000234745   Q31610   1B07_HUMAN
ENSG00000234745   Q31610   1B42_HUMAN
ENSG00000234745   Q31610   1B48_HUMAN
ENSG00000234745   Q31610   1B67_HUMAN
ENSG00000234745   Q31610   1B73_HUMAN
ENSG00000234745   Q31610   1B81_HUMAN
ENSG00000234745   Q31612   1B07_HUMAN
ENSG00000234745   Q31612   1B42_HUMAN
ENSG00000234745   Q31612   1B48_HUMAN
ENSG00000234745   Q31612   1B67_HUMAN
ENSG00000234745   Q31612   1B73_HUMAN
ENSG00000234745   Q31612   1B81_HUMAN
ENSG00000100241   O95248   MTMR5_HUMAN
ENSG00000154165   P49685   GPR15_HUMAN
ENSG00000113946   Q9Y5I7   CLD16_HUMAN
ENSG00000000001   --       --

If an identifier cannot be mapped to the desired target database, the empty-cell identifier (-- in the example above) is output.
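In the example output, a gene with several accessions yields one row per combination of accession and entry name, and unmapped genes carry the empty-cell marker. For downstream use it can be convenient to collapse such a table into one list of accessions per gene. The Python sketch below does this under two assumptions that are not confirmed by the tool itself: columns are separated by tabs (suggested by the .tsv extension), and the empty-cell marker is -- as in the last row of the example. The function name accessions_per_gene is hypothetical.

```python
import csv
import io
from typing import Dict, List

EMPTY_CELL = "--"  # assumed from the last row of the example output


def accessions_per_gene(tsv_text: str) -> Dict[str, List[str]]:
    """Collect the distinct values of column 2 for each gene in column 1.

    Genes that could not be mapped are kept with an empty list.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header line
    result: Dict[str, List[str]] = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        gene, accession = row[0], row[1]
        accessions = result.setdefault(gene, [])
        if accession != EMPTY_CELL and accession not in accessions:
            accessions.append(accession)
    return result


# A shortened excerpt of the example output, with tab-separated columns.
sample = (
    "Ensembl Gene ID\tUniProt/SwissProt Accession\tUniProt/SwissProt Entry Name\n"
    "ENSG00000185800\tQ09019\tDMWD_HUMAN\n"
    "ENSG00000234745\tP01889\t1B07_HUMAN\n"
    "ENSG00000234745\tP01889\t1B42_HUMAN\n"
    "ENSG00000234745\tP30480\t1B07_HUMAN\n"
    "ENSG00000000001\t--\t--\n"
)
for gene, accessions in accessions_per_gene(sample).items():
    print(gene, accessions)
```

Keeping unmapped genes with an empty list, rather than dropping them, makes it easy to see which input identifiers had no counterpart in the target database.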