Data Integrator
Maps between Ensembl gene IDs and a variety of protein IDs for Homo sapiens. The tool thus serves as an entry point into the world of protein identifiers.
The tool's input is a column that holds either an Ensembl gene or transcript ID, a Consensus CDS (CCDS) ID, or a protein ID from either Ensembl or UniProt (SwissProt or TrEmbl). Gene identifiers from other databases may be converted to Ensembl gene IDs using the tool described in HSGeneIdConverter - Homo sapiens gene ID converter. The input data types can be divided into three distinct blocks:
Gene | Ensembl protein | UniProt (SwissProt and TrEmbl) |
---|---|---|
Ensembl gene ID | Ensembl protein ID | UniProt/SwissProt accession number |
 | Ensembl transcript ID | UniProt/SwissProt entry name |
 | NCBI CCDS ID | UniProt/TrEmbl accession number |
 | | UniProt/TrEmbl entry name |
Usually mapping is carried out using the Ensembl gene ID, as mentioned in HSGeneIdConverter - Homo sapiens gene ID converter. However, due to the unique relationships between Ensembl protein, transcript and CCDS identifiers, we have made an exception to this rule: if any of these three identifiers is mapped to one of the other two, the mapping is executed through their natural link, the Ensembl transcript ID. UniProt/SwissProt entries are curated forms of UniProt/TrEmbl entries, so the two are mutually exclusive and cannot be mapped to each other. Apart from these cases, mapping is generally only possible between blocks. Mapping within SwissProt or within TrEmbl (accession number to entry name and vice versa) is possible, though.
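The mapping rules above can be sketched as a small lookup function. This is only an illustration, not the tool's implementation: the type labels (`ensembl_gene`, `swissprot_accession`, and so on) are hypothetical names chosen for this sketch and are not the tool's command line option values. The sketch also assumes that accession number to entry name mappings follow the usual gene ID rule, which the text does not state explicitly.

```python
# Hypothetical labels for the identifier types; NOT the tool's option names.
TRANSCRIPT_LINKED = {"ensembl_protein", "ensembl_transcript", "ccds"}
SWISSPROT = {"swissprot_accession", "swissprot_entry_name"}
TREMBL = {"trembl_accession", "trembl_entry_name"}
ALL_TYPES = {"ensembl_gene"} | TRANSCRIPT_LINKED | SWISSPROT | TREMBL


def mapping_link(source, target):
    """Return the identifier type used as the link between source and target,
    or None if the two types cannot be mapped to each other."""
    for id_type in (source, target):
        if id_type not in ALL_TYPES:
            raise ValueError(f"unknown identifier type: {id_type}")
    if source == target:
        return None  # nothing to map
    # Ensembl protein, transcript and CCDS IDs are linked via the transcript ID.
    if source in TRANSCRIPT_LINKED and target in TRANSCRIPT_LINKED:
        return "ensembl_transcript"
    # SwissProt and TrEmbl entries mutually exclude each other.
    if (source in SWISSPROT and target in TREMBL) or (
        source in TREMBL and target in SWISSPROT
    ):
        return None
    # Everything else follows the usual rule. Assumption of this sketch:
    # this includes accession <-> entry name within SwissProt or TrEmbl.
    return "ensembl_gene"


if __name__ == "__main__":
    print(mapping_link("ensembl_protein", "ccds"))
    print(mapping_link("swissprot_accession", "trembl_accession"))
    print(mapping_link("ensembl_gene", "swissprot_entry_name"))
```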
Options applicable to more than a single tool are summarized in common command line options.
Possible input (and output) formats of the available data types are summarized in the following table. For further information on UniProt identifiers, please see the UniProt FAQ.
Data type | Output column header name | Example |
---|---|---|
Ensembl gene ID | Human Ensembl Gene ID | ENSG00000144199 |
Ensembl transcript ID | Human Ensembl Transcript ID | ENST00000361453 |
Ensembl protein ID | Human Ensembl Protein ID | ENSP00000410470 |
NCBI Consensus Coding Sequence (CCDS) ID | Human CCDS ID | CCDS41992 |
UniProt/SwissProt accession number | UniProt/SwissProt Accession | Q9NY61 |
UniProt/SwissProt entry name | UniProt/SwissProt Entry Name | AATF_HUMAN |
UniProt/TrEmbl accession number | UniProt/TrEMBL Accession | Q12843 |
UniProt/TrEmbl entry name | UniProt/TrEmbl Entry Name | Q12843_HUMAN |
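As a rough aid for checking an input column before mapping, the formats in the table can be approximated with regular expressions. The sketch below is not part of the tool; the function name `guess_id_type` is hypothetical. The patterns assume 11-digit Ensembl IDs and the accession number format published by UniProt, and entry names are restricted to the `_HUMAN` suffix used here. Note that an accession number alone does not reveal whether an entry belongs to SwissProt or TrEmbl, so the sketch reports it only as a UniProt accession number.

```python
import re

# UniProt accession number format as documented by UniProt (6 or 10 characters).
_UNIPROT_AC = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Order matters: TrEmbl entry names (accession + _HUMAN) are tested before
# SwissProt entry names (mnemonic of at most 5 characters + _HUMAN).
PATTERNS = [
    ("Ensembl gene ID", re.compile(r"ENSG\d{11}")),
    ("Ensembl transcript ID", re.compile(r"ENST\d{11}")),
    ("Ensembl protein ID", re.compile(r"ENSP\d{11}")),
    ("NCBI CCDS ID", re.compile(r"CCDS\d+")),
    ("UniProt/TrEmbl entry name", re.compile(_UNIPROT_AC + r"_HUMAN")),
    ("UniProt/SwissProt entry name", re.compile(r"[A-Z0-9]{1,5}_HUMAN")),
    ("UniProt accession number", re.compile(_UNIPROT_AC)),
]


def guess_id_type(identifier):
    """Return the name of the first matching data type, or None if none matches."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return None


if __name__ == "__main__":
    for example in ("ENSG00000144199", "CCDS41992", "AATF_HUMAN", "Q12843_HUMAN"):
        print(example, "->", guess_id_type(example))
```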
Output data types are the same as the input data types documented in the Input section. The following example starts from Ensembl gene identifiers and gradually adds protein identifiers to the table. The input file /tmp/ensg2protein.tsv is given as follows:
Ensembl Gene ID
ENSG00000185800
ENSG00000234745
ENSG00000100241
ENSG00000154165
ENSG00000113946
ENSG00000000001
And with the following command
chris-cmd$ ./HSEnsgProteinMapper.py -H --in ensg --out sp -c 1 /tmp/ensg2protein.tsv | ./HSEnsgProteinMapper.py -H --in sp --out spid -c 2 -
we arrive at the output file
Ensembl Gene ID	UniProt/SwissProt Accession	UniProt/SwissProt Entry Name
ENSG00000185800	Q09019	DMWD_HUMAN
ENSG00000234745	P01889	1B07_HUMAN
ENSG00000234745	P01889	1B42_HUMAN
ENSG00000234745	P01889	1B48_HUMAN
ENSG00000234745	P01889	1B67_HUMAN
ENSG00000234745	P01889	1B73_HUMAN
ENSG00000234745	P01889	1B81_HUMAN
ENSG00000234745	P30480	1B07_HUMAN
ENSG00000234745	P30480	1B42_HUMAN
ENSG00000234745	P30480	1B48_HUMAN
ENSG00000234745	P30480	1B67_HUMAN
ENSG00000234745	P30480	1B73_HUMAN
ENSG00000234745	P30480	1B81_HUMAN
ENSG00000234745	P30486	1B07_HUMAN
ENSG00000234745	P30486	1B42_HUMAN
ENSG00000234745	P30486	1B48_HUMAN
ENSG00000234745	P30486	1B67_HUMAN
ENSG00000234745	P30486	1B73_HUMAN
ENSG00000234745	P30486	1B81_HUMAN
ENSG00000234745	Q29836	1B07_HUMAN
ENSG00000234745	Q29836	1B42_HUMAN
ENSG00000234745	Q29836	1B48_HUMAN
ENSG00000234745	Q29836	1B67_HUMAN
ENSG00000234745	Q29836	1B73_HUMAN
ENSG00000234745	Q29836	1B81_HUMAN
ENSG00000234745	Q31610	1B07_HUMAN
ENSG00000234745	Q31610	1B42_HUMAN
ENSG00000234745	Q31610	1B48_HUMAN
ENSG00000234745	Q31610	1B67_HUMAN
ENSG00000234745	Q31610	1B73_HUMAN
ENSG00000234745	Q31610	1B81_HUMAN
ENSG00000234745	Q31612	1B07_HUMAN
ENSG00000234745	Q31612	1B42_HUMAN
ENSG00000234745	Q31612	1B48_HUMAN
ENSG00000234745	Q31612	1B67_HUMAN
ENSG00000234745	Q31612	1B73_HUMAN
ENSG00000234745	Q31612	1B81_HUMAN
ENSG00000100241	O95248	MTMR5_HUMAN
ENSG00000154165	P49685	GPR15_HUMAN
ENSG00000113946	Q9Y5I7	CLD16_HUMAN
ENSG00000000001	--	--
If an identifier cannot be mapped to the desired target database, the empty-cell identifier is output instead (`--` in the example above, as for ENSG00000000001).
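When post-processing such output, it can be convenient to collapse the repeated rows per gene and to treat the empty-cell identifier separately. The following is a minimal sketch, not part of the tool. It assumes tab-separated output with a header line, gene IDs in the first column, accessions in the second, and `--` as the empty-cell identifier, as in the example above. The function name `accessions_by_gene` is hypothetical.

```python
import csv
import io

EMPTY_CELL = "--"  # empty-cell identifier as it appears in the example output


def accessions_by_gene(tsv_text):
    """Map each gene ID (column 1) to its distinct accessions (column 2).

    Genes that could not be mapped are kept with an empty list.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header line
    result = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        gene, accession = row[0], row[1]
        accessions = result.setdefault(gene, [])
        if accession != EMPTY_CELL and accession not in accessions:
            accessions.append(accession)
    return result


if __name__ == "__main__":
    # A few rows taken from the example output above.
    sample = (
        "Ensembl Gene ID\tUniProt/SwissProt Accession\tUniProt/SwissProt Entry Name\n"
        "ENSG00000185800\tQ09019\tDMWD_HUMAN\n"
        "ENSG00000234745\tP01889\t1B07_HUMAN\n"
        "ENSG00000234745\tP01889\t1B42_HUMAN\n"
        "ENSG00000234745\tP30480\t1B07_HUMAN\n"
        "ENSG00000000001\t--\t--\n"
    )
    for gene, accessions in accessions_by_gene(sample).items():
        print(gene, accessions if accessions else "(not mapped)")
```

Keeping unmapped genes in the result with an empty list, rather than dropping them, makes it easy to report which inputs had no match in the target database.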