Data Integrator
HSEnsg2gcoords - Retrieve genomic coordinates for genes

This tool provides a mapping from Ensembl gene IDs to gene start/end positions for Homo sapiens.

Input

The input to this tool is a data column of Ensembl gene identifiers. Options that apply to more than one tool are summarized in common command line options.

Output

The program appends at least two columns to each row of the dataset: the begin and end genomic coordinates of the gene in the current genome build. These columns are named Begin and End, respectively. Optionally, two more columns can be appended, in the following order:

  • --inc-strand Include the strand on which the gene is located: + for the forward strand or - for the reverse strand. The column is called Strand.
  • --inc-biotype Include the type of gene product, which Ensembl calls the biotype. The column is called Biotype.

An unknown Ensembl gene identifier produces an empty cell in each appended column (rendered as -- in the example output below).

Let us assume the input file is stored in /tmp/ensg.tsv and has the following content:

Ensg name
ENSG00000233705
ENSG00000231601
ENSG00000264525
ENSG00000118137
ENSG00000281133
ENSG00000250761
NA

The command line

$ python HSEnsg2gcoords.py -H -c 1 --inc-strand --inc-biotype /tmp/ensg.tsv

tells the module that the input file has a header and that the column with Ensembl gene IDs is the first one. Furthermore, the output should include the strand and the biotype information. This command will result in the following output:

      Ensg name               Begin                 End Strand        Biotype
ENSG00000233705  GRCh38:7:107653968  GRCh38:7:107662151      -      antisense
ENSG00000231601    GRCh38:10:743992    GRCh38:10:744958      +        lincRNA
ENSG00000264525   GRCh38:2:66358249   GRCh38:2:66358328      -          miRNA
ENSG00000118137 GRCh38:11:116835751 GRCh38:11:116837950      - protein_coding
ENSG00000281133   GRCh38:1:45580892   GRCh38:1:45580996      -     pseudogene
ENSG00000250761    GRCh38:5:7707802    GRCh38:5:7749766      -      antisense
             NA                  --                  --     --             --
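
The Begin and End cells in the example output appear to follow a build:chromosome:position pattern. For downstream processing, such a cell could be split into its parts as in the following minimal sketch. The names GenomicPosition and parse_coordinate are hypothetical helpers and not part of HSEnsg2gcoords. The sketch assumes the three-field layout seen in the example, treats -- or an empty cell as a missing value, and assumes that Begin and End are both inclusive when computing the gene span.

```python
from typing import NamedTuple, Optional


class GenomicPosition(NamedTuple):
    """One parsed coordinate cell (hypothetical helper type)."""
    build: str       # genome build, e.g. "GRCh38"
    chromosome: str  # kept as a string, since names such as "X" or "MT" are not numeric
    position: int    # base position on the chromosome


def parse_coordinate(cell: str) -> Optional[GenomicPosition]:
    """Split a 'build:chromosome:position' cell; return None for a missing value."""
    cell = cell.strip()
    if cell in ("", "--"):
        # Assumption: unknown identifiers yield an empty cell or "--".
        return None
    build, chromosome, position = cell.split(":")
    return GenomicPosition(build, chromosome, int(position))


# Values taken from the ENSG00000264525 row of the example output.
begin = parse_coordinate("GRCh38:2:66358249")
end = parse_coordinate("GRCh38:2:66358328")

# Assumption: both coordinates are inclusive, so the span is end - begin + 1.
span = end.position - begin.position + 1
print(begin.build, begin.chromosome, begin.position, span)
```

Keeping the chromosome as a string avoids errors for non-numeric chromosome names, and returning None for missing cells lets the caller decide how to handle rows such as the NA line above.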