Data Integrator
This tool maps Ensembl gene IDs to gene start and end positions for Homo sapiens.
The input to this tool is a data column of Ensembl gene identifiers. Options shared by more than one tool are summarized in common command line options.
The program appends at least two columns to each row of the dataset: the begin and end genomic coordinates of the gene in the current genome build. These columns are named Begin and End, respectively. Optionally, two more columns can be appended, based on user choice, in the following order:
--inc-strand
Include the strand on which the gene is located: + for the forward strand or - for the reverse strand. The column is called Strand.
--inc-biotype
Include the type of gene product, which Ensembl calls the biotype. The appended column is named Biotype.
An unknown Ensembl gene identifier produces empty cells (the example output below shows them as --).
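The column-appending behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual implementation: the GENES lookup table, the annotate function, and its missing parameter are hypothetical names, and the two table entries simply reuse values from this page's example output.

```python
# Hypothetical in-memory lookup table (an assumption for illustration);
# the values are copied from the example output on this page.
GENES = {
    "ENSG00000233705": ("GRCh38:7:107653968", "GRCh38:7:107662151", "-", "antisense"),
    "ENSG00000231601": ("GRCh38:10:743992", "GRCh38:10:744958", "+", "lincRNA"),
}


def annotate(row, col, inc_strand=False, inc_biotype=False, missing=""):
    """Return a copy of `row` with Begin and End appended.

    `col` is the 1-based index of the column holding the Ensembl gene ID.
    Strand and Biotype are appended, in that order, only when requested.
    Unknown identifiers yield `missing` in every appended cell.
    """
    begin, end, strand, biotype = GENES.get(row[col - 1], (missing,) * 4)
    extra = [begin, end]
    if inc_strand:
        extra.append(strand)
    if inc_biotype:
        extra.append(biotype)
    return list(row) + extra


print(annotate(["ENSG00000231601"], 1, inc_strand=True, inc_biotype=True))
print(annotate(["NA"], 1, inc_strand=True, inc_biotype=True, missing="--"))
```

The optional columns always follow Begin and End in a fixed order, so downstream code can rely on column position regardless of which flags were given.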
Let us assume the input file is stored in /tmp/ensg.tsv:
Ensg name
ENSG00000233705
ENSG00000231601
ENSG00000264525
ENSG00000118137
ENSG00000281133
ENSG00000250761
NA
The command line
$ python HSEnsg2gcoords.py -H -c 1 --inc-strand --inc-biotype /tmp/ensg.tsv
tells the module that the input file has a header (-H) and that the column with Ensembl gene IDs is the first one (-c 1). Furthermore, the output should include the strand and the biotype information. This command produces the following output:
Ensg name        Begin                End                  Strand  Biotype
ENSG00000233705  GRCh38:7:107653968   GRCh38:7:107662151   -       antisense
ENSG00000231601  GRCh38:10:743992     GRCh38:10:744958     +       lincRNA
ENSG00000264525  GRCh38:2:66358249    GRCh38:2:66358328    -       miRNA
ENSG00000118137  GRCh38:11:116835751  GRCh38:11:116837950  -       protein_coding
ENSG00000281133  GRCh38:1:45580892    GRCh38:1:45580996    -       pseudogene
ENSG00000250761  GRCh38:5:7707802     GRCh38:5:7749766     -       antisense
NA               --                   --                   --      --
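The Begin and End cells in this output pack three fields into one string, which appear to be genome build, chromosome, and position, separated by colons. A minimal sketch of how downstream code might unpack them follows. The function names parse_coord and gene_length are hypothetical, and the length calculation assumes that both coordinates are inclusive, which this page does not state.

```python
def parse_coord(cell):
    """Split 'build:chromosome:position' into (build, chromosome, position).

    Returns None for cells that do not have three colon-separated fields,
    such as the '--' placeholder shown for unknown identifiers.
    """
    parts = cell.split(":")
    if len(parts) != 3:
        return None
    build, chrom, pos = parts
    # The chromosome stays a string so that names such as X or MT are kept.
    return build, chrom, int(pos)


def gene_length(begin_cell, end_cell):
    """Return the gene length in base pairs, or None if a cell is a placeholder."""
    begin = parse_coord(begin_cell)
    end = parse_coord(end_cell)
    if begin is None or end is None:
        return None
    # Assumption: Begin and End are both inclusive, hence the +1.
    return end[2] - begin[2] + 1


print(parse_coord("GRCh38:10:743992"))                       # ('GRCh38', '10', 743992)
print(gene_length("GRCh38:10:743992", "GRCh38:10:744958"))   # 967
print(gene_length("--", "--"))                               # None
```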