Data Integrator
gene2canonexons - Extract exon coordinates of a Gene

gene2canonexons extracts the exon coordinates of the canonical transcript for a human Ensembl gene ID. If the gene has no canonical transcript, the transcript with the highest number of exons is selected instead. The "Canonical" column in the output indicates which of the two cases applies. gene2canonexons can also write its output directly in DesignStudio format.
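
The selection rule described above can be sketched in Python as follows. This is only an illustration of the rule, not the tool's Perl implementation: the dictionary keys (id, is_canonical, exon_count) and the transcript IDs T1 to T3 are hypothetical, and the tie-breaking behavior is an assumption, since the documentation does not specify it.

```python
from typing import Dict, List, Optional, Tuple


def select_transcript(transcripts: List[Dict]) -> Optional[Tuple[str, int]]:
    """Return (transcript_id, canonical_flag), or None if the gene has no transcripts.

    Each transcript is a dict with the hypothetical keys
    'id', 'is_canonical' and 'exon_count'.
    """
    if not transcripts:
        return None
    # Prefer the canonical transcript when one is present.
    for transcript in transcripts:
        if transcript["is_canonical"]:
            return transcript["id"], 1
    # Otherwise fall back to the transcript with the most exons.
    # Assumption: on a tie the first such transcript is kept (max() behavior).
    best = max(transcripts, key=lambda t: t["exon_count"])
    return best["id"], 0


# Hypothetical transcripts, for illustration only.
with_canonical = [
    {"id": "T1", "is_canonical": False, "exon_count": 5},
    {"id": "T2", "is_canonical": True, "exon_count": 2},
]
without_canonical = [
    {"id": "T1", "is_canonical": False, "exon_count": 5},
    {"id": "T3", "is_canonical": False, "exon_count": 9},
]
print(select_transcript(with_canonical))     # ('T2', 1)
print(select_transcript(without_canonical))  # ('T3', 0)
```

The second element of the returned pair corresponds to the value reported in the "Canonical" column.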

This tool uses the Ensembl Perl API.

Input

The input consists of a single column containing the human Ensembl gene ID.

Options applicable to more than a single tool are summarized in common command line options.

Output

As with the other D-Integrator tools, the output normally consists of the input with several columns appended.

The appended columns are:

  • Transcript ID: Ensembl transcript ID which has been selected.
  • Canonical: a flag indicating whether the selected transcript is the canonical transcript (1) or the transcript with the largest number of exons (0).
  • Total exons: total number of exons for this transcript.
  • Exon number: the index of the current exon within the transcript.
  • Exon coordinates: genomic coordinates of the current exon, including the assembly name (for example GRCh37:14:65002623-65002693).
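
For downstream processing, the "Exon coordinates" value can be split into its parts. The following is a minimal Python sketch, assuming the assembly:chromosome:start-end layout seen in the example output of this page and assuming that start and end are both inclusive; the names ExonCoord and parse_exon_coordinates are hypothetical helpers, not part of the tool.

```python
from typing import NamedTuple


class ExonCoord(NamedTuple):
    assembly: str
    chromosome: str
    start: int
    end: int


def parse_exon_coordinates(text: str) -> ExonCoord:
    """Split a value such as 'GRCh37:14:65002623-65002693' into its fields.

    Assumes exactly three colon-separated parts; a ValueError is raised
    by the unpacking otherwise.
    """
    assembly, chromosome, span = text.strip().split(":")
    start, end = span.split("-")
    return ExonCoord(assembly, chromosome, int(start), int(end))


coord = parse_exon_coordinates("GRCh37:14:65002623-65002693")
# Exon length under the assumption of inclusive coordinates.
print(coord.chromosome, coord.end - coord.start + 1)  # 14 71
```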

gene2canonexons can also write in DesignStudio format directly when the --design-studio parameter is supplied. The parameter requires a single argument consisting of three comma-separated strings, in this order:

  • TargetType
  • Density
  • Labels

These strings are arbitrary: they can be empty or contain spaces, and they are replicated as-is on every output line.
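
To illustrate how the three strings end up on each output line, here is a hedged Python sketch that builds one DesignStudio line from an exon coordinate and the --design-studio argument. It is not the tool's Perl code; the function name design_studio_line is hypothetical, and the column order is taken from the header line shown in the Example section.

```python
def design_studio_line(coordinates: str, option: str) -> str:
    """Build one line: Chromosome,StartCoordinate,StopCoordinate,TargetType,Density,Labels.

    'coordinates' is an exon coordinate such as 'GRCh37:14:65002623-65002693';
    'option' is the --design-studio argument, e.g. 'Exon,Standard,--'.
    """
    parts = option.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated strings")
    target_type, density, labels = parts  # copied verbatim, may be empty
    _assembly, chromosome, span = coordinates.split(":")
    start, stop = span.split("-")
    return ",".join([chromosome, start, stop, target_type, density, labels])


print(design_studio_line("GRCh37:14:65002623-65002693", "Exon,Standard,--"))
# 14,65002623,65002693,Exon,Standard,--
```

Because the strings are copied unchanged, an argument such as ",," would simply produce three empty trailing fields in this sketch.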

Example

Given the input file /tmp/genes.tsv

Ensembl Gene ID
ENSG00000126803
ENSG00000006530
ENSG00000132541
ENSG00000113013

the command

$ ./gene2canonexons.pl -H -c 1 </tmp/genes.tsv

will produce the following output (using Ensembl r71 database):

Ensembl Gene ID Transcript ID   Canonical Total exons Exon number Exon coordinates
ENSG00000126803 ENST00000394709 0         2           1           GRCh37:14:65002623-65002693
ENSG00000126803 ENST00000394709 0         2           2           GRCh37:14:65007563-65009955
ENSG00000006530 ENST00000473247 0         17          1           GRCh37:7:141251195-141251234
ENSG00000006530 ENST00000473247 0         17          2           GRCh37:7:141255253-141255367
ENSG00000006530 ENST00000473247 0         17          3           GRCh37:7:141261907-141261910
ENSG00000006530 ENST00000473247 0         17          4           GRCh37:7:141292946-141292985
ENSG00000006530 ENST00000473247 0         17          5           GRCh37:7:141296362-141296441
ENSG00000006530 ENST00000473247 0         17          6           GRCh37:7:141301005-141301080
ENSG00000006530 ENST00000473247 0         17          7           GRCh37:7:141310995-141311087
ENSG00000006530 ENST00000473247 0         17          8           GRCh37:7:141313946-141313978
ENSG00000006530 ENST00000473247 0         17          9           GRCh37:7:141315271-141315365
ENSG00000006530 ENST00000473247 0         17          10          GRCh37:7:141321532-141321601
...             ...             ...       ...         ...         ...
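
The "Canonical" column makes it easy to find the genes for which the fallback transcript was used. The sketch below assumes that the output is tab-separated with a header line, as the .tsv input file name suggests; the function name non_canonical_genes is hypothetical, and the sample text reuses two rows from the output above.

```python
import csv
import io
from typing import List


def non_canonical_genes(tsv_text: str) -> List[str]:
    """Return gene IDs whose selected transcript is not canonical (Canonical == 0).

    Gene IDs are returned once each, in order of first appearance.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    genes: List[str] = []
    for row in reader:
        gene = row["Ensembl Gene ID"]
        if row["Canonical"] == "0" and gene not in genes:
            genes.append(gene)
    return genes


header = ["Ensembl Gene ID", "Transcript ID", "Canonical",
          "Total exons", "Exon number", "Exon coordinates"]
rows = [
    ["ENSG00000126803", "ENST00000394709", "0", "2", "1", "GRCh37:14:65002623-65002693"],
    ["ENSG00000126803", "ENST00000394709", "0", "2", "2", "GRCh37:14:65007563-65009955"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"
print(non_canonical_genes(sample))  # ['ENSG00000126803']
```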

The same input can be written in DesignStudio format with the following command:

$ ./gene2canonexons.pl -H -c 1 --design-studio Exon,Standard,-- </tmp/genes.tsv

which yields:

Chromosome,StartCoordinate,StopCoordinate,TargetType,Density,Labels
14,65002623,65002693,Exon,Standard,--
14,65007563,65009955,Exon,Standard,--
7,141251195,141251234,Exon,Standard,--
7,141255253,141255367,Exon,Standard,--
7,141261907,141261910,Exon,Standard,--
7,141292946,141292985,Exon,Standard,--
7,141296362,141296441,Exon,Standard,--
7,141301005,141301080,Exon,Standard,--
7,141310995,141311087,Exon,Standard,--
...