Data Integrator
gene2canonexons extracts exon coordinates from the canonical transcript of a human Ensembl gene ID. If no canonical transcript is present, the transcript with the highest number of exons is selected instead. The "Canonical" column in the output indicates which of the two cases applies to each row. gene2canonexons can also write DesignStudio format directly.
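The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not the tool's actual Perl code: the `Transcript` class, the `select_transcript` function and the tie-breaking behaviour (first transcript with the maximum exon count wins) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical stand-in for an Ensembl transcript record.
    transcript_id: str
    is_canonical: bool
    exon_count: int


def select_transcript(transcripts):
    """Return (transcript, canonical_flag) following the documented rule."""
    if not transcripts:
        raise ValueError("gene has no transcripts")
    for transcript in transcripts:
        if transcript.is_canonical:
            return transcript, 1
    # No canonical transcript: fall back to the one with the most exons.
    # Tie-breaking is an assumption (max() keeps the first maximum).
    return max(transcripts, key=lambda t: t.exon_count), 0
```

The returned flag corresponds to the value reported in the "Canonical" column: 1 for the canonical transcript, 0 for the fallback.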
This tool utilizes the Ensembl Perl API.
The input consists of a single column containing the human Ensembl gene ID.
Options applicable to more than a single tool are summarized in common command line options.
As with the other D-Integrator tools, several columns are normally appended to the output file.
The appended columns are:
Transcript ID: the Ensembl ID of the transcript that has been selected.
Canonical: a boolean indicating whether the transcript is the canonical transcript (1) or the transcript with the largest number of exons (0).
Total exons: total number of exons for this transcript.
Exon number: the current exon counter.
Exon coordinates: genomic coordinates indicating the location of the current exon.
gene2canonexons can also write DesignStudio format directly when the --design-studio parameter is supplied. The parameter requires a single argument consisting of three comma-separated strings, respectively:
TargetType
Density
Labels
These strings are arbitrary, can be empty or contain spaces, and are simply replicated as-is for all output lines.
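As a rough illustration of how such an argument breaks down into its three fields, the following Python sketch splits it on commas. The function name `parse_design_studio_arg` is hypothetical, and rejecting an argument that does not contain exactly three fields is an assumption; the documentation does not say how the tool itself reacts to a malformed argument.

```python
def parse_design_studio_arg(arg):
    """Split a --design-studio argument into (TargetType, Density, Labels)."""
    parts = arg.split(",")
    if len(parts) != 3:
        # Assumed behaviour: the real tool may handle this case differently.
        raise ValueError("expected 3 comma-separated strings, got %d" % len(parts))
    target_type, density, labels = parts
    # The strings are kept as-is: they may be empty or contain spaces.
    return target_type, density, labels
```

For example, `parse_design_studio_arg("Exon,Standard,--")` returns `("Exon", "Standard", "--")`, and an argument of `",,"` yields three empty strings.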
Given the input file /tmp/genes.tsv
Ensembl Gene ID
ENSG00000126803
ENSG00000006530
ENSG00000132541
ENSG00000113013
the command
$ ./gene2canonexons.pl -H -c 1 </tmp/genes.tsv
will produce the following output (using Ensembl r71 database):
Ensembl Gene ID  Transcript ID    Canonical  Total exons  Exon number  Exon coordinates
ENSG00000126803  ENST00000394709  0          2            1            GRCh37:14:65002623-65002693
ENSG00000126803  ENST00000394709  0          2            2            GRCh37:14:65007563-65009955
ENSG00000006530  ENST00000473247  0          17           1            GRCh37:7:141251195-141251234
ENSG00000006530  ENST00000473247  0          17           2            GRCh37:7:141255253-141255367
ENSG00000006530  ENST00000473247  0          17           3            GRCh37:7:141261907-141261910
ENSG00000006530  ENST00000473247  0          17           4            GRCh37:7:141292946-141292985
ENSG00000006530  ENST00000473247  0          17           5            GRCh37:7:141296362-141296441
ENSG00000006530  ENST00000473247  0          17           6            GRCh37:7:141301005-141301080
ENSG00000006530  ENST00000473247  0          17           7            GRCh37:7:141310995-141311087
ENSG00000006530  ENST00000473247  0          17           8            GRCh37:7:141313946-141313978
ENSG00000006530  ENST00000473247  0          17           9            GRCh37:7:141315271-141315365
ENSG00000006530  ENST00000473247  0          17           10           GRCh37:7:141321532-141321601
...              ...              ...        ...          ...          ...
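Downstream scripts will typically want to read this output back and use the "Canonical" column and the coordinate string. The Python sketch below shows one way to do that. It assumes the output is tab-separated with a header line (as requested by -H), that coordinates follow the assembly:chromosome:start-end pattern seen in the example, and that start and end are inclusive; the function name `read_exons` and the dictionary keys are made up for the illustration.

```python
import csv
import io

# Two rows copied from the example output; tab separation is an assumption.
SAMPLE = (
    "Ensembl Gene ID\tTranscript ID\tCanonical\tTotal exons\tExon number\tExon coordinates\n"
    "ENSG00000126803\tENST00000394709\t0\t2\t1\tGRCh37:14:65002623-65002693\n"
    "ENSG00000126803\tENST00000394709\t0\t2\t2\tGRCh37:14:65007563-65009955\n"
)


def read_exons(text):
    """Parse the tool's tabular output into a list of dictionaries."""
    exons = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Coordinates look like "GRCh37:14:65002623-65002693".
        assembly, chromosome, span = row["Exon coordinates"].split(":")
        start, end = (int(value) for value in span.split("-"))
        exons.append({
            "gene": row["Ensembl Gene ID"],
            "transcript": row["Transcript ID"],
            "canonical": row["Canonical"] == "1",
            "assembly": assembly,
            "chromosome": chromosome,
            "start": start,
            "end": end,
            "length": end - start + 1,  # assumes inclusive coordinates
        })
    return exons


exons = read_exons(SAMPLE)
# Rows that come from the "most exons" fallback rather than a canonical transcript.
fallback = [exon for exon in exons if not exon["canonical"]]
```

Filtering on the `canonical` key is the programmatic equivalent of using the "Canonical" column to tell the two cases apart.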
The same file can be output in DesignStudio format with the following command:
$ ./gene2canonexons.pl -H -c 1 --design-studio Exon,Standard,-- </tmp/genes.tsv
which yields:
Chromosome,StartCoordinate,StopCoordinate,TargetType,Density,Labels
14,65002623,65002693,Exon,Standard,--
14,65007563,65009955,Exon,Standard,--
7,141251195,141251234,Exon,Standard,--
7,141255253,141255367,Exon,Standard,--
7,141261907,141261910,Exon,Standard,--
7,141292946,141292985,Exon,Standard,--
7,141296362,141296441,Exon,Standard,--
7,141301005,141301080,Exon,Standard,--
7,141310995,141311087,Exon,Standard,--
...
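The relationship between the two output formats can be made explicit with a small Python sketch that turns an "Exon coordinates" value into a DesignStudio line. It is inferred from the two example outputs only and is not the tool's own code; the function name `to_design_studio_line` and the `HEADER` constant are hypothetical.

```python
HEADER = "Chromosome,StartCoordinate,StopCoordinate,TargetType,Density,Labels"


def to_design_studio_line(exon_coordinates, target_type, density, labels):
    """Convert e.g. 'GRCh37:14:65002623-65002693' into a DesignStudio CSV line."""
    # The assembly name is dropped; only chromosome, start and stop are kept.
    _assembly, chromosome, span = exon_coordinates.split(":")
    start, stop = span.split("-")
    # The three user-supplied strings are replicated unchanged on every line.
    return ",".join([chromosome, start, stop, target_type, density, labels])


lines = [HEADER] + [
    to_design_studio_line(coords, "Exon", "Standard", "--")
    for coords in ("GRCh37:14:65002623-65002693", "GRCh37:14:65007563-65009955")
]
```

With the two coordinates above, `lines` reproduces the header and the first two data lines of the example DesignStudio output.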