Data Integrator
For a given location on the human genome, find genes in the flanking region of this position. The output of this tool is used by other Data Integrator modules, such as the linkage disequilibrium tools SNPGeneGlobalLDChecker (linkage disequilibrium block membership for genomic coordinates) and gcoords2ld (linkage disequilibrium for genomic coordinate/gene pairs).
This tool utilizes the Ensembl Perl API.
Only a single input column is needed, which contains the genomic coordinate. A search radius defines the number of base pairs to search upstream and downstream of the coordinate: if any gene is even partially contained within this region, it is listed in the output. An optional detail output mode prints the signed distances from the genomic coordinate to the gene start and end. As an extra option, the gene's distance rank can be determined. This is the gene's position among the genes found when they are sorted by the absolute distance from the genomic coordinate to the gene start or end, whichever is smaller.
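The "even partially contained" rule can be illustrated with a short Python sketch. This is not the tool's Perl implementation: the function name overlaps_window, the inclusive window boundaries and the coordinates in the usage lines are assumptions made for illustration only.

```python
def overlaps_window(gene_start, gene_end, pos, radius):
    """Return True if any part of the gene lies within pos +/- radius.

    Assumes gene_start <= gene_end (forward-strand coordinates) and an
    inclusive window; the exact boundary handling of the real tool is
    not documented here and is an assumption.
    """
    window_start = pos - radius
    window_end = pos + radius
    # Two intervals overlap if each one starts before the other one ends.
    return gene_start <= window_end and gene_end >= window_start


# Hypothetical coordinates: query at 1,000,000 with a 78,000 bp radius.
pos, radius = 1_000_000, 78_000
print(overlaps_window(917_000, 931_000, pos, radius))  # True: gene end reaches into the window
print(overlaps_window(900_000, 921_999, pos, radius))  # False: gene ends 1 bp outside the window
```

A gene that starts outside the window but ends inside it is therefore still reported, which matters when interpreting the signed start and end distances in the detail output.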
Options applicable to more than a single tool are summarized in common command line options.
Depending on the selected options, the following columns are output in the order listed below:
Human Ensembl Gene ID
Ensembl gene identifier for the neighboring human gene.
Distance to gene
Absolute distance to the gene start or end position, whichever is lower. It is zero (0) if the query position is within the gene itself.
Distance to gene start
The signed difference between the gene start position and the query position (i.e. the genomic coordinate from the input column). The sign is negative if the gene's start coordinate is less than the genomic coordinate. Please note that this does not account for the strand, i.e. the sign cannot be interpreted as upstream or downstream: it is always negative if the gene's start coordinate is smaller than the genomic coordinate from the query.
Distance to gene end
The signed difference between the gene's end position and the genomic coordinate from the query column. See the item above for more information.
Strand
The strand on which the gene is located. A plus sign designates the forward strand, a minus sign the reverse strand.
Rank
The rank of the gene when sorted by its absolute distance to the query genomic coordinate, given by Distance to gene.
Given the genomic coordinate GRCh37:10:72614000, which is located inside gene SGPL1, and a search radius of 78,000 bp, the following command
echo $'Gen. Coord.\nGRCh37:10:72614000' | ./gcoords2genes.pl -H -c 1 -r 78000 --details --rank
will find the following genes, their distances and their distance ranks:
Gen. Coord.         Human Ensembl Gene ID  Distance to gene  Distance to gene start  Distance to gene end  Strand  Rank
GRCh37:10:72614000  ENSG00000166224        0                 -38283                  26930                 +       1
GRCh37:10:72614000  ENSG00000237998        8528              8528                    12426                 -       2
GRCh37:10:72614000  ENSG00000166228        28037             28037                   34541                 -       3
GRCh37:10:72614000  ENSG00000233104        36032             -36677                  -36032                +       4
GRCh37:10:72614000  ENSG00000231366        59721             -60014                  -59721                -       5
GRCh37:10:72614000  ENSG00000166220        68843             -83005                  -68843                -       6
GRCh37:10:72614000  ENSG00000237047        75682             75682                   77814                 +       7
This example illustrates the points mentioned above. First, gene ENSG00000166224 is associated with HGNC symbol SGPL1, so the closest gene to the queried genomic location is SGPL1 itself. Since the coordinate is contained within the gene, the absolute distance, Distance to gene, is by definition set to zero. This can also be seen from the distances to the gene start and end, as they have opposite signs. The next closest gene is ENSG00000237998, a novel antisense gene on the reverse strand. Even though this gene is located on the reverse strand, its distance is measured with respect to the forward strand of the chromosome, so it starts 8,528 bp further up from the query location (+8,528). On the other hand, gene ENSG00000233104 lies 36,032 bp before the query position (measured on the forward strand, as always), as can be inferred from the negative signs of its gene start/end distances.
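The relationship between the columns in this example can be checked with a short Python sketch. It takes the signed start and end distances from the example output as input and derives Distance to gene and Rank from them. The helper names are hypothetical, and breaking ties by input order is an assumption, since the example contains no ties; this is not the tool's own code.

```python
# Signed distances (to gene start, to gene end) relative to
# GRCh37:10:72614000, copied from the example output.
signed = {
    "ENSG00000166224": (-38283, 26930),
    "ENSG00000237998": (8528, 12426),
    "ENSG00000166228": (28037, 34541),
    "ENSG00000233104": (-36677, -36032),
    "ENSG00000231366": (-60014, -59721),
    "ENSG00000166220": (-83005, -68843),
    "ENSG00000237047": (75682, 77814),
}


def distance_to_gene(d_start, d_end):
    """Absolute distance to the nearer gene boundary.

    Returns 0 if the query lies inside the gene, i.e. the start distance
    is not positive and the end distance is not negative.
    Assumes d_start <= d_end (forward-strand coordinates).
    """
    if d_start <= 0 <= d_end:
        return 0
    return min(abs(d_start), abs(d_end))


def rank_genes(signed_distances):
    """Return (ranks, distances), both keyed by gene ID.

    Ranks are 1-based and ascending in absolute distance; ties keep
    their input order (an assumption).
    """
    dist = {gene: distance_to_gene(*d) for gene, d in signed_distances.items()}
    ordered = sorted(dist, key=dist.get)  # stable sort
    ranks = {gene: i + 1 for i, gene in enumerate(ordered)}
    return ranks, dist


ranks, dist = rank_genes(signed)
for gene in sorted(ranks, key=ranks.get):
    print(gene, dist[gene], ranks[gene])
```

The printed distances and ranks match the Distance to gene and Rank columns of the example, including the zero distance for SGPL1 (ENSG00000166224), whose start and end distances have opposite signs.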