Data Integrator
gcoords2genes - Genes in the vicinity of a genomic coordinate

For a given location on the human genome, find the genes in the flanking region of this position. The output of this tool is used by other Data Integrator modules, such as the linkage disequilibrium tools SNPGeneGlobalLDChecker (linkage disequilibrium block membership for genomic coordinates) and gcoords2ld (linkage disequilibrium for genomic coordinate/gene pairs).

This tool utilizes the Ensembl Perl API.

Input

Only a single column is needed, which contains the genomic coordinate. A search radius defines the number of base pairs to search upstream and downstream of this coordinate: any gene that is at least partially contained within this region is listed in the output. An optional detail output mode prints the signed distances from the genomic coordinate to the gene start and to the gene end. As an extra option, the gene's distance rank can be determined. The rank is the gene's position in the list of found genes when they are sorted by the absolute distance from the genomic coordinate to the gene's start or end, whichever is closer.
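The "partially contained" rule can be illustrated with a minimal Python sketch. It is not the tool's actual code (the tool itself uses the Ensembl Perl API). The function name overlaps_search_window is hypothetical, and treating the window boundaries as inclusive is an assumption. The coordinates are derived from the example further down this page.

```python
def overlaps_search_window(gene_start, gene_end, position, radius):
    """Return True if the gene lies at least partially inside the search window.

    Assumption: the window is the closed interval [position - radius, position + radius].
    """
    window_start = position - radius
    window_end = position + radius
    # Two closed intervals overlap when each one starts before the other ends.
    return gene_start <= window_end and gene_end >= window_start


query = 72614000   # GRCh37:10:72614000
radius = 78000

# ENSG00000166220 starts 83,005 bp below the query (outside the window) but
# ends 68,843 bp below it (inside the window), so it is still reported.
print(overlaps_search_window(72530995, 72545157, query, radius))  # True

# A hypothetical gene lying entirely more than 78,000 bp away is not reported.
print(overlaps_search_window(72704000, 72709000, query, radius))  # False
```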

Options applicable to more than a single tool are summarized in common command line options.

Output

Depending on the selected options, the following columns are output in the order listed below:

  • Human Ensembl Gene ID: Ensembl gene identifier of the neighboring human gene.
  • Distance to gene: Absolute distance to the gene start or end position, whichever is smaller. It is zero (0) if the query position lies within the gene itself.
  • Distance to gene start: The signed difference between the gene start position and the query position (i.e. the genomic coordinate from the input column), computed as gene start minus query position. The sign is negative if the gene's start coordinate is less than the genomic coordinate. Please note that this does not account for the strand, i.e. the sign cannot be interpreted as upstream or downstream: it depends only on whether the gene's start coordinate is smaller or larger than the query coordinate.
  • Distance to gene end: The signed difference between the gene's end position and the genomic coordinate from the query column, with the same sign convention as for the gene start. See the item above for more information.
  • Strand: The strand on which the gene is located. A plus sign designates the forward strand, a minus sign the reverse strand.
  • Rank: The rank of the gene when all found genes are sorted by their absolute distance to the query genomic coordinate, as given by Distance to gene.
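The column definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the names gene_distances and rank_genes are hypothetical, gene coordinates are assumed to satisfy start <= end on the forward strand, and the order of genes with equal distances is an arbitrary choice of this sketch. The gene coordinates are derived from the example output on this page.

```python
def gene_distances(position, gene_start, gene_end):
    """Return (distance, distance_to_start, distance_to_end) for one gene."""
    to_start = gene_start - position  # negative if the gene starts below the query
    to_end = gene_end - position      # negative if the gene ends below the query
    if to_start <= 0 <= to_end:
        distance = 0                  # the query position lies within the gene
    else:
        distance = min(abs(to_start), abs(to_end))
    return distance, to_start, to_end


def rank_genes(position, genes):
    """genes maps gene ID -> (start, end); returns rows sorted by distance, rank last."""
    rows = [(gene_id, *gene_distances(position, start, end))
            for gene_id, (start, end) in genes.items()]
    rows.sort(key=lambda row: row[1])  # sort by absolute distance to the gene
    return [row + (rank,) for rank, row in enumerate(rows, start=1)]


# Four of the genes from the example, deliberately listed out of order.
example_genes = {
    "ENSG00000233104": (72577323, 72577968),
    "ENSG00000166228": (72642037, 72648541),
    "ENSG00000166224": (72575717, 72640930),
    "ENSG00000237998": (72622528, 72626426),
}

rows = rank_genes(72614000, example_genes)
for row in rows:
    print(*row, sep="\t")
```

The strand is deliberately absent from the calculation, which mirrors the note that the sign reflects only forward-strand coordinates.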

Example

Given the genomic coordinate GRCh37:10:72614000, which is located inside the gene SGPL1, and a search radius of 78,000 bp, the following command

echo $'Gen. Coord.\nGRCh37:10:72614000' | ./gcoords2genes.pl -H -c 1 -r 78000 --details --rank

will find the following genes, their distances and their distance ranks:

Gen. Coord.         Human Ensembl Gene ID  Distance to gene  Distance to gene start  Distance to gene end  Strand  Rank
GRCh37:10:72614000  ENSG00000166224        0                 -38283                  26930                 +       1
GRCh37:10:72614000  ENSG00000237998        8528              8528                    12426                 -       2
GRCh37:10:72614000  ENSG00000166228        28037             28037                   34541                 -       3
GRCh37:10:72614000  ENSG00000233104        36032             -36677                  -36032                +       4
GRCh37:10:72614000  ENSG00000231366        59721             -60014                  -59721                -       5
GRCh37:10:72614000  ENSG00000166220        68843             -83005                  -68843                -       6
GRCh37:10:72614000  ENSG00000237047        75682             75682                   77814                 +       7

This example illustrates the points made above. First, gene ENSG00000166224 is associated with the HGNC symbol SGPL1, so the gene closest to the queried genomic location is SGPL1 itself. Since the coordinate is contained within the gene, the absolute distance, Distance to gene, is zero by definition. This can also be seen from the distances to the gene start and end, which have opposite signs. The next closest gene is ENSG00000237998, a novel antisense gene on the reverse strand. Even though this gene is located on the reverse strand, its distance is measured along the forward strand of the chromosome, so it begins 8,528 bp above the query location (+8,528). In contrast, gene ENSG00000233104 lies below the query position: its nearest end is at -36,032 bp (again measured on the forward strand), as can be inferred from the negative signs of both the gene start and gene end distances.
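The sign reasoning in this paragraph can be written down as a short Python sketch. The helper name relative_position and its return strings are hypothetical and only illustrate how to read the Distance to gene start and Distance to gene end columns. It assumes that the start distance never exceeds the end distance.

```python
def relative_position(to_start, to_end):
    """Interpret the signed start/end distances on forward-strand coordinates."""
    if to_start <= 0 <= to_end:
        return "query inside gene"           # opposite signs (or a zero)
    if to_start > 0:
        return "gene at higher coordinates"  # both distances positive
    return "gene at lower coordinates"       # both distances negative


# Signed distances taken from the example output above.
print(relative_position(-38283, 26930))   # ENSG00000166224 (SGPL1)
print(relative_position(8528, 12426))     # ENSG00000237998
print(relative_position(-36677, -36032))  # ENSG00000233104
```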