Data Integrator
Interval2Genes - Finding genes in an interval

Searches for (protein coding) genes within a given interval on a chromosome. Additionally, it can relate a location to the interval and output information based on this pair, which is interpreted as a SNP/gene pair. This tool is best used with output from Pos2LDBlock - Assignment of closest LD block for a location.

Input

As there are two use cases for this tool, there are two types of input. In the simpler case, two columns holding genomic coordinates describe the interval to be queried for genes. If an additional column with genomic coordinates is provided, it is interpreted as a variation location that relates to the query interval. This location may be inside or outside the interval, and inside or outside any of the genes reported to lie within the interval.

Options applicable to more than a single tool are summarized in common command line options.

Output

For any input row provided to this tool, all genomic coordinates must be on the same chromosome; otherwise empty cells are output.
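The same-chromosome requirement can be illustrated with a minimal sketch. It assumes coordinates are written as assembly:chromosome:position, as in the example on this page (for instance GRCh37:5:125707200). The helper names parse_coordinate and same_chromosome are hypothetical and are not part of the tool.

```python
def parse_coordinate(text):
    """Split 'assembly:chromosome:position' into its parts.

    Returns None for anything that does not match this assumed layout,
    for example 'bad data' or '--'.
    """
    parts = text.strip().split(":")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    assembly, chromosome, position = parts
    return assembly, chromosome, int(position)


def same_chromosome(*coordinates):
    """True only if every coordinate parses and all share assembly and chromosome."""
    parsed = [parse_coordinate(c) for c in coordinates]
    if any(p is None for p in parsed):
        return False
    return len({(p[0], p[1]) for p in parsed}) == 1


ok = same_chromosome("GRCh37:5:125699985", "GRCh37:5:125763912", "GRCh37:5:125707200")
mixed = same_chromosome("GRCh37:5:125699985", "GRCh37:6:125763912")
bad = same_chromosome("bad data", "--")
```

A row for which such a check fails corresponds to the case in which the tool outputs empty cells.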

If only two columns make up the input, a single column termed Ensembl Gene ID is appended to the output, holding the Ensembl gene identifiers of the protein coding genes that intersect with the interval. An empty cell is output if no protein coding gene is found overlapping with the interval.

If a third column specifies SNP locations, the tool outputs three more columns with distance-related information in addition to the Ensembl gene identifier: LD Block Dist (Int2Gene), Dist (Int2Gene), and Dist Rank (Int2Gene). These columns present information on the relative position of the variation to the interval and to the gene.

The following cases are encoded in the distances. (In these rules, emphasis is put on the relationship between the SNP and the gene; the LD block becomes important if the SNP is contained in it.)

  • The variation is contained in the interval and the interval contains protein coding genes. The first column, LD Block Dist (Int2Gene), is then set to zero, and the second column, Dist (Int2Gene), is the distance from the variation location to the gene's begin or end, whichever is smaller. If the variation happens to be located within the gene itself, this distance is also set to zero. One possible ambiguity arises when a gene is not contained in the interval but the SNP location is contained in the gene: here both columns are set to 0 (zero) as well. This ambiguity is resolved when the intervals were retrieved using the tool Pos2LDBlock - Assignment of closest LD block for a location., whose LD covers column provides the necessary information.

    Thus, column LD Block Dist (Int2Gene) can be interpreted as the distance to the gene from the perspective of the LD block: If the gene is contained in it, it is zero, otherwise it's the distance to the gene.

  • The variation is outside the query interval, or the interval query did not return any protein coding gene. The first and the second column both hold the distance to the closest protein coding gene, which may not be in the interval itself.
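The two cases above can be sketched in a few lines of Python. This is an illustration of the rules as described here, not the tool's actual implementation: the function names are hypothetical, the coordinates are toy values, and it assumes inclusive interval bounds and that "the interval contains a gene" means the interval and the gene overlap.

```python
def point_to_gene_distance(pos, gene_begin, gene_end):
    """Distance from a position to a gene; zero if the position lies inside it."""
    if gene_begin <= pos <= gene_end:
        return 0
    return min(abs(pos - gene_begin), abs(pos - gene_end))


def int2gene_distances(snp, interval, gene):
    """Return (LD Block Dist, Dist) for one SNP/interval/gene triple.

    Assumption: all three are on the same chromosome and are given as
    plain integers (snp) or (begin, end) tuples (interval, gene).
    """
    begin, end = interval
    gene_begin, gene_end = gene
    dist = point_to_gene_distance(snp, gene_begin, gene_end)
    snp_in_interval = begin <= snp <= end
    gene_overlaps_interval = gene_begin <= end and begin <= gene_end
    if snp_in_interval and gene_overlaps_interval:
        # First case: the block reaches the gene, so its distance is zero.
        return 0, dist
    # Second case: both columns hold the SNP-to-gene distance.
    return dist, dist


gene = (1000, 2000)  # toy gene coordinates
inside_gene = int2gene_distances(1500, (1200, 1800), gene)
block_covers_gene = int2gene_distances(500, (400, 1100), gene)
block_misses_gene = int2gene_distances(300, (250, 350), gene)
snp_outside_block = int2gene_distances(100, (150, 250), gene)
```

With these toy values, the second call mirrors the "SNP inside LD block, SNP outside gene, block covers gene" situation: the first value is zero and the second is the distance from the SNP to the gene's begin.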

As an example, we take the output file of Pos2LDBlock - Assignment of closest LD block for a location. as input file /tmp/i2g.tsv:

GC                 Comment                                                                                             LD block LD covers LD begin           LD end
GRCh37:5:125707200 Block 5:842, SNP inside LD block, LD block is inside gene GRAMD3                                    5:842    1         GRCh37:5:125699985 GRCh37:5:125763912
GRCh37:5:125699800 Left of block 5:842, no LD block there, SNP is still inside GRAMD3                                  5:4694   0         GRCh37:5:125699599 GRCh37:5:125699745
GRCh37:5:125680000 Block 5:1909, left of GRAMD3, SNP inside LD block, SNP outside gene, block covers gene              5:1909   1         GRCh37:5:125678786 GRCh37:5:125699575
GRCh37:5:125678600 Block 5:4193, left of GRAMD3, SNP in LD block, LD block does not cover gene, GRAMD3 is nearest gene 5:4193   1         GRCh37:5:125678354 GRCh37:5:125678765
GRCh37:5:125676400 SNP is left of block 5:3561, block does not cover gene, nearest gene is GRAMD3                      5:895    0         GRCh37:5:125616528 GRCh37:5:125676392
GRCh37:5:132487144 This is located exactly in between genes HSPA4 and FSTL4.                                           5:2104   1         GRCh37:5:132477615 GRCh37:5:132494146
bad data           This will produce empty cells                                                                       --       --        --                 --

then the command querying for genes in the intervals described by columns 5 and 6 (LD begin and LD end) and relating them to the SNP locations in column 1 (GC)

$ python Interval2Genes.py -H -c 5 6 --snp-pos 1 /tmp/i2g.tsv

will output

GC                 Comment                                                                                             LD block LD covers LD begin           LD end             Ensembl Gene ID LD Block Dist (Int2Gene) Dist (Int2Gene) Dist Rank (Int2Gene)
GRCh37:5:125707200 Block 5:842, SNP inside LD block, LD block is inside gene GRAMD3                                    5:842    1         GRCh37:5:125699985 GRCh37:5:125763912 ENSG00000155324 0                        0               1
GRCh37:5:125699800 Left of block 5:842, no LD block there, SNP is still inside GRAMD3                                  5:4694   0         GRCh37:5:125699599 GRCh37:5:125699745 ENSG00000155324 0                        0               1
GRCh37:5:125680000 Block 5:1909, left of GRAMD3, SNP inside LD block, SNP outside gene, block covers gene              5:1909   1         GRCh37:5:125678786 GRCh37:5:125699575 ENSG00000155324 0                        15824           1
GRCh37:5:125678600 Block 5:4193, left of GRAMD3, SNP in LD block, LD block does not cover gene, GRAMD3 is nearest gene 5:4193   1         GRCh37:5:125678354 GRCh37:5:125678765 ENSG00000155324 17224                    17224           1
GRCh37:5:125676400 SNP is left of block 5:3561, block does not cover gene, nearest gene is GRAMD3                      5:895    0         GRCh37:5:125616528 GRCh37:5:125676392 ENSG00000155324 19424                    19424           1
GRCh37:5:132487144 This is located exactly in between genes HSPA4 and FSTL4.                                           5:2104   1         GRCh37:5:132477615 GRCh37:5:132494146 ENSG00000053108 45003                    45003           1
GRCh37:5:132487144 This is located exactly in between genes HSPA4 and FSTL4.                                           5:2104   1         GRCh37:5:132477615 GRCh37:5:132494146 ENSG00000170606 45003                    45003           2
bad data           This will produce empty cells                                                                       --       --        --                 --                 --              --                       --              --
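Because the tool emits one row per SNP/gene pair, a downstream script may want to keep only the top-ranked gene for each SNP. The following is a minimal sketch under stated assumptions: the output is tab-separated with a header row, empty cells appear as shown above (-- or an empty string), and the sample is an excerpt of the example output reduced to three columns. The function name top_ranked_genes is hypothetical.

```python
import csv
import io

# Excerpt of the example output, reduced to three columns and written
# tab-separated (an assumption about the on-disk format).
SAMPLE = (
    "GC\tEnsembl Gene ID\tDist Rank (Int2Gene)\n"
    "GRCh37:5:125680000\tENSG00000155324\t1\n"
    "GRCh37:5:132487144\tENSG00000053108\t1\n"
    "GRCh37:5:132487144\tENSG00000170606\t2\n"
    "bad data\t--\t--\n"
)


def top_ranked_genes(handle):
    """Map each SNP location to the gene reported with Dist Rank 1."""
    reader = csv.DictReader(handle, delimiter="\t")
    best = {}
    for row in reader:
        rank = row["Dist Rank (Int2Gene)"]
        if not rank.isdigit():
            # Skip rows with empty cells, such as the 'bad data' row.
            continue
        if int(rank) == 1:
            best[row["GC"]] = row["Ensembl Gene ID"]
    return best


result = top_ranked_genes(io.StringIO(SAMPLE))
```

For a real file, the same function could be called with an open file handle instead of the in-memory sample.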