Data Integrator
Searches for (protein-coding) genes within a given interval on a chromosome. It can additionally relate a location to the interval and output information based on this pair, which is interpreted as a SNP/gene pair. This tool is best used with output from Pos2LDBlock - Assignment of closest LD block for a location.
As there are two use cases for this tool, there are two types of input. In the simpler case, two columns holding genomic coordinates describe the interval to be queried for genes. If an additional column with genomic coordinates is provided, it is interpreted as a variation location that relates to the query interval. This location may be inside or outside the interval, and inside or outside any of the genes reported to lie within the interval.
Options applicable to more than a single tool are summarized in common command line options.
For any input row provided to this tool, all genomic coordinates must be on the same chromosome; otherwise empty cells are output.
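The same-chromosome requirement can be illustrated with a minimal sketch. It assumes coordinates use the `assembly:chromosome:position` form seen in the example further down (for instance `GRCh37:5:125707200`). The helpers `parse_coord` and `same_chromosome` are hypothetical names introduced here for illustration and are not part of the tool.

```python
from typing import NamedTuple, Optional


class Coord(NamedTuple):
    assembly: str
    chromosome: str
    position: int


def parse_coord(cell: str) -> Optional[Coord]:
    """Parse a cell such as 'GRCh37:5:125707200'; return None if malformed."""
    parts = cell.strip().split(":")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return Coord(parts[0], parts[1], int(parts[2]))


def same_chromosome(*cells: str) -> bool:
    """True only if every cell parses and all share assembly and chromosome."""
    coords = [parse_coord(c) for c in cells]
    if any(c is None for c in coords):
        return False
    return len({(c.assembly, c.chromosome) for c in coords}) == 1


print(same_chromosome("GRCh37:5:125699985", "GRCh37:5:125763912"))  # True
print(same_chromosome("GRCh37:5:125699985", "GRCh37:6:125763912"))  # False
print(same_chromosome("bad data", "--"))                            # False
```

A row failing this check would be the kind of row for which empty cells are emitted.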
If only two columns make up the input, a single column named Ensembl Gene ID is appended to the output, holding the Ensembl gene identifiers of the protein-coding genes that intersect with the interval. An empty cell is output if no protein-coding gene overlaps the interval.
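The interval query amounts to an overlap test between the query interval and each gene's extent. The sketch below shows that test on a toy annotation. The `Gene` record, the function `genes_in_interval`, and the gene entries are all invented for illustration; how the tool actually looks up its annotation is not described here.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    begin: int
    end: int


def genes_in_interval(begin: int, end: int, genes: List[Gene]) -> List[str]:
    """Return IDs of genes sharing at least one position with [begin, end]."""
    lo, hi = min(begin, end), max(begin, end)
    return [g.gene_id for g in genes if g.begin <= hi and lo <= g.end]


# Toy annotation, invented for illustration only (not real gene data).
TOY_GENES = [
    Gene("GENE_A", 100, 200),
    Gene("GENE_B", 180, 400),
    Gene("GENE_C", 900, 950),
]

print(genes_in_interval(150, 190, TOY_GENES))  # ['GENE_A', 'GENE_B']
print(genes_in_interval(500, 600, TOY_GENES))  # [] -> would become an empty cell
```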
If a third column specifies SNP locations, the tool outputs, in addition to the Ensembl gene identifier, three more columns with distance-related information: LD Block Dist (Int2Gene), Dist (Int2Gene), and Dist Rank (Int2Gene). These columns describe the position of the variation relative to the interval and to the gene.
The following cases are encoded in the distances. (These rules emphasize the relationship between the SNP and the gene; the LD block becomes important if the SNP is contained in it.)
The variation is contained in the interval and the interval contains protein-coding genes. The first column is then set to zero, and the second column is the distance from the variation location to the gene's begin or end, whichever is smaller. If the variation happens to be located within the gene itself, this distance is also set to zero. One possible ambiguity arises when a gene is not contained in an interval but the SNP location is contained in the gene: here both columns are set to 0 (zero). This ambiguity is resolved when the intervals were retrieved using the tool Pos2LDBlock - Assignment of closest LD block for a location, where the column LD covers provides the necessary information.
Thus, column LD Block Dist (Int2Gene) can be interpreted as the distance to the gene from the perspective of the LD block: if the gene is contained in the block, it is zero; otherwise it is the distance to the gene.
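The distance rules can be sketched as two small functions. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical; the gene begin of 125695824 is inferred from the example output further down (125680000 + 15824); the gene end is a placeholder, not a real annotation value; and the fallback in `ld_block_distance` (reporting the SNP-to-gene distance when the block does not overlap the gene) is an assumption chosen because it reproduces the example rows.

```python
def snp_gene_distance(snp: int, gene_begin: int, gene_end: int) -> int:
    """Dist (Int2Gene): 0 inside the gene, else distance to the nearer gene end."""
    if gene_begin <= snp <= gene_end:
        return 0
    return min(abs(snp - gene_begin), abs(snp - gene_end))


def ld_block_distance(snp: int, block_begin: int, block_end: int,
                      gene_begin: int, gene_end: int) -> int:
    """LD Block Dist (Int2Gene): 0 if block and gene overlap.

    Assumption: otherwise the SNP-to-gene distance is reported, which matches
    the example rows but is not stated explicitly in the text.
    """
    if block_begin <= gene_end and gene_begin <= block_end:
        return 0
    return snp_gene_distance(snp, gene_begin, gene_end)


GENE_BEGIN = 125_695_824  # inferred from the example output (assumption)
GENE_END = 125_800_000    # placeholder end, NOT a real annotation value

# (SNP, block begin, block end) taken from rows of the example input.
rows = [
    (125_707_200, 125_699_985, 125_763_912),  # SNP in block, block in gene
    (125_680_000, 125_678_786, 125_699_575),  # SNP in block, block overlaps gene
    (125_678_600, 125_678_354, 125_678_765),  # block does not reach the gene
]
for snp, b_begin, b_end in rows:
    print(ld_block_distance(snp, b_begin, b_end, GENE_BEGIN, GENE_END),
          snp_gene_distance(snp, GENE_BEGIN, GENE_END))
```

Under these assumptions the three rows yield the pairs (0, 0), (0, 15824), and (17224, 17224), in line with the corresponding rows of the example output.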
As an example, we take the output file of Pos2LDBlock - Assignment of closest LD block for a location as input file /tmp/i2g.tsv:
GC	Comment	LD block	LD covers	LD begin	LD end
GRCh37:5:125707200	Block 5:842, SNP inside LD block, LD block is inside gene GRAMD3	5:842	1	GRCh37:5:125699985	GRCh37:5:125763912
GRCh37:5:125699800	Left of block 5:842, no LD block there, SNP is still inside GRAMD3	5:4694	0	GRCh37:5:125699599	GRCh37:5:125699745
GRCh37:5:125680000	Block 5:1909, left of GRAMD3, SNP inside LD block, SNP outside gene, block covers gene	5:1909	1	GRCh37:5:125678786	GRCh37:5:125699575
GRCh37:5:125678600	Block 5:4193, left of GRAMD3, SNP in LD block, LD block does not cover gene, GRAMD3 is nearest gene	5:4193	1	GRCh37:5:125678354	GRCh37:5:125678765
GRCh37:5:125676400	SNP is left of block 5:3561, block does not cover gene, nearest gene is GRAMD3	5:895	0	GRCh37:5:125616528	GRCh37:5:125676392
GRCh37:5:132487144	This is located exactly in between genes HSPA4 and FSTL4.	5:2104	1	GRCh37:5:132477615	GRCh37:5:132494146
bad data	This will produce empty cells	--	--	--	--
then the command querying for genes in the intervals described by columns 5 and 6 (LD begin and LD end) and relating them to the SNP location in column 1 (GC)
$ python Interval2Genes.py -H -c 5 6 --snp-pos 1 /tmp/i2g.tsv
will output
GC	Comment	LD block	LD covers	LD begin	LD end	Ensembl Gene ID	LD Block Dist (Int2Gene)	Dist (Int2Gene)	Dist Rank (Int2Gene)
GRCh37:5:125707200	Block 5:842, SNP inside LD block, LD block is inside gene GRAMD3	5:842	1	GRCh37:5:125699985	GRCh37:5:125763912	ENSG00000155324	0	0	1
GRCh37:5:125699800	Left of block 5:842, no LD block there, SNP is still inside GRAMD3	5:4694	0	GRCh37:5:125699599	GRCh37:5:125699745	ENSG00000155324	0	0	1
GRCh37:5:125680000	Block 5:1909, left of GRAMD3, SNP inside LD block, SNP outside gene, block covers gene	5:1909	1	GRCh37:5:125678786	GRCh37:5:125699575	ENSG00000155324	0	15824	1
GRCh37:5:125678600	Block 5:4193, left of GRAMD3, SNP in LD block, LD block does not cover gene, GRAMD3 is nearest gene	5:4193	1	GRCh37:5:125678354	GRCh37:5:125678765	ENSG00000155324	17224	17224	1
GRCh37:5:125676400	SNP is left of block 5:3561, block does not cover gene, nearest gene is GRAMD3	5:895	0	GRCh37:5:125616528	GRCh37:5:125676392	ENSG00000155324	19424	19424	1
GRCh37:5:132487144	This is located exactly in between genes HSPA4 and FSTL4.	5:2104	1	GRCh37:5:132477615	GRCh37:5:132494146	ENSG00000053108	45003	45003	1
GRCh37:5:132487144	This is located exactly in between genes HSPA4 and FSTL4.	5:2104	1	GRCh37:5:132477615	GRCh37:5:132494146	ENSG00000170606	45003	45003	2
bad data	This will produce empty cells	--	--	--	--	--	--	--	--
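For downstream use, the output can be read back as tab-separated text. The sketch below keeps one gene per SNP by selecting rows whose Dist Rank (Int2Gene) is 1. It assumes the output is tab-separated with a header row, that `--` marks an empty cell as in the example, and that rank 1 denotes the top-ranked gene for a SNP; the meaning of the rank is not spelled out in the text above. The sample rows are abridged from the example output (only the columns needed are kept), and `top_ranked` is a hypothetical helper.

```python
import csv
import io

# Rows abridged from the example output; tab-separated, header included.
SAMPLE = (
    "GC\tEnsembl Gene ID\tLD Block Dist (Int2Gene)\tDist (Int2Gene)\tDist Rank (Int2Gene)\n"
    "GRCh37:5:125680000\tENSG00000155324\t0\t15824\t1\n"
    "GRCh37:5:132487144\tENSG00000053108\t45003\t45003\t1\n"
    "GRCh37:5:132487144\tENSG00000170606\t45003\t45003\t2\n"
    "bad data\t--\t--\t--\t--\n"
)


def top_ranked(handle):
    """Return (SNP, gene ID) pairs for rows with Dist Rank (Int2Gene) == 1."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["GC"], row["Ensembl Gene ID"])
        for row in reader
        if row["Dist Rank (Int2Gene)"] == "1"  # '--' rows are skipped here
    ]


for snp, gene in top_ranked(io.StringIO(SAMPLE)):
    print(snp, gene)
```

With a real output file, `open(path, newline="")` would take the place of the `io.StringIO` wrapper.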