Data Integrator
gcoords2ld - Linkage disequilibrium for genomic coordinate/gene pairs

Compute LD for a genomic coordinate / gene-ID or a genomic coordinate / genomic coordinate pair. This tool stands in contrast to SNPGeneGlobalLDChecker - Linkage disequilibrium block membership for genomic coordinates, where such a pair is checked for inclusion in an LD block. Here, based on a selected population from either HapMap or 1000 Genomes, we compute LD values for a given genomic coordinate pointing to a dbSNP entry and any dbSNP covered by a gene. Based on the desired output, either the highest r2 or D' value for all possible pairs of genomic coordinate is output. If chosen to only compute pairwise LD values, these are output as computed by Ensembl. We want to emphasize that Ensembl has several filters that may inhibit output of an LD value, even if the SNP is known in the chosen population:

  • The two SNPs are separated by more than 100,000 base bairs.
  • Number of haplotypes is less than 40.
  • The r2 measure for LD is less than 0.05 (!).
  • Either D' or r2 are larger than 1.

This tool utilizes the Ensembl Perl API.

Input

Input consists of two columns, one for the genomic coordinate and the other for the Ensembl Gene ID (ENSGXXXXXXXXXXX, where X is a series of digits), together making up the dbSNP/gene pair. Alternatively, the second column may also keep a genomic coordinate, indicating that pairwise LD should be computed only. This is specified by the –in option, which is set to either ensg, or gcoords. Clearly, the genomic coordinate must point to a SNP which is in present in the chosen population, otherwise an empty cell is output in the LD column(s). A population must be specified, based on which LD computations are performed. It is not advised to omit selection of the population, as this implies LD calculation of all possible populations currently available, a task which can take a very long time.

Options applicable to more than a single tool are summarized in common command line options.

Output

LD computation is output as a new column. Based on the selected metric, the column header is either LD r2 (POPULATION) or LD Dprime (POPULATION), where POPULATION stands for the selected population from either HapMap or 1000Genomes, or all if no population was selected (not recommended).

It is possible to compute both LD metrics, r2 and D' in a single run of the tool. In this case, the largest value of r2 will be output in the LD r2 column and the largest value of D' will be output to the LD Dprime column, just as if the tool were run twice.

No LD computation is carried out if the SNP is too far from the gene in question (as of 2012-12-11: 100,000 bp). If the genomic coordinate does not specify a valid dbSNP entry or if the dbSNP entry has not been used in HapMap/1000 Genomes, LD computation cannot be done. (See introduction.) In any case, an empty cell and a warning message will be output.

Example

Given the input file /tmp/ld.tsv

dbSNP ID     Ensembl gene ID  Genomic coordinate  Comment line
rs10492397   ENSG00000139618  GRCh37:13:32976358  Specification example
rs118166422  ENSG00000123240  GRCh37:10:97416972  60Mb distance
rs118166422  ENSG00000139618  GRCh37:10:97416972  Diff chrom
Inexistent   ENSG00000139618  GRCh37:10:32974518  No dbSNP entry

the command

chris-cmd$ ./gcoords2ld.pl -H -c 3,2 –r2 -p 1000GENOMES:phase_1_CEU </tmp/ld.tsv

will produce the following output (using Ensembl r68 database):

dbSNP ID     Ensembl gene ID  Genomic coordinate  Comment line           LD r2 (1000GENOMES:phase_1_CEU)
rs10492397   ENSG00000139618  GRCh37:13:32976358  Specification example  0.720
rs118166422  ENSG00000123240  GRCh37:10:97416972  60Mb distance          --
rs118166422  ENSG00000139618  GRCh37:10:97416972  Diff chrom             --
Inexistent   ENSG00000139618  GRCh37:10:32974518  No dbSNP entry         --

with the following warnings:

gcoords2ld.pl WARN: ENOTFOUND(GRCh37:10:97416972/ENSG00000123240): couldn't compute LD for ``GRCh37:10:97416972/ENSG00000123240'' pair
gcoords2ld.pl WARN: ENOTFOUND(GRCh37:10:97416972/ENSG00000139618): couldn't compute LD for ``GRCh37:10:97416972/ENSG00000139618'' pair
gcoords2ld.pl WARN: ENOTFOUND(ENSG00000139618): No feature/s at coordinate ``GRCh37:10:32974518''