Data Integrator
gcoords2snp - Conversion of genomic coordinates to dbSNP IDs

This small program maps human genomic coordinates to dbSNP IDs (rs*). It may happen that a coordinate does not map to a unique dbSNP ID, and/or the dbSNP ID does not pass Ensembl quality check (QC). Such cases can be filtered by this program.

This tool utilizes the Ensembl Perl API.

Input

The program expects a column for the genomic coordinate. By default, gcoords2snp excludes SNPs that failed QC; such SNPs usually map to multiple, different locations. The --inc-failed command line option includes these SNPs in the output.

In addition, the --unique flag excludes all SNPs that do not, in turn, map back to a single coordinate.

The --inc-failed and --unique options are not mutually exclusive and can be combined.

Options applicable to more than a single tool are summarized in common command line options.

Output

If the genomic coordinate cannot be mapped, an empty cell is output. Likewise, if SNPs mapping to multiple locations are omitted with --unique, an empty cell is output and a warning is written to the log. By default, it is not possible to distinguish a genomic coordinate that does not map to any SNP from one that maps only to SNPs that failed Ensembl QC; no warning is generated in such cases. To tell the two apart, supply the --inc-failed command line option: it adds a "Failed (Ensembl QC)" column, which holds a 1 if there is a dbSNP ID that has not passed the Ensembl QC, and a 0 if there is a dbSNP ID that has passed it. If there is no dbSNP ID at all, this column holds the empty cell identifier.
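The interaction of the two options can be modelled with a short Python sketch. This is only an illustration of the behaviour described on this page, not the tool's actual code (which uses the Ensembl Perl API): the names Hit, HITS and report are hypothetical, the location counts are read off the round-trip listing in the Examples section, and the log warning issued with --unique is not modelled.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    rsid: str
    failed: bool      # True if the SNP failed Ensembl QC
    n_locations: int  # number of coordinates the SNP maps back to

EMPTY = "--"  # empty cell identifier as shown in the examples

# Candidate SNPs per query coordinate, taken from the Examples section.
HITS = {
    "GRCh37:1:13668": [
        Hit("rs2691328", True, 4),
        Hit("rs67465876", True, 2),
        Hit("rs80290005", True, 3),
    ],
    "GRCh37:2:114357350": [
        Hit("rs67465876", True, 2),
        Hit("rs73955041", True, 1),
        Hit("rs80290005", True, 3),
    ],
    "GRCh37:17:15666775": [
        Hit("rs82206", True, 2),
        Hit("rs75884461", False, 1),
    ],
}

def report(coord, hits, inc_failed=False, unique=False):
    """Return the output rows for one query coordinate."""
    kept = [
        h for h in hits
        if (inc_failed or not h.failed)          # drop QC failures by default
        and (not unique or h.n_locations == 1)   # --unique: single location only
    ]
    if not kept:
        # Nothing left: emit empty cells, one extra for the "Failed" column.
        return [(coord, EMPTY) + ((EMPTY,) if inc_failed else ())]
    if inc_failed:
        return [(coord, h.rsid, "1" if h.failed else "0") for h in kept]
    return [(coord, h.rsid) for h in kept]

if __name__ == "__main__":
    for coord, hits in HITS.items():
        for row in report(coord, hits, inc_failed=True, unique=True):
            print("\t".join(row))
```

With both options enabled, the sketch keeps rs73955041 and rs75884461 and emits empty cells for GRCh37:1:13668, in line with the third listing in the Examples section.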

The tool handles variations as listed below. Because a genomic coordinate may define either a single position or a range, the program lists every variation contained within the range. A variation is reported in the following cases.

Single nucleotide variation
    Position query: The query and SNP position are identical.
    Range query:    The SNP is within the query range.

Deletion
    Position query: The query position is contained within the range specified by the deletion.
    Range query:    Any non-empty intersection of the query range and the deletion range.

Insertion
    Position query: The insertion takes place exactly after the query position.
    Range query:    The query range contains the position of the insertion, or starts or ends exactly at the position of the insertion.
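The matching rules above can be expressed as a small Python predicate. This is a sketch under stated assumptions rather than the tool's implementation: positions are taken to be 1-based and inclusive, an insertion is represented by the position of the base it follows, and the names Variation and matches are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variation:
    kind: str   # "snv", "deletion" or "insertion"
    start: int  # first affected base; for an insertion, the base it follows
    end: int    # last affected base (equal to start for SNVs and insertions)

def matches(var: Variation, q_start: int, q_end: int) -> bool:
    """Return True if the variation would be reported for the query.

    A single-position query is expressed as q_start == q_end.
    """
    if var.kind == "snv":
        # Position: identical positions. Range: SNP lies within the range.
        return q_start <= var.start <= q_end
    if var.kind == "deletion":
        # Non-empty intersection of two closed intervals.
        return var.start <= q_end and q_start <= var.end
    if var.kind == "insertion":
        # The insertion position lies in the range, boundaries included.
        return q_start <= var.start <= q_end
    raise ValueError(f"unknown variation kind: {var.kind!r}")

if __name__ == "__main__":
    deletion = Variation("deletion", 100, 105)
    print(matches(deletion, 103, 103))  # position inside the deletion
    print(matches(deletion, 106, 110))  # range starting after the deletion
```

Treating a single position as a range of length one keeps the position and range cases in one code path, which is why each variation type needs only one comparison.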

Examples

Assume the input file /tmp/gc.txt contains the following input

GCOORD
GRCh37:1:13668
GRCh37:2:114357350
GRCh37:17:15666775

and gcoords2snp is run with the following command line:

$ ./gcoords2snp -H -c 1 < /tmp/gc.txt

It will generate the output:

GCOORD                  RSID
GRCh37:1:13668          --
GRCh37:2:114357350      --
GRCh37:17:15666775      rs75884461

From this output, it appears that GRCh37:1:13668 and GRCh37:2:114357350 do not map to any dbSNP ID, but re-running the command with --inc-failed yields:

GCOORD                  RSID            Failed (Ensembl QC)
GRCh37:1:13668          rs2691328       1
GRCh37:1:13668          rs67465876      1
GRCh37:1:13668          rs80290005      1
GRCh37:2:114357350      rs67465876      1
GRCh37:2:114357350      rs73955041      1
GRCh37:2:114357350      rs80290005      1
GRCh37:17:15666775      rs82206         1
GRCh37:17:15666775      rs75884461      0

This shows that both coordinates actually map to several SNPs, all of which failed Ensembl QC. Many of those SNPs probably map to several locations (the main reason for QC failure), although this is not necessarily the case. Running the command once more with both --inc-failed and --unique returns:

GCOORD                  RSID            Failed (Ensembl QC)
GRCh37:1:13668          --              --
GRCh37:2:114357350      rs73955041      1
GRCh37:17:15666775      rs75884461      0

The dbSNP ID rs73955041 has failed QC but still maps to a single coordinate. This can be verified by mapping back to genomic coordinates with snp2gcoords (conversion of dbSNP IDs to genomic coordinates):

$ ./gcoords2snp -H -c 1 --inc-failed < /tmp/gc.txt | ./snp2gcoords -H -c 2 --inc-failed
GCOORD                  RSID            Failed (Ensembl QC)  Genomic coordinates     Failed (Ensembl QC)
GRCh37:1:13668          rs2691328       1                    GRCh37:1:13668          1
GRCh37:1:13668          rs2691328       1                    GRCh37:16:63349         1
GRCh37:1:13668          rs2691328       1                    GRCh37:12:91945         1
GRCh37:1:13668          rs2691328       1                    GRCh37:15:102517502     1
GRCh37:1:13668          rs67465876      1                    GRCh37:2:114357350      1
GRCh37:1:13668          rs67465876      1                    GRCh37:1:13668          1
GRCh37:1:13668          rs80290005      1                    GRCh37:2:114357350      1
GRCh37:1:13668          rs80290005      1                    GRCh37:1:13668          1
GRCh37:1:13668          rs80290005      1                    GRCh37:15:102517502     1
GRCh37:2:114357350      rs67465876      1                    GRCh37:2:114357350      1
GRCh37:2:114357350      rs67465876      1                    GRCh37:1:13668          1
GRCh37:2:114357350      rs73955041      1                    GRCh37:2:114357350      1
GRCh37:2:114357350      rs80290005      1                    GRCh37:2:114357350      1
GRCh37:2:114357350      rs80290005      1                    GRCh37:1:13668          1
GRCh37:2:114357350      rs80290005      1                    GRCh37:15:102517502     1
GRCh37:17:15666775      rs82206         1                    GRCh37:17:15666775      1
GRCh37:17:15666775      rs82206         1                    GRCh37:17:20481815      1
GRCh37:17:15666775      rs75884461      0                    GRCh37:17:15666775      0

Notice how all dbSNP IDs that failed QC, except rs73955041, map to several different locations. In this case every dbSNP ID maps back at least once to the original coordinate, but this might not always be true, even for SNPs that passed QC.
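The same check can be automated for larger inputs. The Python sketch below counts the distinct locations per dbSNP ID and tests whether each ID maps back to its query coordinate. It assumes the piped output is tab-separated (the listing above only shows aligned columns) and that "--" marks an empty cell; the function name summarise and the SAMPLE excerpt are illustrative.

```python
import csv
import io
from collections import defaultdict

EMPTY = "--"  # empty cell identifier as shown in the examples

# A few rows of the round-trip output, assumed to be tab-separated.
SAMPLE = (
    "GCOORD\tRSID\tFailed (Ensembl QC)\tGenomic coordinates\tFailed (Ensembl QC)\n"
    "GRCh37:2:114357350\trs73955041\t1\tGRCh37:2:114357350\t1\n"
    "GRCh37:17:15666775\trs82206\t1\tGRCh37:17:15666775\t1\n"
    "GRCh37:17:15666775\trs82206\t1\tGRCh37:17:20481815\t1\n"
    "GRCh37:17:15666775\trs75884461\t0\tGRCh37:17:15666775\t0\n"
)

def summarise(text):
    """Map each rsID to (number of locations, maps back to every query?)."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    next(rows)  # skip the header; a column name repeats, so index by position
    locations = defaultdict(set)
    queries = defaultdict(set)
    for query, rsid, _failed, location, _failed_back in rows:
        if rsid == EMPTY or location == EMPTY:
            continue  # unmapped coordinate or unmapped SNP
        locations[rsid].add(location)
        queries[rsid].add(query)
    return {
        rsid: (len(locs), queries[rsid] <= locs)
        for rsid, locs in locations.items()
    }

if __name__ == "__main__":
    for rsid, (n_locations, maps_back) in summarise(SAMPLE).items():
        print(rsid, n_locations, maps_back)
```

An ID with a location count of 1 is one that --unique would keep; a False in the second field would flag an ID that does not map back to the coordinate it was found at.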