Data Integrator
snp2gcoords - Conversion of dbSNP IDs to genomic coordinates

This small program converts human dbSNP IDs (rs*) to genomic coordinates on the current human genome assembly. A dbSNP ID does not always map to a single, unique location on the human genome. Such cases can be filtered out by this program.

This tool utilizes the Ensembl Perl API.

Input

The program expects a column with dbSNP rs* IDs. Optionally, output can be restricted to dbSNP IDs that map to a single location on the genome.

SNPs that failed Ensembl's quality check are excluded by default. Such SNPs usually map to a very large number of coordinates, and/or some of those coordinates do not necessarily map back to the same SNP. To include them, supply the --inc-failed switch on the command line. When --inc-failed is supplied, an additional column, "Failed (Ensembl QC)", is appended to indicate each SNP's status.

The option --inc-ref-alt adds three more columns to the output: Ref, Alt, and Strand. For each dbSNP entry, the reference and alternate allele pairs are printed. In rare cases there are mismatches; the alternate allele is then reported as a question mark, '?'. The reference strand is output as well; in most cases this is the forward strand, '+'.

The option --inc-validation-states adds a column, Validation States, that describes in which databases or by which methods the SNP has been validated.

Because Ensembl has also introduced special regions on the chromosomes, SNPs may be reported on regions whose names are not among the standard human chromosome names 1, 2, ..., 22, X, and Y. The option --only-std-chrom restricts the output to these standard chromosome names.
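The standard-chromosome rule described above can be illustrated with a small Python sketch. It is not the tool's implementation (the tool uses the Ensembl Perl API); it only shows the rule applied to coordinate strings of the form assembly:region:position, as they appear in the examples on this page. The function name is_std_chromosome and the non-standard entry in the sample list are hypothetical.

```python
# Standard human chromosome names as listed above: 1, 2, ..., 22, X, and Y.
STD_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_std_chromosome(coord: str) -> bool:
    """Return True if an 'assembly:region:position' string lies on a
    standard chromosome. Anything that does not have exactly three
    colon-separated fields (for example the '--' placeholder) is rejected."""
    parts = coord.split(":")
    if len(parts) != 3:
        return False
    return parts[1] in STD_CHROMOSOMES


coords = [
    "GRCh37:10:72604246",
    "GRCh37:Y:23206877",
    "GRCh37:HSCHR6_MHC_COX:100",  # made-up entry on a non-standard region
]
print([c for c in coords if is_std_chromosome(c)])
# prints ['GRCh37:10:72604246', 'GRCh37:Y:23206877']
```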

Options applicable to more than a single tool are summarized in common command line options.

Output

If an identifier cannot be mapped, an empty cell is output (shown as -- in the examples below). Likewise, if SNPs that map to multiple locations are to be omitted (option -u in the examples below), an empty cell is output for each of them.
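The effect of omitting ambiguous SNPs can be sketched in Python as a post-processing step over (ID, coordinate) pairs. This is an illustration of the rule only, not the tool's own code. The function name keep_unique is hypothetical, and the "--" placeholder for an empty cell is taken from the example output on this page.

```python
def keep_unique(rows, placeholder="--"):
    """rows: iterable of (snp_id, coordinate) pairs, one pair per mapping.

    Returns one (snp_id, coordinate) pair per ID, in order of first
    appearance. IDs with no mapping or with more than one distinct
    mapping get the placeholder instead of a coordinate."""
    locations = {}  # dicts preserve insertion order
    for snp_id, coord in rows:
        locs = locations.setdefault(snp_id, [])
        if coord != placeholder and coord not in locs:
            locs.append(coord)
    return [
        (snp_id, locs[0] if len(locs) == 1 else placeholder)
        for snp_id, locs in locations.items()
    ]


rows = [
    ("rs94", "GRCh37:Y:23206877"),
    ("rs94", "GRCh37:6:62315934"),
    ("rs188194665", "GRCh37:10:72604246"),
    ("rsUNK", "--"),
]
for snp_id, coord in keep_unique(rows):
    print(f"{snp_id}\t{coord}")
```

In this sketch rs94 is blanked because it has two distinct locations, rs188194665 keeps its single location, and rsUNK stays empty because it has none.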

Examples

Assume the input file /tmp/rs.tsv listed below, which contains two dbSNP IDs (rs94 and rs9961) that map ambiguously to multiple locations as well as an invalid identifier (rsUNK).

dbSNP ID
rs94
rs9961
rs188194665
rsUNK

The command line

chris-cmd$ ./snp2gcoords.pl -H -c 1 </tmp/rs.tsv

maps all SNPs to their genomic coordinates, even if they are ambiguous,

dbSNP ID     Genomic coordinates
rs94         GRCh37:Y:23206877
rs94         GRCh37:6:62315934
rs9961       GRCh37:21:20230679
rs9961       GRCh37:7:44841106
rs188194665  GRCh37:10:72604246
rsUNK        --

In contrast, the command line

chris-cmd$ ./snp2gcoords.pl -H -c 1 -u </tmp/rs.tsv

suppresses output of dbSNP IDs that refer to multiple locations,

dbSNP ID     Genomic coordinates
rs94         --
rs9961       --
rs188194665  GRCh37:10:72604246
rsUNK        --

such that no location is output for these and the following warning messages are issued:

snp2gcoords.pl WARN: ENOTUNIQUE(rs94): SNP id ``rs94'' doesn't map uniquely
snp2gcoords.pl WARN: ENOTUNIQUE(rs9961): SNP id ``rs9961'' doesn't map uniquely

The command line:

chris-cmd$ ./snp2gcoords.pl -H -c 1 --inc-failed </tmp/rs.tsv

additionally outputs the "Failed (Ensembl QC)" column:

dbSNP ID        Genomic coordinates     Failed (Ensembl QC)
rs94            GRCh37:Y:23206877       1
rs94            GRCh37:6:62315934       1
rs9961          GRCh37:21:20230679      1
rs9961          GRCh37:7:44841106       1
rs188194665     GRCh37:10:72604246      0
rsUNK           --                      --
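For downstream processing, the output shown above can be read back and split into its parts. The following Python sketch assumes that the output is tab-separated (the input file is a .tsv file), that the header line is present, and that coordinates always have the form assembly:region:position. The names Coord and parse_coord are hypothetical helpers and not part of the tool.

```python
import csv
import io
from typing import NamedTuple, Optional


class Coord(NamedTuple):
    assembly: str
    chromosome: str
    position: int


def parse_coord(cell: str) -> Optional[Coord]:
    """Split 'GRCh37:10:72604246' into its three parts.
    Returns None for the '--' placeholder or any other malformed cell."""
    parts = cell.strip().split(":")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return Coord(parts[0], parts[1], int(parts[2]))


# A few rows copied from the example output above, assumed tab-separated.
sample = (
    "dbSNP ID\tGenomic coordinates\tFailed (Ensembl QC)\n"
    "rs94\tGRCh37:Y:23206877\t1\n"
    "rs188194665\tGRCh37:10:72604246\t0\n"
    "rsUNK\t--\t--\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
for row in reader:
    coord = parse_coord(row["Genomic coordinates"])
    failed = row["Failed (Ensembl QC)"] == "1"
    print(row["dbSNP ID"], coord, failed)
```

Returning None for unmapped identifiers keeps the placeholder from being mistaken for a coordinate later in a pipeline.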