Data Integrator
This small program converts human dbSNP IDs (rs*) to genomic coordinates on the current human genome assembly. A dbSNP ID does not always map to a single location on the human genome; this program can filter out such cases.
This tool utilizes the Ensembl Perl API.
The program expects a column with dbSNP rs* IDs. Optionally, output can be restricted to dbSNP IDs that map to a single location on the genome.
SNPs that failed Ensembl's quality check are excluded automatically. Such SNPs usually map to a very large number of coordinates, and/or some of those coordinates do not map back to the same SNP. To include them, supply the --inc-failed switch on the command line. When --inc-failed is supplied, an additional column, "Failed (Ensembl QC)", is appended to indicate each SNP's status.
The option --inc-ref-alt adds three more columns to the output: Ref, Alt, and Strand. For each dbSNP entry, the reference and alternate allele pairs are printed. In the rare case of a mismatch, the alternate allele is output as a question mark, '?'. The reference strand is also output; in most cases this is the forward strand, '+'.
The option --inc-validation-states adds a column, Validation States, which describes in which databases or by which methods the SNPs have been validated.
Because Ensembl has also introduced special regions on the chromosomes, SNPs may be reported on regions whose names are not among the standard human chromosome names 1, 2, ..., 22, X, and Y. The option --only-std-chrom restricts output to these standard chromosome names.
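To make the notion of a "standard chromosome name" concrete, the following Python sketch checks a coordinate string against the names 1, 2, ..., 22, X, and Y. It is not the tool's implementation: the helper name is hypothetical, and the `assembly:chromosome:position` layout is an assumption inferred from the example output in this document (e.g. GRCh37:10:72604246).

```python
# Hypothetical helper, not part of snp2gcoords.pl.
# Standard human chromosome names: 1, 2, ..., 22, X, and Y.
STANDARD_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def is_standard_chromosome(coordinate: str) -> bool:
    """Return True if the coordinate lies on a standard chromosome.

    Assumes the form 'assembly:chromosome:position', as seen in the
    example output (e.g. 'GRCh37:10:72604246').
    """
    parts = coordinate.split(":")
    if len(parts) != 3:
        # Placeholder cells such as '--' carry no chromosome at all.
        return False
    return parts[1] in STANDARD_CHROMOSOMES
```

A check like this could be applied to the output after the fact, for instance when results were generated without the switch.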
Options applicable to more than a single tool are summarized in common command line options.
If an identifier cannot be mapped, an empty cell (shown as -- in the examples below) is output. Likewise, when SNPs that map to multiple locations are to be omitted, an empty cell is output for each of them.
Assume the input file /tmp/rs.tsv listed below, which contains two dbSNP IDs (rs94 and rs9961) that map ambiguously to multiple locations as well as an invalid identifier (rsUNK).
dbSNP ID
rs94
rs9961
rs188194665
rsUNK
The command line
chris-cmd$ ./snp2gcoords.pl -H -c 1 </tmp/rs.tsv
maps all SNPs to their genomic coordinates, even if the mapping is ambiguous:
dbSNP ID       Genomic coordinates
rs94           GRCh37:Y:23206877
rs94           GRCh37:6:62315934
rs9961         GRCh37:21:20230679
rs9961         GRCh37:7:44841106
rs188194665    GRCh37:10:72604246
rsUNK          --
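Output of this shape can be post-processed to see which IDs are ambiguous. The Python sketch below groups coordinates by dbSNP ID. It rests on assumptions not stated in this document: that the columns are tab-separated, that the first line is the header, and that unmapped cells appear as `--`. The function names are hypothetical.

```python
from collections import defaultdict

PLACEHOLDER = "--"  # assumption: unmapped cells are printed as "--"


def group_coordinates(lines):
    """Map each dbSNP ID to the list of coordinates reported for it.

    `lines` is a list of tab-separated output lines whose first
    element is the header line (assumed layout).
    """
    mapping = defaultdict(list)
    for line in lines[1:]:  # skip the header line
        snp_id, coordinate = line.rstrip("\n").split("\t")
        coordinates = mapping[snp_id]  # registers the ID even if unmapped
        if coordinate != PLACEHOLDER:
            coordinates.append(coordinate)
    return dict(mapping)


def ambiguous_ids(mapping):
    """Return the IDs that were reported at more than one location."""
    return [snp_id for snp_id, coords in mapping.items() if len(coords) > 1]
```

IDs with an empty list were not mapped at all, and IDs with more than one entry are the ambiguous ones.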
In contrast, the command line
chris-cmd$ ./snp2gcoords.pl -H -c 1 -u </tmp/rs.tsv
suppresses output of dbSNP IDs that refer to multiple locations,
dbSNP ID       Genomic coordinates
rs94           --
rs9961         --
rs188194665    GRCh37:10:72604246
rsUNK          --
such that no location is output for these and the following warning messages are issued:
snp2gcoords.pl WARN: ENOTUNIQUE(rs94): SNP id ``rs94'' doesn't map uniquely
snp2gcoords.pl WARN: ENOTUNIQUE(rs9961): SNP id ``rs9961'' doesn't map uniquely
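The effect of the -u switch on the example can be reproduced with a small Python sketch. This is an illustration of the observed behaviour only, not the program's actual logic; the function name and the `--` placeholder are assumptions taken from the example output.

```python
from collections import Counter


def suppress_ambiguous(rows, placeholder="--"):
    """Collapse IDs with several locations into one placeholder row.

    `rows` is a list of (snp_id, coordinate) pairs in output order.
    Returns the filtered rows and the list of IDs that were suppressed.
    """
    # Count real locations per ID; placeholder cells do not count.
    counts = Counter(snp_id for snp_id, coord in rows if coord != placeholder)
    filtered, suppressed = [], []
    for snp_id, coord in rows:
        if counts[snp_id] > 1:
            if snp_id not in suppressed:
                # Emit a single placeholder row per ambiguous ID.
                suppressed.append(snp_id)
                filtered.append((snp_id, placeholder))
        else:
            filtered.append((snp_id, coord))
    return filtered, suppressed
```

Applied to the rows of the unfiltered example, the suppressed list corresponds to the IDs named in the warning messages.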
The command line:
chris-cmd$ ./snp2gcoords.pl -H -c 1 --inc-failed </tmp/rs.tsv
additionally outputs the "Failed (Ensembl QC)" column:
dbSNP ID       Genomic coordinates    Failed (Ensembl QC)
rs94           GRCh37:Y:23206877      1
rs94           GRCh37:6:62315934      1
rs9961         GRCh37:21:20230679     1
rs9961         GRCh37:7:44841106      1
rs188194665    GRCh37:10:72604246     0
rsUNK          --                     --
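If the extra column is used downstream, the IDs that passed the quality check can be selected from this output. The Python sketch below assumes tab-separated output with the header names shown above, and it reads 1 as "failed" and 0 as "passed", an interpretation inferred from the example rather than stated by the tool. The function name is hypothetical.

```python
import csv
import io


def ids_passing_qc(tsv_text):
    """Return the dbSNP IDs whose "Failed (Ensembl QC)" cell is "0".

    Rows with "1" (assumed: failed) or "--" (unmapped) are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["dbSNP ID"]
        for row in reader
        if row["Failed (Ensembl QC)"] == "0"
    ]
```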