Data Integrator
This small program maps human genomic coordinates to dbSNP IDs (rs*). A coordinate may not map to a unique dbSNP ID, and a dbSNP ID may not pass the Ensembl quality check (QC). The program can filter out such cases.
This tool utilizes the Ensembl Perl API.
The program expects a column for the genomic coordinate. By default, gcoords2snp excludes SNPs that failed QC. These SNPs usually map to multiple, different locations. The –inc-failed command line option includes these SNPs in the output.
In addition, the –unique flag excludes all SNPs that do not, in turn, map back to a single coordinate.
Both –inc-failed and –unique can be specified together, as they are not mutually exclusive.
Options applicable to more than a single tool are summarized in common command line options.
If the genomic coordinate cannot be mapped, an empty cell is output. Likewise, if SNPs mapping to multiple locations are omitted with –unique, an empty cell is output and a warning is written to the log. By default, the output does not distinguish between a genomic coordinate that maps to no SNP and one that maps only to SNPs that failed Ensembl QC; no warning is generated in such cases. The two cases can be told apart by supplying the –inc-failed command line option, which adds the "Failed (Ensembl QC)" column. This column holds a 1 if there is a dbSNP ID that has not passed the Ensembl QC, and a 0 if there is a dbSNP ID that has passed the QC. If there is no dbSNP ID at all, this column holds the empty cell identifier.
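The three possible states of an output row can be read off mechanically. The following is a minimal sketch, assuming the empty cell identifier is `--` (as in the example output further down); the function name `describe` is hypothetical and not part of the tool.

```python
EMPTY = "--"  # assumption: empty-cell marker as seen in the example output


def describe(rsid: str, failed: str) -> str:
    """Interpret one (RSID, "Failed (Ensembl QC)") pair from an output row."""
    if rsid == EMPTY:
        # No dbSNP ID at all: the Failed column is empty as well.
        return "no dbSNP ID reported"
    if failed == "1":
        return "dbSNP ID failed Ensembl QC"
    if failed == "0":
        return "dbSNP ID passed Ensembl QC"
    raise ValueError(f"unexpected value in Failed column: {failed!r}")
```

For instance, `describe("rs73955041", "1")` reports a failed ID, while `describe("--", "--")` reports that no ID was found.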
The tool handles variation as listed below. Since a genomic coordinate may define a range or a single position, the program lists all variation contained within the range. An output is made for the following cases.
Variation type | Position | Range |
---|---|---|
Single nucleotide variation | The query and SNP position are identical. | The SNP is within the query range. |
Deletion | Query position is contained within the range specified by the deletion. | Any non-empty intersection of the query range and the SNP deletion range. |
Insertion | The insertion takes place exactly after the query position. | The query range contains the position of the insertion, or starts or ends exactly at the position of the insertion. |
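The rules in the table can be sketched as a small predicate. This is an illustration only, not the tool's implementation: the `Variant` class, the `matches` function, and the choice to represent an insertion by the base it follows are all assumptions made for this example, and all bounds are treated as inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    kind: str   # "snv", "deletion" or "insertion"
    start: int  # SNV: its position; deletion: first deleted base;
                # insertion: the base after which the insertion occurs
    end: int    # deletion: last deleted base; otherwise equal to start


def matches(v: Variant, q_start: int, q_end: int) -> bool:
    """Return True if v is reported for the query [q_start, q_end].

    A single-position query is expressed as q_start == q_end.
    """
    if v.kind == "snv":
        # Position: identical; range: SNP lies within the range.
        return q_start <= v.start <= q_end
    if v.kind == "deletion":
        # Non-empty intersection of the query and the deleted range.
        return v.start <= q_end and q_start <= v.end
    if v.kind == "insertion":
        # Assumed reading: the base preceding the insertion lies in the
        # query, including the case where the query starts or ends there.
        return q_start <= v.start <= q_end
    raise ValueError(f"unknown variant kind: {v.kind!r}")
```

With this reading, a deletion spanning 100 to 105 is reported for the single position 103 and for the range 104 to 200, but not for a range starting at 106.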
Assume the input file /tmp/gc.txt contains the following input
GCOORD |
---|
GRCh37:1:13668 |
GRCh37:2:114357350 |
GRCh37:17:15666775 |
and is run with the following command line:
It will generate the output:
GCOORD | RSID |
---|---|
GRCh37:1:13668 | -- |
GRCh37:2:114357350 | -- |
GRCh37:17:15666775 | rs75884461 |
In the output, it appears that GRCh37:1:13668 and GRCh37:2:114357350 do not map to any dbSNP ID, but re-running the command with –inc-failed yields:
GCOORD | RSID | Failed (Ensembl QC) |
---|---|---|
GRCh37:1:13668 | rs2691328 | 1 |
GRCh37:1:13668 | rs67465876 | 1 |
GRCh37:1:13668 | rs80290005 | 1 |
GRCh37:2:114357350 | rs67465876 | 1 |
GRCh37:2:114357350 | rs73955041 | 1 |
GRCh37:2:114357350 | rs80290005 | 1 |
GRCh37:17:15666775 | rs82206 | 1 |
GRCh37:17:15666775 | rs75884461 | 0 |
which shows that both actually map to several SNPs that failed Ensembl QC. Many of those SNPs probably map to several locations (the main reason for QC failure), although this is not necessarily the case. Re-running with both –inc-failed and –unique returns:
GCOORD | RSID | Failed (Ensembl QC) |
---|---|---|
GRCh37:1:13668 | -- | -- |
GRCh37:2:114357350 | rs73955041 | 1 |
GRCh37:17:15666775 | rs75884461 | 0 |
The dbSNP ID rs73955041 has failed QC but still maps to a single coordinate. This can be verified by mapping the dbSNP IDs back to genomic coordinates with snp2gcoords (conversion of dbSNP IDs to genomic coordinates):
GCOORD | RSID | Failed (Ensembl QC) | Genomic coordinates | Failed (Ensembl QC) |
---|---|---|---|---|
GRCh37:1:13668 | rs2691328 | 1 | GRCh37:1:13668 | 1 |
GRCh37:1:13668 | rs2691328 | 1 | GRCh37:16:63349 | 1 |
GRCh37:1:13668 | rs2691328 | 1 | GRCh37:12:91945 | 1 |
GRCh37:1:13668 | rs2691328 | 1 | GRCh37:15:102517502 | 1 |
GRCh37:1:13668 | rs67465876 | 1 | GRCh37:2:114357350 | 1 |
GRCh37:1:13668 | rs67465876 | 1 | GRCh37:1:13668 | 1 |
GRCh37:1:13668 | rs80290005 | 1 | GRCh37:2:114357350 | 1 |
GRCh37:1:13668 | rs80290005 | 1 | GRCh37:1:13668 | 1 |
GRCh37:1:13668 | rs80290005 | 1 | GRCh37:15:102517502 | 1 |
GRCh37:2:114357350 | rs67465876 | 1 | GRCh37:2:114357350 | 1 |
GRCh37:2:114357350 | rs67465876 | 1 | GRCh37:1:13668 | 1 |
GRCh37:2:114357350 | rs73955041 | 1 | GRCh37:2:114357350 | 1 |
GRCh37:2:114357350 | rs80290005 | 1 | GRCh37:2:114357350 | 1 |
GRCh37:2:114357350 | rs80290005 | 1 | GRCh37:1:13668 | 1 |
GRCh37:2:114357350 | rs80290005 | 1 | GRCh37:15:102517502 | 1 |
GRCh37:17:15666775 | rs82206 | 1 | GRCh37:17:15666775 | 1 |
GRCh37:17:15666775 | rs82206 | 1 | GRCh37:17:20481815 | 1 |
GRCh37:17:15666775 | rs75884461 | 0 | GRCh37:17:15666775 | 0 |
Notice how all dbSNP IDs that failed QC, except rs73955041, map to several different locations. In this case all dbSNP IDs map back at least once to the original coordinate, but this might not always be true, even for SNPs that passed QC.
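The effect of the two options on the example can be reproduced with a short model of the selection logic. This is a sketch under stated assumptions, not the tool's code: `CANDIDATES` and `select` are hypothetical names, the QC flags are copied from the –inc-failed output above, and the location counts are the number of distinct coordinates each dbSNP ID maps back to in the snp2gcoords output above.

```python
EMPTY = "--"  # empty-cell marker as seen in the example output

# Per query coordinate: (dbSNP ID, failed QC?, number of distinct locations).
CANDIDATES = {
    "GRCh37:1:13668": [
        ("rs2691328", True, 4), ("rs67465876", True, 2), ("rs80290005", True, 3),
    ],
    "GRCh37:2:114357350": [
        ("rs67465876", True, 2), ("rs73955041", True, 1), ("rs80290005", True, 3),
    ],
    "GRCh37:17:15666775": [
        ("rs82206", True, 2), ("rs75884461", False, 1),
    ],
}


def select(candidates, inc_failed=False, unique=False):
    """Model of the filtering: which dbSNP IDs are listed per coordinate."""
    result = {}
    for coord, snps in candidates.items():
        kept = [
            rsid
            for rsid, failed, n_locations in snps
            if (inc_failed or not failed)         # default: drop failed SNPs
            and (not unique or n_locations == 1)  # unique: single location only
        ]
        result[coord] = kept or [EMPTY]           # nothing left: empty cell
    return result


print(select(CANDIDATES))
print(select(CANDIDATES, inc_failed=True, unique=True))
```

With the defaults, only GRCh37:17:15666775 retains an ID (rs75884461). With both options enabled, GRCh37:2:114357350 additionally retains rs73955041, matching the tables above.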