Data Integrator
This tool takes human genomic coordinates from a previous genome build as input and lifts (i.e. transforms) them to the current build version, which is GRCh38 at the time of writing (2020-06-19).
This tool utilizes the Ensembl Perl API.
The tool expects a column with genomic coordinates as defined in genomic coordinate. A build version can be supplied in case the genomic coordinates do not include it. Conversely, if the build is specified in the input cells, the column may even contain a mixture of different builds. Furthermore, coordinates whose mapping to the current release is not unique can be omitted; in that case the empty cell identifier is output.
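To illustrate how a build can either be part of each cell or be supplied separately, here is a minimal Python sketch. It is not the tool's implementation. It assumes the `BUILD:CHROMOSOME:POSITION` layout seen in the examples on this page, and it further assumes that a cell without a build has the form `CHROMOSOME:POSITION`; the names `GenomicCoordinate` and `parse_coordinate` are hypothetical.

```python
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    """Hypothetical container for one fully qualified coordinate."""
    build: str
    chromosome: str
    position: int


def parse_coordinate(cell: str, default_build: Optional[str] = None) -> GenomicCoordinate:
    """Parse 'BUILD:CHR:POS', or 'CHR:POS' when a default build is supplied.

    The two-field form is an assumption made for illustration only.
    """
    parts = cell.strip().split(":")
    if len(parts) == 3:
        # The build is given in the cell itself, so mixed builds are possible.
        build, chromosome, position = parts
    elif len(parts) == 2:
        # No build in the cell: fall back to the separately supplied build.
        if default_build is None:
            raise ValueError(f"no build in {cell!r} and no default build supplied")
        build = default_build
        chromosome, position = parts
    else:
        raise ValueError(f"unrecognised coordinate: {cell!r}")
    return GenomicCoordinate(build, chromosome, int(position))
```

For example, `parse_coordinate("13:99553192", default_build="NCBI36")` yields a coordinate on build NCBI36, whereas `parse_coordinate("GRCh37:13:100755191")` takes the build from the cell.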
With the continued support for GRCh37 in Ensembl, it is possible to both uplift and downlift coordinates. If a Dintor release includes both GRCh37 and GRCh38 via its data versions, selecting a data version implicitly fixes the coordinate system into which coordinates are transformed. If this is a GRCh37-based release, supplying genomic coordinates in GRCh38 will perform a transformation from GRCh38 to GRCh37.
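The relationship between the input build and the build fixed by the data version can be sketched as follows. This is only an illustration of the terms "uplift" and "downlift"; the list `BUILD_ORDER` covers just the builds that appear in the examples on this page, and the function `lift_direction` is hypothetical, not part of the tool.

```python
# Builds used in the examples on this page, oldest first (illustrative list only).
BUILD_ORDER = ["NCBI34", "NCBI35", "NCBI36", "GRCh37", "GRCh38"]


def lift_direction(source_build: str, target_build: str) -> str:
    """Classify a transformation as 'uplift', 'downlift' or 'unchanged'."""
    source = BUILD_ORDER.index(source_build)  # raises ValueError for unknown builds
    target = BUILD_ORDER.index(target_build)
    if source < target:
        return "uplift"
    if source > target:
        return "downlift"
    return "unchanged"
```

With a GRCh37-based data version, a GRCh38 input is therefore a downlift, while an NCBI36 input is an uplift.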
Options applicable to more than a single tool are summarized in common command line options.
A column "Updated genomic coordinates" is added, containing the current release's coordinates. Since a fully qualified genomic coordinate is output, it also contains the genome build and therefore indicates the version of the build currently in use. The column contains the empty cell identifier if the mapping could not be conducted successfully, or if it was not unique and output of ambiguously mapped entries has been disabled.
Let's assume we have GRCh37 data in data version v-00 and GRCh38 data in data version v-01. Given the input file /tmp/input.tsv, which contains the same position expressed in a variety of different genome builds,
GC
GRCh38:13:100102937
GRCh37:13:100755191
NCBI36:13:99553192
NCBI35:13:99553192
NCBI34:13:98453192
we transform all these coordinates to GRCh38 by selecting v-01 in the data version argument
$ ./liftgcoords.pl -H --data-version v-01 -c 1 </tmp/input.tsv
and obtain the following output:
GC	Updated genomic coordinates
GRCh38:13:100102937	GRCh38:13:100102937
GRCh37:13:100755191	GRCh38:13:100102937
NCBI36:13:99553192	GRCh38:13:100102937
NCBI35:13:99553192	GRCh38:13:100102937
NCBI34:13:98453192	GRCh38:13:100102937
Conversely, using v-00 to get GRCh37 coordinates is done by the following call:
$ ./liftgcoords.pl -H --data-version v-00 -c 1 </tmp/input.tsv
and results in
GC	Updated genomic coordinates
GRCh38:13:100102937	GRCh37:13:100755191
GRCh37:13:100755191	GRCh37:13:100755191
NCBI36:13:99553192	GRCh37:13:100755191
NCBI35:13:99553192	GRCh37:13:100755191
NCBI34:13:98453192	GRCh37:13:100755191
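Because every output cell is fully qualified, the target build can be read back from the result. The following Python sketch shows one way to check that all lifted coordinates ended up on the same build. It assumes the output is tab-separated with the header row shown above; since the empty cell identifier is not specified on this page, cells without a colon are simply skipped. The function name `target_builds` is hypothetical.

```python
import csv
import io


def target_builds(tsv_text: str, column: str = "Updated genomic coordinates") -> set:
    """Return the set of builds found in the given output column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    builds = set()
    for row in reader:
        cell = row.get(column) or ""
        if ":" not in cell:
            # Assumption: anything without a colon is an unmapped (empty) cell.
            continue
        # The build is the part before the first colon, e.g. 'GRCh37'.
        builds.add(cell.split(":", 1)[0])
    return builds


# First rows of the v-00 example output above.
example = (
    "GC\tUpdated genomic coordinates\n"
    "GRCh38:13:100102937\tGRCh37:13:100755191\n"
    "NCBI36:13:99553192\tGRCh37:13:100755191\n"
)
print(target_builds(example))
```

A result containing more than one build would indicate that output from different data versions has been mixed.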