Data Integrator
liftgcoords - Uplift genomic coordinates from past builds to current build

This tool takes human genomic coordinates from a previous genome build as input and lifts (i.e. transforms) them to the current build, which is GRCh38 at the time of writing (2020-06-19).

This tool utilizes the Ensembl Perl API.

Input

The tool expects a column with genomic coordinates as defined in genomic coordinate. A build version can be supplied in case the coordinates do not include it. Conversely, if the build is specified in the input cells themselves, the column may contain a mixture of different builds. Furthermore, coordinates whose mapping to the current release is not unique can be omitted; in that case the empty cell identifier is output instead.
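The fallback to a supplied build version can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the helper name parse_coordinate and the build-less form chromosome:position are assumptions, and only single positions as used in the examples of this page are handled.

```python
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    build: str
    chromosome: str
    position: int


def parse_coordinate(cell: str, default_build: Optional[str] = None) -> GenomicCoordinate:
    """Split a cell such as 'GRCh37:13:100755191' into its parts.

    Assumption: a cell without a build looks like '13:100755191'; in that
    case default_build is used, mirroring the supplied build version.
    """
    parts = cell.strip().split(":")
    if len(parts) == 3:
        # Build given in the cell itself, so mixed builds per column work.
        build, chromosome, position = parts
    elif len(parts) == 2 and default_build is not None:
        # No build in the cell: fall back to the supplied build version.
        build = default_build
        chromosome, position = parts
    else:
        raise ValueError(f"cannot determine build for {cell!r}")
    return GenomicCoordinate(build, chromosome, int(position))


# Mixed input: one cell carries its build, the other relies on the default.
for cell in ["GRCh38:13:100102937", "13:100755191"]:
    print(parse_coordinate(cell, default_build="GRCh37"))
```

A cell that lacks a build while no default is supplied raises a ValueError in this sketch; how the tool itself reacts in that situation is not covered here.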

Because Ensembl continues to support GRCh37, it is possible to both uplift and downlift coordinates. If a Dintor release includes both the GRCh37 and the GRCh38 version via its data versions, selecting a data version implicitly fixes the coordinate system into which coordinates are transformed. With a GRCh37-based data version, supplying genomic coordinates in GRCh38 performs a transformation from GRCh38 to GRCh37.
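Whether a given transformation counts as an uplift or a downlift follows from the chronological order of the builds. The following Python sketch makes that explicit; the function name lift_direction and the list BUILD_ORDER are hypothetical and cover only the builds mentioned on this page.

```python
# Builds from oldest to newest, limited to those used in this document.
BUILD_ORDER = ["NCBI34", "NCBI35", "NCBI36", "GRCh37", "GRCh38"]


def lift_direction(source_build: str, target_build: str) -> str:
    """Classify a transformation by comparing build ages.

    target_build stands for the build implied by the selected data version.
    Unknown build names raise ValueError (from list.index).
    """
    source = BUILD_ORDER.index(source_build)
    target = BUILD_ORDER.index(target_build)
    if source < target:
        return "uplift"
    if source > target:
        return "downlift"
    return "unchanged"


# GRCh38 input with a GRCh37-based data version is a downlift.
print(lift_direction("GRCh38", "GRCh37"))
```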

Options applicable to more than a single tool are summarized in common command line options.

Output

A column "Updated genomic coordinates" is added, containing the coordinates in the current release. Since a fully qualified genomic coordinate is output, it also contains the genome build and thereby indicates which build is currently in use. The column contains the empty cell identifier if the mapping failed, or if it was not unique and output of ambiguously mapped entries has been disabled.
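Downstream code typically needs to separate successfully mapped rows from those carrying the empty cell identifier, and may want to read the target build off the output. A minimal Python sketch of both steps follows. The function names are hypothetical, and the actual empty cell identifier is not stated here, so it is passed in as a parameter; the value "-" below is purely a placeholder assumption.

```python
from typing import Iterable, List, Set, Tuple

Row = Tuple[str, str]  # (input coordinate, updated coordinate)


def split_by_mapping(rows: Iterable[Row], empty_marker: str) -> Tuple[List[Row], List[Row]]:
    """Return (mapped, unmapped) rows based on the empty cell identifier."""
    mapped: List[Row] = []
    unmapped: List[Row] = []
    for source, updated in rows:
        if updated == empty_marker:
            unmapped.append((source, updated))
        else:
            mapped.append((source, updated))
    return mapped, unmapped


def target_builds(mapped: Iterable[Row]) -> Set[str]:
    """Collect the build prefix of every updated coordinate."""
    return {updated.split(":", 1)[0] for _, updated in mapped}


# Placeholder only: substitute the empty cell identifier of your setup.
EMPTY = "-"

rows = [
    ("GRCh38:13:100102937", "GRCh38:13:100102937"),
    ("GRCh37:13:100755191", "GRCh38:13:100102937"),
]
mapped, unmapped = split_by_mapping(rows, EMPTY)
print(len(mapped), len(unmapped), target_builds(mapped))
```

Since every run targets a single build, target_builds should yield exactly one element for the output of one call.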

Examples

Let's assume we have GRCh37 data in data version v-00 and GRCh38 data in data version v-01. Given the input file /tmp/input.tsv, which points to the same position for a variety of different genome releases,

GC
GRCh38:13:100102937
GRCh37:13:100755191
NCBI36:13:99553192
NCBI35:13:99553192
NCBI34:13:98453192

we transform all these coordinates to GRCh38 by selecting v-01 in the data version argument:

$ ./liftgcoords.pl -H --data-version v-01 -c 1 </tmp/input.tsv

and obtain the following output:

GC                      Updated genomic coordinates
GRCh38:13:100102937     GRCh38:13:100102937
GRCh37:13:100755191     GRCh38:13:100102937
NCBI36:13:99553192      GRCh38:13:100102937
NCBI35:13:99553192      GRCh38:13:100102937
NCBI34:13:98453192      GRCh38:13:100102937

Conversely, using v-00 to get GRCh37 coordinates is done by the following call:

$ ./liftgcoords.pl -H --data-version v-00 -c 1 </tmp/input.tsv

and results in

GC                      Updated genomic coordinates
GRCh38:13:100102937     GRCh37:13:100755191
GRCh37:13:100755191     GRCh37:13:100755191
NCBI36:13:99553192      GRCh37:13:100755191
NCBI35:13:99553192      GRCh37:13:100755191
NCBI34:13:98453192      GRCh37:13:100755191