Data Integrator
Gcoords2gcoords - Generation and split of genomic coordinates

The Gcoords2gcoords tool generates genomic coordinates from common GWA data files where this information is usually stored in multiple columns. This tool also handles the reverse operation, which consists of splitting genomic coordinates into columns.

Input

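Gcoords2gcoords is a command-line tool that either splits a column of genomic coordinates into separate columns or merges separate columns into a single column of genomic coordinates. The option --to-gcoords selects the merge direction; without it, the tool splits.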

Splitting genomic coordinates into two or three columns requires a single input column containing the genomic coordinates to be split. The argument of the optional parameter --chr-prefix is prepended to each newly generated chromosome name.
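The split operation can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function name split_gcoord, the sets of accepted builds and chromosome names, and the validation rules are assumptions made for illustration, chosen so that unparsable values are reported as --.

```python
# Hypothetical sketch of the split operation; not the tool's real code.
# Assumption: only these build names are accepted.
KNOWN_BUILDS = {"NCBI36", "GRCh37"}
# Assumption: human chromosome names 1-22, X, Y and MT are accepted.
KNOWN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}
MISSING = "--"


def split_gcoord(gcoord, chr_prefix="", inc_build=False):
    """Split 'build:chromosome:position' into separate column values.

    Returns [chromosome, position], preceded by the build when
    inc_build is True. Values that cannot be parsed yield '--' cells.
    """
    width = 3 if inc_build else 2
    parts = gcoord.split(":")
    if len(parts) != 3:
        return [MISSING] * width
    build, chrom, pos = parts
    valid = (
        build in KNOWN_BUILDS
        and chrom.upper() in KNOWN_CHROMOSOMES
        and pos.isdigit()
    )
    if not valid:
        return [MISSING] * width
    columns = [chr_prefix + chrom, pos]
    return [build] + columns if inc_build else columns


if __name__ == "__main__":
    for value in ["GRCh37:2:114357350", "ABCd11:25:33333333", "Nonsense"]:
        print(value, split_gcoord(value, chr_prefix="chr", inc_build=True))
```

In this sketch the prefix is added only after validation, so the chromosome name is checked in its bare form.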

Conversely, generating a column of genomic coordinates requires two input columns: chromosome name and position, in that order. If the chromosome names carry a prefix, the parameter --chr-prefix specifies the string to remove in order to recover the bare chromosome name. The optional parameter -b defines the genome build of the input dataset. If it is not specified, the default genome build name is used.
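The generation of genomic coordinates can be sketched in a similar way. Again, this is only an illustration under stated assumptions: the function name to_gcoord, the DEFAULT_BUILD value and the checks for missing data are hypothetical and are not taken from the tool itself.

```python
# Hypothetical sketch of the generation operation; not the tool's real code.
MISSING = "--"
# Assumption: the tool's actual default build name is not stated here.
DEFAULT_BUILD = "GRCh37"


def to_gcoord(chrom, pos, build=DEFAULT_BUILD, chr_prefix=""):
    """Merge chromosome name and position into 'build:chromosome:position'."""
    if chrom == MISSING or pos == MISSING:
        return MISSING
    # Strip the prefix (for example 'chr') to recover the bare chromosome name.
    if chr_prefix and chrom.startswith(chr_prefix):
        chrom = chrom[len(chr_prefix):]
    if not chrom or not pos.isdigit():
        return MISSING
    return f"{build}:{chrom}:{pos}"


if __name__ == "__main__":
    print(to_gcoord("chr15", "38991028", build="NCBI36", chr_prefix="chr"))
    print(to_gcoord("--", "100000", build="NCBI36", chr_prefix="chr"))
```

The sketch keeps the chromosome name's original case after the prefix is removed, so chry becomes y rather than Y.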

Options applicable to more than a single tool are summarized in common command line options.

Output

If the --to-gcoords parameter is used, two columns are merged into one column of genomic coordinates; if it is omitted, genomic coordinates are split into two or three columns.

By default, the split operation (indicated by the absence of --to-gcoords) adds two new columns to the input file: Chromosome Name and Position. If --inc-build is supplied, an additional column labeled Build is added before the Chromosome Name and Position columns.

The generation of genomic coordinates, on the other hand, adds one new column, Genomic Coordinates, to the input file.

For example, given the input file /tmp/gcoords-in.tsv:

Genomic Coordinates
GRCh37:1:13668
GRCh37:2:114357350
GRCh37:11:65270104
GRCh37:X:41277944
ABCd11:25:33333333
Nonsense
--

then the following command splits the genomic coordinates, adds the Build column and prefixes the chromosome name with chr:

$ python Gcoords2gcoords.py -H -c 1 --inc-build --chr-prefix chr --from-file /tmp/gcoords-in.tsv

The result file looks like this:

Genomic Coordinates    Build    Chromosome Name    Position
GRCh37:1:13668         GRCh37   chr1               13668
GRCh37:2:114357350     GRCh37   chr2               114357350
GRCh37:11:65270104     GRCh37   chr11              65270104
GRCh37:X:41277944      GRCh37   chrX               41277944
ABCd11:25:33333333     --       --                 --
Nonsense               --       --                 --
--                     --       --                 --

Given instead another input file, /tmp/gcoords-in2.tsv:

SNP          CHR      POS
rs3121561    chr1     980243
rs1077918    chr15    38991028
rs2472394    chrX     153424545
rs2562130    chry     57747936
badData      --       100000
nonono       --       --

then the following command generates genomic coordinates:

$ python Gcoords2gcoords.py --to-gcoords -H -c 2 3 -b NCBI36 --chr-prefix chr --from-file /tmp/gcoords-in2.tsv

The result file will be:

SNP          CHR      POS          Genomic Coordinates
rs3121561    chr1     980243       NCBI36:1:980243
rs1077918    chr15    38991028     NCBI36:15:38991028
rs2472394    chrX     153424545    NCBI36:X:153424545
rs2562130    chry     57747936     NCBI36:y:57747936
badData      --       100000       --
nonono       --       --           --