Data Integrator
The Gcoords2gcoords tool generates genomic coordinates from common GWA data files where this information is usually stored in multiple columns. This tool also handles the reverse operation, which consists of splitting genomic coordinates into columns.
Gcoords2gcoords is a command-line tool that either splits a column of genomic coordinates into separate columns or merges columns into genomic coordinates; the option --to-gcoords selects between the two operations.
Splitting a genomic coordinate into two or three columns requires a single input column containing the genomic coordinates to be split. The argument of the optional parameter --chr-prefix is used as a prefix for each newly generated chromosome name.
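The split operation can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name split_gcoord, the KNOWN_BUILDS set, and the validation rules are assumptions made for illustration. The only things taken from this page are the BUILD:CHROMOSOME:POSITION layout, the optional prefix, and the "--" placeholder for values that cannot be parsed.

```python
MISSING = "--"  # placeholder used in the example output for unparseable values
KNOWN_BUILDS = {"GRCh37", "NCBI36"}  # assumption: only the builds named on this page


def split_gcoord(gcoord, chr_prefix="", inc_build=False):
    """Split 'BUILD:CHR:POS' into [build,] chromosome name, position.

    Hypothetical helper; the real tool's validation may differ.
    """
    width = 3 if inc_build else 2
    parts = gcoord.split(":")
    if len(parts) != 3:
        return [MISSING] * width
    build, chrom, pos = parts
    # Assumed checks: known build, non-empty chromosome, numeric position.
    if build not in KNOWN_BUILDS or not chrom or not pos.isdigit():
        return [MISSING] * width
    fields = [chr_prefix + chrom, pos]
    return [build] + fields if inc_build else fields
```

With these assumptions, split_gcoord("GRCh37:2:114357350", chr_prefix="chr", inc_build=True) yields the build, the prefixed chromosome name, and the position as three separate values, while a malformed input such as "Nonsense" yields only placeholders.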
Conversely, generating a column with genomic coordinates requires two input columns: chromosome name and position, in that order. If the chromosome names carry a prefix string, the parameter --chr-prefix specifies that string so that it can be removed to recover the bare chromosome name. The optional parameter -b defines the genome build of the input dataset. If it is not specified, the default genome build name is used instead.
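The reverse direction can be sketched in the same way. Again, this is an illustrative assumption rather than the tool's implementation: to_gcoord is a hypothetical name, the build is passed explicitly because this page does not say which build is the default, and the validity checks are guesses based on the example output.

```python
MISSING = "--"  # placeholder used in the example output for unparseable values


def to_gcoord(chrom, pos, build, chr_prefix=""):
    """Join chromosome name and position into 'BUILD:CHR:POS'.

    Hypothetical helper; the real tool's validation may differ.
    """
    # Strip the prefix (for example 'chr') to recover the bare chromosome name.
    if chr_prefix and chrom.startswith(chr_prefix):
        chrom = chrom[len(chr_prefix):]
    # Assumed checks: chromosome present and position numeric.
    if chrom in ("", MISSING) or not pos.isdigit():
        return MISSING
    return f"{build}:{chrom}:{pos}"
```

Under these assumptions, to_gcoord("chr15", "38991028", "NCBI36", chr_prefix="chr") produces a coordinate without the chr prefix, and a row whose chromosome or position is "--" produces the placeholder.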
Options applicable to more than a single tool are summarized in common command line options.
If the --to-gcoords parameter is used, two columns are merged into one column of genomic coordinates; if it is omitted, genomic coordinates are split into two or three columns.
By default, for the split operation (indicated by the absence of --to-gcoords), two new columns are added to the input file: Chromosome Name and Position. If --inc-build is supplied, an additional column labeled Build is added before the Chromosome Name and Position columns.
For the generation of genomic coordinates, on the other hand, one new column, Genomic Coordinates, is added to the input file.
For example, given the input file /tmp/gcoords-in.tsv:
Genomic Coordinates
GRCh37:1:13668
GRCh37:2:114357350
GRCh37:11:65270104
GRCh37:X:41277944
ABCd11:25:33333333
Nonsense
--
then the following command splits the genomic coordinates, adds the Build column and prefixes the chromosome name with chr:
$ python Gcoords2gcoords.py -H -c 1 --inc-build --chr-prefix chr --from-file /tmp/gcoords-in.tsv
The result file looks like this:
Genomic Coordinates   Build    Chromosome Name   Position
GRCh37:1:13668        GRCh37   chr1              13668
GRCh37:2:114357350    GRCh37   chr2              114357350
GRCh37:11:65270104    GRCh37   chr11             65270104
GRCh37:X:41277944     GRCh37   chrX              41277944
ABCd11:25:33333333    --       --                --
Nonsense              --       --                --
--                    --       --                --
Given instead another input file /tmp/gcoords-in2.tsv:
SNP         CHR     POS
rs3121561   chr1    980243
rs1077918   chr15   38991028
rs2472394   chrX    153424545
rs2562130   chry    57747936
badData     --      100000
nonono      --      --
then the following command generates genomic coordinates:
$ python Gcoords2gcoords.py --to-gcoords -H -c 2 3 -b NCBI36 --chr-prefix chr --from-file /tmp/gcoords-in2.tsv
The result file will be:
SNP         CHR     POS         Genomic Coordinates
rs3121561   chr1    980243      NCBI36:1:980243
rs1077918   chr15   38991028    NCBI36:15:38991028
rs2472394   chrX    153424545   NCBI36:X:153424545
rs2562130   chry    57747936    NCBI36:y:57747936
badData     --      100000      --
nonono      --      --          --
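To make the column-appending behaviour of the merge example concrete, here is a rough, self-contained Python sketch that processes tab-separated text with a header row. It does not reproduce the tool itself: append_gcoords is a hypothetical function, the 1-based column numbering mirrors how -c 2 3 appears to be used in the command above, and the validity checks are assumptions inferred from the example output.

```python
import csv
import io

MISSING = "--"  # placeholder used in the example output for unparseable values


def append_gcoords(tsv_text, chr_col, pos_col, build, chr_prefix=""):
    """Append a 'Genomic Coordinates' column to tab-separated text.

    chr_col and pos_col are 1-based column numbers (assumption).
    The first row is treated as a header.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader) + ["Genomic Coordinates"])
    for row in reader:
        if not row:  # skip blank lines
            continue
        chrom, pos = row[chr_col - 1], row[pos_col - 1]
        # Strip the prefix to recover the bare chromosome name.
        if chr_prefix and chrom.startswith(chr_prefix):
            chrom = chrom[len(chr_prefix):]
        # Assumed checks: chromosome present and position numeric.
        valid = chrom not in ("", MISSING) and pos.isdigit()
        writer.writerow(row + [f"{build}:{chrom}:{pos}" if valid else MISSING])
    return out.getvalue()
```

Calling append_gcoords(text, 2, 3, "NCBI36", chr_prefix="chr") on the contents of a file like /tmp/gcoords-in2.tsv keeps every original column and adds one trailing column, which matches the shape of the result shown above.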