Data Integrator
VCF2Dint - Conversion from VCF to Dintor input format.

This program converts a file in VCF format to the input format accepted by the Dintor tools. By default, the output includes only the fields required by Dintor. The VCF QUAL, FILTER, FORMAT, and sample columns can be appended to the output with the --add-all option.

The user should be aware of the following points about the conversion:

  • In general, the Dintor coding of the coordinate, reference allele and alternative allele is identical to the Ensembl variant format. The differences between the VCF and Ensembl formats are described on the Ensembl website and were used as the reference for the implementation of this program.
  • The Dintor format differs from the Ensembl format in the case of insertions. Ensembl encodes the coordinate of an insertion as chr:end-start, which results in a negative coordinate range. Here start is the reference base before the insertion, so the actual insertion covers the range ]start, end]. Dintor encodes an insertion as chr:start-start to avoid a negative coordinate range. Here start is the same coordinate as in Ensembl, namely the position of the reference base right before the insertion. The length of the insertion can be derived from the alternative allele.
  • If any field of a VCF variant line is empty (encoded as "." or "./."), it is replaced by the Dintor empty cell value.
  • A VCF variant line may contain information for multiple alternative alleles. The alt alleles are listed comma-separated in the ALT field. The INFO fields may also contain comma-separated values, one for each alternative allele. This tool produces one output line per alternative allele; the INFO field of each output line contains only the value for the respective alternative allele.
  • Sample FORMAT fields of the VCF file are never modified. In the case of multiple alt alleles they may therefore lose their meaning, for example a genotype of 1/2 still refers to the allele order of the original VCF line.
  • The conversion description on the Ensembl website referenced above is complete, but does not give detailed examples. In particular, when a line has multiple alternative alleles (e.g. two insertions, or one insertion and one deletion), the Ref and Alt alleles can have different lengths. Therefore, the VCF to Dintor conversion rules are defined here:
    • SNP: Genomic coordinate, Ref and Alt allele are identical in VCF and Dintor format.
    • Deletion: In VCF the reference base(s) preceding the deletion plus the deleted bases are given in the Ref field. In the Alt field, only the preceding reference base(s) are given. In the Dintor format, these preceding base(s) are stripped off the left side of the Ref allele. In most cases there is only one preceding reference base, but multiple bases are possible. In VCF format, the genomic coordinate is the position of the first base of the Ref field (the base directly before the deletion when there is a single preceding base); in Dintor format, the genomic range of the actual deletion is given. Therefore, the Dintor end coordinate is VCF position + length of VCF Ref - 1. Accordingly, the Dintor start position is end - deletion length + 1. The Dintor coordinate is chr:start-end, where start is the first and end the last deleted base. The deletion range is therefore [start, end]. The Dintor Alt allele is always "-".
    • Insertion: In VCF the reference base(s) preceding the insertion are given in the Ref field. These preceding base(s) plus the inserted bases are given in the Alt field. In the Dintor format, these preceding base(s) are stripped off the left side of the Alt allele. In most cases there is only one preceding reference base, but multiple bases are possible. Since in VCF the genomic coordinate is the position of the first base of the Ref field, the Dintor genomic coordinate equals the VCF coordinate + length of VCF Ref - 1. The Dintor coordinate is chr:start-start, where start is the reference base before the insertion. The insertion range is therefore [start+1, ?]. The Dintor Ref allele is always "-".
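
The three rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual VCF2Dint.py code: the function name vcf_to_dintor and its signature are hypothetical, and the sketch assumes that the preceding base(s) are exactly the shorter allele, so that stripping them is a simple prefix removal. Anything that is not a SNP, a deletion or an insertion in this sense is rejected.

```python
def vcf_to_dintor(chrom, pos, ref, alt):
    """Return (coordinate, ref, alt) in Dintor coding for ONE alt allele.

    Hypothetical helper for illustration; not part of the Dintor tools.
    """
    if len(ref) == 1 and len(alt) == 1:
        # SNP: coordinate and alleles are taken over unchanged.
        return f"{chrom}:{pos}", ref, alt
    if len(ref) > len(alt) and ref.startswith(alt):
        # Deletion: Alt holds only the preceding base(s); strip them from Ref.
        deleted = ref[len(alt):]
        end = pos + len(ref) - 1          # last deleted base
        start = end - len(deleted) + 1    # first deleted base
        return f"{chrom}:{start}-{end}", deleted, "-"
    if len(alt) > len(ref) and alt.startswith(ref):
        # Insertion: Ref holds only the preceding base(s); strip them from Alt.
        inserted = alt[len(ref):]
        start = pos + len(ref) - 1        # reference base before the insertion
        return f"{chrom}:{start}-{start}", "-", inserted
    # Assumption: other variant types are outside the rules described here.
    raise ValueError(f"unsupported variant: {ref}>{alt}")


# One deletion and one insertion from the same multi-allelic VCF line.
print(vcf_to_dintor("4", 143326312, "AAGAG", "A"))
print(vcf_to_dintor("4", 143326312, "AAGAG", "AAGAGAGAGAG"))
```

Applied to the rs150473961 line of the example input below, the two calls yield 4:143326313-143326316 with Ref AGAG, and 4:143326316-143326316 with Alt AGAGAG.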

Input

  • The VCF input file. This is a positional argument.
  • A flag indicating whether the QUAL, FILTER, INFO and sample fields of the VCF file should be appended to the Dintor output (optional).
  • The default genome build, e.g. GRCh37 (optional).

Options applicable to more than a single tool are summarized in common command line options.

Example input:

#CHROM  POS        ID           REF     ALT            QUAL     FILTER  INFO            FORMAT  SRR065079
1       13683170   .            G       A              1066.4   PASS    AC=4;AF=0.571   GT      0/1
1       6209769    rs4068875    A       AATGG          934.32   PASS    AC=4;AF=1.00    GT      1/1
1       886049     rs34581264   ACAG    A              1107.3   PASS    AC=4;AF=1.00    GT      1/1
1       6219287    rs140568531  TCACA   T,TCA          421.43   PASS    AC=2,4;AF=0.2   GT      0/1
4       143326312  rs150473961  AAGAG   AAGAGAGAGAG,A  3064.0   PASS    AC=3,2;AF=0.3   GT      1/2
10      95549836   rs61662431   C       CT,CTT         389.2    PASS    AC=3,4;AF=0.3   GT      0/1
17      21318629   rs58862472   C       A,T            465.91   PASS    AC=2,3;AF=0     GT      2/2
20      31677535   rs35660059   AC      ACC,A          537.4    PASS    AC=2,2;AF=0.5   GT      0/1
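
Several of these example lines carry two alternative alleles and comma-separated INFO values. The following Python sketch shows one way such a line could be split into one record per allele and how empty cells could be replaced. It is an assumption-laden illustration, not the tool's code: the names split_alleles, clean_cell and EMPTY are hypothetical, "--" is taken as the Dintor empty cell value, and an INFO value is treated as per-allele only when its number of comma-separated items equals the number of alt alleles.

```python
EMPTY = "--"  # assumption: Dintor empty cell value


def clean_cell(cell):
    """Replace the VCF empty markers "." and "./." by the empty cell value."""
    return EMPTY if cell in (".", "./.") else cell


def split_alleles(alt_field, info_field):
    """Return a list of (alt, info) pairs, one per alternative allele."""
    alts = alt_field.split(",")
    records = []
    for i, alt in enumerate(alts):
        entries = []
        for entry in info_field.split(";"):
            key, sep, value = entry.partition("=")
            values = value.split(",")
            # Assumption: only lists with one item per alt allele are split;
            # everything else (e.g. a single AF value) is copied unchanged.
            if sep and len(values) == len(alts):
                value = values[i]
            entries.append(key + sep + value)
        records.append((alt, ";".join(entries)))
    return records


for alt, info in split_alleles("T,TCA", "AC=2,4;AF=0.2"):
    print(alt, info)
print(clean_cell("."))
```

For the rs140568531 line this gives T with AC=2;AF=0.2 and TCA with AC=4;AF=0.2. The sample column (GT) is deliberately left out of the sketch, since the FORMAT fields are never modified.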

Output

Given the input listed above as file /tmp/vcf.tsv, the following command line converts the VCF-formatted variant file into a tab-separated values (TSV) file in Dintor input format.

$ python VCF2Dint.py /tmp/vcf.tsv

Output:

ID             Genomic Coordinates      Ref       Alt
--             1:13683170               G         A
rs4068875      1:6209769-6209769        -         ATGG
rs34581264     1:886050-886052          CAG       -
rs140568531    1:6219288-6219291        CACA      -
rs140568531    1:6219290-6219291        CA        -
rs150473961    4:143326316-143326316    -         AGAGAG
rs150473961    4:143326313-143326316    AGAG      -
rs61662431     10:95549836-95549836     -         T
rs61662431     10:95549836-95549836     -         TT
rs58862472     17:21318629              C         A
rs58862472     17:21318629              C         T
rs35660059     20:31677536-31677536     -         C
rs35660059     20:31677536-31677536     C         -