This program converts a VCF file to the input format accepted by the Dintor tools. By default, the output includes only the fields required by Dintor. The VCF QUAL, FILTER, FORMAT, and sample columns can be appended to the output with the --add-all option.
Users should be aware of the following details of the conversion:
-
In general, Dintor encodes the coordinate, reference allele and alternative allele in the same way as the Ensembl variant format. The differences between the VCF and Ensembl formats are described on the Ensembl website and were used as the reference for the implementation of this program.
-
The Dintor format differs from the Ensembl format in the case of insertions. Ensembl encodes the coordinate of an insertion as chr:end-start, resulting in a negative coordinate range. Here start is the reference base before the insertion, so the actual insertion ranges from ]start, end]. In Dintor, an insertion is encoded as chr:start-start to avoid a negative coordinate range. Here start is the same coordinate as in Ensembl, namely the position of the reference base right before the insertion. The length of the insertion can be extracted from the alternative allele.
-
In case any field of a VCF variant line is empty (encoded by "." or "./."), it is replaced by the Dintor empty cell value.
-
A VCF variant line might contain information for multiple alternative alleles. The alt alleles are listed comma-separated in the ALT field. The INFO fields might also contain comma-separated values, one for each alternative allele. This tool produces one output line per alternative allele; the INFO field of each output line only contains the value for the respective alternative allele. The FORMAT fields are NOT modified.
-
Sample FORMAT fields of the VCF file are never modified. In case of multiple alt alleles they might therefore lose their meaning.
-
The conversion description on the Ensembl website referenced above is complete, but does not give detailed examples. Especially in the case of multiple alleles (e.g. two insertions, or one insertion and one deletion), the Ref and Alt alleles can have different lengths. Therefore, we define the VCF to Dintor conversion rules here:
-
SNP: Genomic coordinate, Ref and Alt allele are identical in VCF and Dintor format.
-
Deletion: In VCF, the reference base(s) preceding the deletion plus the deleted bases are given in the Ref field. In the Alt field, only the preceding reference base(s) are given. In the Dintor format, these preceding base(s) are stripped from the left side of the Ref allele. In most cases, there is only one preceding reference base, but multiple bases are possible. In VCF format, the genomic coordinate corresponds to the first base of the Ref field, which is the base directly before the deletion when there is only one preceding base. In Dintor format, the genomic range of the actual deletion is given. Therefore, the Dintor end coordinate is VCF position + length of VCF Ref - 1. Accordingly, the Dintor start position is end - deletion length + 1. The Dintor coordinate is chr:start-end, where start is the first and end the last deleted base. The deletion range is therefore [start, end]. The Dintor Alt allele is always "-".
-
Insertion: In VCF, the reference base(s) preceding the insertion are given in the Ref field. These preceding base(s) plus the inserted bases are given in the Alt field. In the Dintor format, these preceding base(s) are stripped from the left side of the Alt allele. In most cases, there is only one preceding reference base, but multiple bases are possible. Since the VCF genomic coordinate corresponds to the first base of the Ref field, the Dintor genomic coordinate equals the VCF coordinate + length of VCF Ref - 1. The Dintor coordinate is chr:start-start, where start is the reference base before the insertion. The insertion range is therefore [start+1, ?]. The Dintor Ref allele is always "-".
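The three rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code of VCF2Dint.py: the function name vcf_to_dintor is hypothetical, and the sketch assumes that the preceding reference base(s) are exactly the shorter allele, i.e. that one allele is a prefix of the other. Any record that does not fit this assumption is rejected.

```python
def vcf_to_dintor(chrom, pos, ref, alt):
    """Return (coordinate, ref, alt) in Dintor style for ONE VCF alt allele.

    Hypothetical helper that follows the SNP/deletion/insertion rules
    described in the text; it is not taken from VCF2Dint.py.
    """
    if len(ref) == 1 and len(alt) == 1:
        # SNP: coordinate and alleles are identical in VCF and Dintor.
        return f"{chrom}:{pos}", ref, alt

    if len(ref) > len(alt) and ref.startswith(alt):
        # Deletion: Alt holds only the preceding reference base(s),
        # so everything in Ref after that prefix is deleted.
        deleted = ref[len(alt):]
        end = pos + len(ref) - 1          # last deleted base
        start = end - len(deleted) + 1    # first deleted base
        return f"{chrom}:{start}-{end}", deleted, "-"

    if len(alt) > len(ref) and alt.startswith(ref):
        # Insertion: Ref holds only the preceding reference base(s),
        # so everything in Alt after that prefix is inserted.
        inserted = alt[len(ref):]
        start = pos + len(ref) - 1        # reference base before the insertion
        return f"{chrom}:{start}-{start}", "-", inserted

    # Assumption of this sketch: other variant types are not handled.
    raise ValueError(f"unsupported variant: {ref}>{alt}")


# One call per alt allele of a multi-allelic record:
for alt in "AAGAGAGAGAG,A".split(","):
    print(vcf_to_dintor("4", 143326312, "AAGAG", alt))
```

Calling the function once per alt allele mirrors the one-output-line-per-allele behaviour described earlier; the empty-cell replacement and the optional extra columns are left out of this sketch.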
Input
-
The VCF input file. This is a positional argument.
-
A flag indicating whether the QUAL, FILTER, INFO and sample fields of the VCF file should be appended to the Dintor output (optional).
-
The default genome build, e.g. GRCh37 (optional).
Options applicable to more than a single tool are summarized in common command line options.
Example input:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR065079
1 13683170 . G A 1066.4 PASS AC=4;AF=0.571 GT 0/1
1 6209769 rs4068875 A AATGG 934.32 PASS AC=4;AF=1.00 GT 1/1
1 886049 rs34581264 ACAG A 1107.3 PASS AC=4;AF=1.00 GT 1/1
1 6219287 rs140568531 TCACA T,TCA 421.43 PASS AC=2,4;AF=0.2 GT 0/1
4 143326312 rs150473961 AAGAG AAGAGAGAGAG,A 3064.0 PASS AC=3,2;AF=0.3 GT 1/2
10 95549836 rs61662431 C CT,CTT 389.2 PASS AC=3,4;AF=0.3 GT 0/1
17 21318629 rs58862472 C A,T 465.91 PASS AC=2,3;AF=0 GT 2/2
20 31677535 rs35660059 AC ACC,A 537.4 PASS AC=2,2;AF=0.5 GT 0/1
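Several records in this example carry two alt alleles and per-allele INFO values (e.g. AC=2,4). The per-allele INFO handling described earlier could look roughly like the following Python sketch. The function name split_info_per_allele is hypothetical, and the splitting criterion is an assumption made for illustration: a value is split only if it has exactly one comma-separated entry per alt allele, and every other entry is copied unchanged to each output line. The actual tool may decide differently.

```python
def split_info_per_allele(info, n_alt):
    """Return one INFO string per alt allele.

    Assumption of this sketch: a KEY=VALUE entry is treated as
    per-allele only if VALUE has exactly n_alt comma-separated items.
    """
    per_allele = [[] for _ in range(n_alt)]
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        values = value.split(",")
        for i in range(n_alt):
            if sep and len(values) == n_alt:
                # One value per alt allele: keep only the i-th value.
                per_allele[i].append(f"{key}={values[i]}")
            else:
                # Flags and values of any other length are copied as-is.
                per_allele[i].append(entry)
    return [";".join(fields) for fields in per_allele]


# INFO column of the rs140568531 record (ALT = T,TCA)
print(split_info_per_allele("AC=2,4;AF=0.2", 2))
```

Under this criterion, AC=2,4 is split between the two output lines, while AF=0.2, which has a single value, is repeated on both.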
Output
Given the above listed input as file /tmp/vcf.tsv, the following command line converts the VCF formatted variant file into a tab separated value (TSV) file in Dintor input format.
$ python VCF2Dint.py /tmp/vcf.tsv
Output:
ID Genomic Coordinates Ref Alt
-- 1:13683170 G A
rs4068875 1:6209769-6209769 - ATGG
rs34581264 1:886050-886052 CAG -
rs140568531 1:6219288-6219291 CACA -
rs140568531 1:6219290-6219291 CA -
rs150473961 4:143326316-143326316 - AGAGAG
rs150473961 4:143326313-143326316 AGAG -
rs61662431 10:95549836-95549836 - T
rs61662431 10:95549836-95549836 - TT
rs58862472 17:21318629 C A
rs58862472 17:21318629 C T
rs35660059 20:31677536-31677536 - C
rs35660059 20:31677536-31677536 C -