Data Integrator
MendelianFilter - VCF-based filter for removing variants that do not comply with Mendelian inheritance.

Some phenotypic traits and diseases are inherited via Mendelian inheritance patterns. Often it is of interest to identify the variants, alleles, or genes that segregate via these patterns in a family or a group of unrelated individuals. This program filters a multi-sample variant file and removes variants for which the genotypes of individuals with the phenotype (affected) and without the phenotype (unaffected) do not agree with an assumed Mendelian mode of inheritance. The filtering is flexible with respect to the penetrance and detectance of the variants.

The algorithm is based on predefined genotypes that are expected in causal variants of affected individuals for each Mendelian mode of inheritance. For each variant, the program checks whether the affected individuals have these genotypes, subject to a detectance threshold, and whether the unaffected individuals do not have these genotypes, subject to a penetrance threshold. In addition, the p-value of a Fisher exact test is computed under the null hypothesis that a genotype is distributed equally between affected and unaffected individuals.
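
The check described above can be sketched in Python as follows. This is a minimal illustration, not the program's actual implementation. It assumes that detectance is read as the fraction of affected individuals carrying an expected genotype, and penetrance as the fraction of carriers that are affected. The function names and the two-sided form of the Fisher test are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def evaluate_variant(affected, unaffected, detectance=1.0, penetrance=1.0):
    """Hypothetical helper: decide whether one variant passes one mode.

    affected / unaffected: lists of booleans, one per evaluated sample,
    True if the sample carries a genotype expected for the mode.
    Returns (passes, p_value).
    """
    aff_with, unaff_with = sum(affected), sum(unaffected)
    carriers = aff_with + unaff_with
    # Assumed reading: detectance = share of affected samples that are carriers.
    detect_ok = len(affected) == 0 or aff_with / len(affected) >= detectance
    # Assumed reading: penetrance = share of carriers that are affected.
    penetr_ok = carriers == 0 or aff_with / carriers >= penetrance
    p_value = fisher_exact_two_sided(aff_with, len(affected) - aff_with,
                                     unaff_with, len(unaffected) - unaff_with)
    return detect_ok and penetr_ok, p_value
```

With both thresholds at 1, a single affected non-carrier or a single unaffected carrier is enough to filter the variant out; lowering either threshold tolerates a corresponding share of exceptions.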

Details on the model underlying this program are also given in this document.

Input

Required input:

  • A multi-sample variant file in VCF format. The input file must conform to the VCF file specification. If a DP FORMAT field is present, the tool expects it to be greater than zero whenever the GT field is populated. Otherwise the sample is ignored in the evaluation of this variant.
  • A pedigree file that holds the sample ids as given in the VCF file, sample sex, and sample phenotypes in PED format. The Ped id, father id, and mother id fields need not be populated.
  • Either a Mendelian mode of inheritance or a flag that indicates that the filtering should be computed for all / any Mendelian modes of inheritance (also see next bullet). The different modes are:
    • autosomal dominant: The affected individuals are expected to have at least one identical alternative allele at an autosomal candidate variant (het ref/alt, het alt/alt, hom alt, hem alt).
    • autosomal recessive: The affected individuals are expected not to have a reference allele at an autosomal candidate variant (hom alt, het alt).
    • X dominant: The affected individuals are expected to have at least one identical alternative allele at an allosomal candidate variant. Men can only have the hem genotype (het ref/alt, het alt/alt, hom alt, hem alt).
    • X recessive: The affected individuals are expected not to have a reference allele at an allosomal candidate variant. Men can only have the hem genotype (hom alt, hem alt).
    • MT linked: The affected individuals are expected to have a hem alt genotype at a candidate variant on the mitochondrial DNA.
  • There are two different flags to achieve filtering for all Mendelian modes of inheritance:
    • any: The variant is retained if it satisfies any of the above models. The output format is VCF.
    • all: For each mode of inheritance, a flag and a corresponding p-value are appended to the input file to indicate whether the variant would pass the criteria for the given mode of inheritance. The output is no longer valid VCF.
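
The genotype expectations listed for each mode can be written down as a small lookup table. The sketch below is an illustration under assumptions, not the tool's code: the names EXPECTED, classify_genotype, and matches_mode are hypothetical, the "het alt" label of the autosomal recessive mode is read as "het alt/alt", and the requirement that affected individuals share an identical alternative allele is not checked here.

```python
import re

# Genotype classes expected in affected individuals, per mode (from the list above).
EXPECTED = {
    "autosomal dominant": {"het ref/alt", "het alt/alt", "hom alt", "hem alt"},
    # Assumption: "het alt" is read as a heterozygote of two alternative alleles.
    "autosomal recessive": {"hom alt", "het alt/alt"},
    "X dominant": {"het ref/alt", "het alt/alt", "hom alt", "hem alt"},
    "X recessive": {"hom alt", "hem alt"},
    "MT linked": {"hem alt"},
}


def classify_genotype(gt):
    """Map a VCF GT string such as '0/1' or '1|2' to a genotype class.

    Returns None for missing or unsupported calls.
    """
    alleles = re.split(r"[/|]", gt)
    if any(a in ("", ".") for a in alleles) or len(alleles) > 2:
        return None
    if len(alleles) == 1:
        # Haploid call, e.g. male X or mitochondrial DNA.
        return "hem ref" if alleles[0] == "0" else "hem alt"
    first, second = alleles
    if first == second:
        return "hom ref" if first == "0" else "hom alt"
    if "0" in (first, second):
        return "het ref/alt"
    return "het alt/alt"


def matches_mode(gt, mode):
    """True if the genotype is one of those expected for the given mode."""
    return classify_genotype(gt) in EXPECTED[mode]
```

For example, matches_mode("0/1", "autosomal dominant") is True, while the same genotype does not match "autosomal recessive" because it still carries a reference allele. Restricting each mode to the right chromosomes (autosomes, X, mitochondrial DNA) would be a separate step that this sketch leaves out.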

Optional input:

  • Penetrance and / or detectance as real values between 0 and 1. If neither is given, both values are set to 1, which corresponds to the strictest filtering. If one value is given, the other is set to the maximal possible value. If both values are given, they must not contradict each other. If they do, the user is informed and required to adjust their choice.
  • The minimal coverage to accept a variant call in a sample. If the coverage is below this value, the variant site is considered homozygous reference for this sample. Defaults to 1.
  • A flag to indicate how to handle variants with missing genotype information in a sample. If the flag --fail-missing-data is given, a variant that has missing genotype information for any sample in the PED file will never pass any mode of inheritance. If the flag is not given (default), the samples with missing genotype information for a variant are ignored in the evaluation of the variant.
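
The coverage and missing-data rules can be sketched as a per-sample preprocessing step. This is a hedged illustration only: the function names are hypothetical, and the way the DP rule from the required input interacts with the minimal coverage (a DP of zero leads to the sample being ignored, a DP above zero but below the minimum is treated as homozygous reference) is one possible reading of the text, not confirmed behaviour.

```python
def effective_genotype(gt, dp, min_coverage=1):
    """Return the GT string to evaluate, or None if the sample has no usable call.

    gt: GT string from the VCF, dp: DP value as int, or None if DP is absent.
    """
    if gt is None or "." in gt:
        return None  # missing genotype information
    if dp is not None:
        if dp <= 0:
            return None  # assumed: populated GT without coverage is ignored
        if dp < min_coverage:
            return "0/0"  # below minimal coverage: treated as homozygous reference
    return gt


def samples_to_evaluate(calls, min_coverage=1, fail_missing_data=False):
    """calls: dict mapping sample id to (gt, dp).

    Returns a dict of usable genotypes per sample, or None if the variant
    fails outright because fail_missing_data is set and a call is missing.
    """
    usable = {}
    for sample, (gt, dp) in calls.items():
        genotype = effective_genotype(gt, dp, min_coverage)
        if genotype is None:
            if fail_missing_data:
                return None
            continue  # default: ignore this sample for this variant
        usable[sample] = genotype
    return usable
```

Raising the minimal coverage therefore never removes a sample from the evaluation; it only turns low-coverage calls into reference calls, which makes the filter more conservative for affected individuals and more permissive for unaffected ones.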

Output

The output format depends on whether one Mendelian mode was given (-m mode), the filtering was tested for any mode (--any), or the filtering was tested for all modes (--all).

In the first two cases, the output is the multi-sample VCF input file without the variants that have been filtered out for this option. Except for the missing lines and the p-value of the Fisher test that has been added to the INFO field, the input file is not modified and the VCF file format is retained.

In the third case, two columns per mode are added to the right side of the VCF file, so the output is no longer valid VCF. The first column indicates whether the variant would be retained (1) or filtered out (0) for this mode. The second column gives the p-value of the Fisher exact test for this mode. The number of input variant lines is identical to the number of output variant lines.
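
Because the output with per-mode columns is no longer valid VCF, standard VCF parsers may reject it. The sketch below shows one way the appended columns could be split off again. It assumes tab-separated columns and that the caller knows the order in which the modes were appended; the function name and the mode order passed in are hypothetical and are not specified by this document.

```python
def parse_all_modes_line(line, modes):
    """Split one variant line into the original VCF fields and per-mode results.

    modes: mode names in the order their column pairs were appended
           (assumed to be known by the caller).
    Returns (vcf_fields, results) where results maps mode -> (retained, p_value).
    """
    fields = line.rstrip("\n").split("\t")
    n_extra = 2 * len(modes)
    vcf_fields, extra = fields[:-n_extra], fields[-n_extra:]
    results = {}
    for i, mode in enumerate(modes):
        flag, p_value = extra[2 * i], extra[2 * i + 1]
        # Flag column: 1 = retained, 0 = filtered out for this mode.
        results[mode] = (flag == "1", float(p_value))
    return vcf_fields, results
```

Joining vcf_fields with tabs again yields a line that can be written to a regular VCF file, for example after keeping only the variants retained for a chosen mode. Header lines starting with "#" are not handled by this function and would need to be passed through separately.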