Data Integrator
ClinVarAnnotator - Retrieve information from ClinVar (NCBI)

This tool annotates genomic variations and genomic regions with data from the ClinVar database [1]. It supports two main usage scenarios:

  • Annotate a list of genomic coordinates with ClinVar attributes, if these genomic coordinates are in the ClinVar database (optionally requiring a ref-alt match).
  • Retrieve all ClinVar entries that lie in specified genomic regions.

Input

The ClinVarAnnotator operates on tabular data. The input file has to be specified with the command line option --from-file. It is also possible to read from standard input via --from-file - <<<$'Something'.

To cover the scenarios mentioned above, the tool accepts three types of input. The type is selected with the --in option, and the -c option gives the column indices in the input file that hold the information required by the selected --in type.

Required input:

  • --in option: The type of analysis to perform. This must be one of:
    • gcoords: A genomic coordinate in the human genome. ClinVar annotations are added for matching genomic coordinates.
    • gcoords-ref-alt: A genomic coordinate in the genome plus a reference and an alternative allele. ClinVar annotations are added for matching coordinates if the input reference and alternative alleles match the ClinVar reference and alternative alleles.
    • range-abs: A range spanning a region on one chromosome. All ClinVar variants that lie in the specified range are added, together with their annotations.
  • -c x [y z]: The column indices in the input file where the data is located. Column indices start with one:
    • If --in gcoords: One column index that gives the location of the genomic coordinates in the input file.
    • If --in gcoords-ref-alt: Three space-separated column indices that give the locations of the genomic coordinate, the reference allele, and the alternative allele in the input file, respectively.
    • If --in range-abs: Two space-separated column indices. The first column holds the start coordinate and the second the end coordinate. Both coordinates must be on the same chromosome and must not span a negative range. Further, the start and end coordinates themselves must not be ranges.
  • --from-file: A tab-separated text file which holds the information as indicated with the --in and -c options.
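
The per-mode column requirements and the range-abs constraints listed above can be checked before the tool is called. The following is a minimal sketch and not part of ClinVarAnnotator: the function names are hypothetical, and the assembly:chromosome:position layout of a coordinate is an assumption inferred from the examples on this page.

```python
# Number of -c column indices each --in mode expects (taken from the list above).
REQUIRED_COLUMNS = {"gcoords": 1, "gcoords-ref-alt": 3, "range-abs": 2}


def check_columns(mode, columns):
    """Raise ValueError if the -c indices do not fit the chosen --in mode."""
    if mode not in REQUIRED_COLUMNS:
        raise ValueError(f"unknown --in mode: {mode!r}")
    expected = REQUIRED_COLUMNS[mode]
    if len(columns) != expected:
        raise ValueError(f"{mode} needs {expected} column(s), got {len(columns)}")
    if any(index < 1 for index in columns):
        raise ValueError("column indices start with one")


def parse_gcoord(text):
    """Split 'assembly:chromosome:position' (layout assumed from the examples)."""
    assembly, chromosome, position = text.split(":")
    # int() fails on anything that is not a single integer position.
    return assembly, chromosome, int(position)


def check_range(start, end):
    """Check the range-abs constraints: same chromosome, non-negative span."""
    start_assembly, start_chrom, start_pos = parse_gcoord(start)
    end_assembly, end_chrom, end_pos = parse_gcoord(end)
    if (start_assembly, start_chrom) != (end_assembly, end_chrom):
        raise ValueError("start and end must be on the same chromosome")
    if end_pos < start_pos:
        raise ValueError("range must not be negative")
    return start_pos, end_pos
```

For instance, check_columns("range-abs", [1, 2]) passes silently, while check_range("GRCh37:1:200", "GRCh37:2:300") raises a ValueError because the chromosomes differ.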

Optional input: The ClinVar annotations that are to be appended to the output file. All available annotations can be added with the --all flag. The available annotations are:

  • --CLNSIG: Variant Clinical Significance
  • --CLNORIGIN: Allele Origin
  • --CLNDBN: Variant disease name
  • --CLNSRC: Variant Clinical Channels (variant source)
  • --CLNACC: Variant Accession and Versions
  • --PM: Variant is Precious (Clinical, Pubmed Cited)
  • --ref-alt: The reference and alternative allele as given in ClinVar
  • --gcoords: The genomic position of the variant in Dintor format
  • --id: The variant id as given in ClinVar (usually dbSNP)
  • --all: Abbreviation for adding all annotations. Giving this flag is equivalent to --CLNSIG --CLNORIGIN --CLNDBN --CLNSRC --CLNACC --PM --ref-alt --gcoords --id
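
When the tool is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds the list and does not run anything. The helper names expand_annotations and build_command are hypothetical, and the -H flag is copied from the example commands on this page without its meaning being defined here, so treat it as an assumption.

```python
# The individual annotation flags that --all stands for, in the order listed above.
ALL_ANNOTATIONS = [
    "--CLNSIG", "--CLNORIGIN", "--CLNDBN", "--CLNSRC", "--CLNACC",
    "--PM", "--ref-alt", "--gcoords", "--id",
]


def expand_annotations(flags):
    """Replace --all with the individual flags, keeping order and dropping repeats."""
    expanded = []
    for flag in flags:
        for item in (ALL_ANNOTATIONS if flag == "--all" else [flag]):
            if item not in expanded:
                expanded.append(item)
    return expanded


def build_command(mode, columns, annotations, from_file, header=True):
    """Return an argument list shaped like the example commands on this page."""
    argv = ["python", "ClinVarAnnotator.py"]
    if header:
        argv.append("-H")  # assumption: taken verbatim from the examples
    argv += ["--in", mode, "-c", *map(str, columns)]
    argv += expand_annotations(annotations)
    argv += ["--from-file", from_file]
    return argv
```

The returned list can be handed to whatever process launcher the surrounding script already uses.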

Additional options applicable to more than a single tool (such as --help, --data-version or --version) are summarized in common command line options.

Example input 1:

gcoord
GRCh37:1:11082349

Example input 2:

gcoord               ref      alt
GRCh37:1:2340118     G        A
GRCh37:1:2452273     C        T

Example input 3:

range_start             range_end
GRCh37:1:11082348       GRCh37:1:11082461
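
The example inputs above are shown with aligned columns, but the tool expects tab-separated files. As a small sketch of one way to produce such a file, the hypothetical helper write_input below writes the rows of example input 2 with Python's csv module; the temporary directory is used only to keep the example self-contained.

```python
import csv
import os
import tempfile


def write_input(path, header, rows):
    """Write a header line and data rows as a tab-separated text file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Rows of example input 2, written to a throwaway location.
path = os.path.join(tempfile.mkdtemp(), "gcoords-ref-alt.tsv")
write_input(
    path,
    ["gcoord", "ref", "alt"],
    [["GRCh37:1:2340118", "G", "A"], ["GRCh37:1:2452273", "C", "T"]],
)

with open(path) as handle:
    content = handle.read()
```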

Output

As with other Dintor tools, the output is the input, plus added columns. These added columns correspond to the annotations that the user requested. If there are multiple entries in ClinVar that match one input line, the input line is output multiple times, once for each matching ClinVar entry. One ClinVar entry may contain multiple pipe-separated values; these are values from different submissions.
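
Because one input line can appear several times and a single field can hold several pipe-separated values, downstream scripts usually need to regroup the rows. The following is a rough sketch of that step, assuming tab-separated output with a header line. The function name group_by_input is hypothetical, and the sample text is made up for illustration (in particular the pathogenic|untested value); it is not real tool output.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape described above: tab-separated, header first.
SAMPLE_OUTPUT = (
    "gcoord\tClinVar.Ref\tClinVar.Alt\tClinVar.CLNSIG\n"
    "GRCh37:1:11082349\tG\tA\tpathogenic|untested\n"
    "GRCh37:1:11082349\tG\tC\tpathogenic\n"
)


def group_by_input(handle, key_column="gcoord"):
    """Collect all ClinVar entries per input coordinate, splitting values on '|'."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row.pop(key_column)
        # Each remaining field becomes a list with one item per submission.
        grouped[key].append({name: value.split("|") for name, value in row.items()})
    return dict(grouped)


result = group_by_input(io.StringIO(SAMPLE_OUTPUT))
```

Here result maps the single input coordinate to two entries, one per ClinVar match, and the first entry's ClinVar.CLNSIG field is a two-item list.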

Example output 1: Given the input file from example 1 above, saved as /tmp/gcoords.tsv, the following command line appends ClinVar annotations if the genomic coordinate matches.

$ python ClinVarAnnotator.py -H --in gcoords -c 1 --ref-alt --CLNSIG --from-file /tmp/gcoords.tsv
gcoord                 ClinVar.Ref    ClinVar.Alt     ClinVar.CLNSIG
GRCh37:1:11082349      G              A               pathogenic
GRCh37:1:11082349      G              C               pathogenic

Example output 2: Given the input file from example 2 above, saved as /tmp/gcoords-ref-alt.tsv, the following command line appends ClinVar annotations if the genomic coordinate and the ref and alt alleles match.

$ python ClinVarAnnotator.py -H --in gcoords-ref-alt -c 1 2 3 --ref-alt --CLNSIG --CLNORIGIN --from-file /tmp/gcoords-ref-alt.tsv
gcoord               ref   alt     ClinVar.Ref    ClinVar.Alt     ClinVar.CLNSIG       ClinVar.CLNORIGIN
GRCh37:1:2340118     G     A       G              A               pathogenic           unknown
GRCh37:1:2452273     C     T       C              T               untested             somatic

Example output 3: Given the input file from example 3 above, saved as /tmp/range.tsv, the following command line appends ClinVar annotations of variants that lie in the given range.

$ python ClinVarAnnotator.py -H --in range-abs -c 1 2 --gcoords --ref-alt --CLNSIG --from-file /tmp/range.tsv
range_start             range_end               ClinVar.Gcoords          ClinVar.Ref    ClinVar.Alt     ClinVar.CLNSIG
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082349        G              A               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082349        G              C               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082358        G              A               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082397        A              G               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082409        G              A               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082425        C              T               untested
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082457        C              A               pathogenic
GRCh37:1:11082348       GRCh37:1:11082461       GRCh37:1:11082461        G              A               pathogenic

References

[1] Landrum, Melissa J., et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research (2013) [PMID 24234437]