Data Integrator
gcoords2cons - Variation consequences

Briefly, this tool allows to compute consequence types and impact of amino acid substitutions on protein structure and function for arbitrary genomic variations in the human genome. Its functionality can broadly be divided into two categories: Impact of a mutant allele on a) the genome or protein level and b) on the function of the protein, if the gene (more precisely, the associated transcript) translates into a protein.

This tool utilizes the Ensembl Perl API.

Input

Mutations classified with this program comprise non-reference base-pair substitutions, and small insertions and deletions of human genomic DNA. All of these mutations can be described by a genomic coordinate and a pair of reference/mutation allele bases, which constitute the input set for this tool. The genomic coordinate does not necessarily need to be a valid dbSNP entry, all coordinates may serve as input.

Mutation consequences

Primary focus has been put on retrieving immediate consequences of a genomic mutation. Due to alternative splicing, different transcripts of the same gene may have frame shifted exons which translate into a completely different protein sequence. Therefore, it is necessary to analyze the impact of a mutation on a per transcript level. Utilizing the Ensembl database has several implications:

  • Restrict transcripts to any combination of the the following transcript data bases: HGNC, Clone based Ensembl, Clone based Vega, RFAM, and miRBase database. Transcripts from the HGNC database have a HGNC symbol associated with them, a sign that associated genes have been recognized and characterized. Clone based transcript databases describe items that do not have an associated HGNC symbol. See the Ensembl r75 gene build document and its related links for some more information.
  • Restrict to a subset of biotypes, for instance, only to protein coding transcripts. A comprehensive list of possible biotypes is given by the VErtebrate Genome Annotation (Vega) documentation web page.
  • A classification for mutation consequences already exists, including a heuristic ranking of individual mutations' impact on the gene product, which allows to filter for the highest-most impacting mutation over all known transcripts.

Prediction of mutations on protein function

Several tools exist to predict the effect of a mutation on the protein itself. For example, if a protein structure is known, the mutation may render an important residue, such as a substrate binding or active site residue, into non-functional amino acid, implying a severe damage to the protein's function. Two tools accomplishing such predections are available within the Ensembl framework: SIFT and PolyPhen2. In addition, a concensus method, Condel was shown to synergize the two methods and improve predictions. All three approaches are available within our Data Integration framework.

Options applicable to more than a single tool are summarized in common command line options.

Mutations, indels and strand detection

The gcoords2cons genomic coordinate specification for mutations, insertions and deletions is different than Ensembl's Variant Effect Predictor. Also, gcoords2cons automatically detects on which strand a variation is on by using the reference allele/s provided by the variation itself.

For convenience, here is how genomic coordinates can be specified: [BUILD:]START[-END], also see genomic coordinate.

BUILD is optional, as well as END if it specifies a single basepair variation. END must always be >= than START.

As a general rule, the length of the sequence as specified by END-START+1 should always match the length and position of the reference.

Mutations are specified as follows:

  • START should be the position of the mutation itself.
  • If a multi-allele mutation has been specified, END-START+1 should match the number of mutated alleles.
  • Both the reference and alternate alleles should have the same length.

Examples:

  • A valid single basepair mutation: GRCh37:10:72604246 T G
  • A valid multi-basepair mutation: GRCh37:10:72604246-72604247 TC GA

Deletions can be specified as follows:

  • START should match the position of the first reference allele.
  • END-START+1 should match the number of alleles specified in the reference.
  • The alternate alleles should be at least 1 basepair less than the reference.
  • The alternate allele can be "-" to indicate an empty sequence.

Examples:

  • A valid single basepair deletion: GRCh37:10:72604246 T -
  • A valid multi-basepair deletion: GRCh37:10:72604246-72604247 TC -
  • Another valid multi-basepair deletion which deletes the second allele at 72604247: GRCh37:10:72604246-72604247 TC T

Insertions can be specified as follows:

  • START should match the position of the first reference allele.
  • END-START+1 should match the number of alleles specified in the reference.
  • The alternate alleles should be at least 1 basepair more than the reference.
  • The reference allele can be "-" to indicate an empty sequence. In this case the alternate sequence is inserted at START directly, moving the original sequence forward by the length of the mutated alleles. Without a reference allele, the variation is assumed to be on the forward strand.

Examples:

  • A valid single basepair insertion: GRCh37:10:72604246 - T
  • A valid multi-basepair insertion: GRCh37:10:72604246 - TC

Mutations, insertions and deletions can be combined:

  • A valid mutation+deletion: GRCh37:10:72604246-72604247 TC A
  • A valid mutation+insertion: GRCh37:10:72604246 T CA
  • A valid multi-basepair mutation and insertion (with the insertion happening at 72604248): GRCh37:10:72604246-72604247 TC CAT

The reference alleles are also used to detect on which strand the mutation is happening. This is especially important for insertions, since the shortest insertion form: GRCh37:10:72604246 - T lacks a reference allele entirely. In that case, the forward (+) strand is always assumed.

There's no limit on how long the reference sequence can be, although a mis-matching reference will also cause gcoords2cons to assume the forward strand. If the variation has not been specified in minimal form gcoords2cons also complains about redundant alleles and displays the actual minimal variation as being calculated.

For example, the variation GRCh37:10:72604246-72604249 TCGT TCAT contains only a single basepair mutation which could be rewritten as GRCh37:10:72604248 G A. Both forms are accepted and calculated correctly (if the reference sequence matches the long form), though the former will cause a warning.

Output

Both types of analyses mentioned above can be run in either normal or most-severe mode. In the former, all transcripts and related scores, consequences types and scores are output line by line. In the latter case, only the consequence type of the gene with the highest impact (according to Ensembl ranking) is output. Some variations may also result in more than a single consequence having the same (highest) score: in this case, all the consequences having the same score are output too.

Be aware that even if the mutation is on an exon of a gene, it may happen that multiple lines will be printed, since the base at this position on the one hand is part of that gene, but on the other hand, it can also be in the UTR region or the up/downstream region of another gene. This becomes clear when outputting data in detail mode. In very specific cases, exons of different genes overlap, one exon is located on the forward strand, the other is on the reverse strand. In this case, also, the highest impacting consequence type is output.

Intergenic regions, even though labeled "INTERGENIC" in Ensembl, are reported by an empty cell. Otherwise, it is guaranteed that each valid genomic coordinate will have a consequence type. The reference allele is checked and used to determine the strandedness of the mutation. If it matches neither the forward nor the reverse strand, a warning is issued.

Mutation consequences

The following columns are output when running in detail mode:

  • Consequence type Human readable consequence type of the mutation's sequence ontology (SO) term, as provided by Ensembl.
  • Human Ensembl Gene ID Ensembl gene id for the consequence described in the corresponding line.
  • Transcript ID Ensembl transcript ID for this gene.
  • Canonical 1 if the Transcript is the canonical transcript ID, 0 otherwise.
  • Transcript database Database of origin for the transcript. See Mutation consequences for more details.
  • Transcript biotype Class of transcript. See Mutation consequences for more details.
  • Strand Transcription direction on DNA. The plus sign (+) stands for the forward (5' to 3'), the minus sign (-) signifies the reverse strand.
  • Consequence rank Ranking of consequences, where lower numbers have higher impact. See the list of GRCh37 Ensembl consequence types (r75) and GRCh38 Ensembl consequence types (r99), respectively.
  • Intron/exon number This is the intron or exon number of the variant in this transcript/gene. It may happen that the variant overlaps multiple introns/exons. In this case the result is written as A-B where A is the first overlapping exon and B is the last one.
  • Total introns/exons Total number of introns/exons in this transcript/gene.
  • Location The location (intron or exon) of the variation.
  • Codon change Return the original and resulting codon that this mutation is predicted to result in, separated with a slash.

Predicted deleteriousness of mutations on protein function

For mutations in protein coding regions, we utilize the SIFT and PolyPhen2 prediction methods for assigning a numeric score describing the (negative) influence of the mutation on protein function. All scores range from 0 to 1, the closer to score to 1, the more probable its deleterious influence. If the mutation is not a coding region, the empty cell is output in the score column(s).

We provide the following columns:

  • 1-SIFT This is the inverted original SIFT score. This was done, since we want to have all scores follow the same logics. In a publication by Di et al., the classification of scores is described as listed below:

    Score range Description
    1.00-0.95 Deleterious
    0.95-0.90 Potentially deleterious
    0.90-0.80 Borderline case
    0.80-0.00 Benign

  • PolyPhen The score output by PolyPhen. More details can be found in the PolyPhen Wiki, which is referred to by an email from the Ensembl list.

  • Condel The consensus prediction method, taking into account both SIFT and PolyPhen2 scores. A score higher than 0.469 is considered to be deleterious. (Based on their web server output, personal communication with Abel Gonzalez, 2012-10-24.)

HGVS Mutation nomenclature

HGVS is a standard for mutation nomenclature at several levels (DNA/RNA and protein level), which is based on the mutation's transcript.

gcoords2cons is able to generate HGVS identifiers at both the coding and protein level (which is automatically computed internally) for all the available transcripts in a mutation.

By using the --hgvs command line two additional columns are appended to the output file:

  • HGVSc Coding-level HGVS DNA identifier.
  • HGVSp More informative protein-level HGVS identifier.

Example

Given the input file /tmp/cons.tsv,

Genomic Coordinate  Ref. Allele  Mut. Allele
GCRh37:1:45111119   C            G
GRCh37:4:55394539   C            T

chris-cmd$ ./gcoords2cons.pl -H -c 1,2,3 –details –most-severe –sift </tmp/cons.tsv

will output the following lines:

Genomic Coordinate Ref. Allele Mut. Allele Consequence type      Transcript ID   Canonical Transcript biotype   Transcript database     Human Ensembl Gene ID   Strand  Consequence rank Intron/Exon number Total introns/exons Location Codon change  1-SIFT
GCRh37:1:45111119  C           G           missense_variant      ENST00000355387 1         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               12                 15                  exon     caG/caC       0.870
GCRh37:1:45111119  C           G           missense_variant      ENST00000361799 0         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               12                 15                  exon     caG/caC       0.870
GCRh37:1:45111119  C           G           missense_variant      ENST00000372247 0         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               11                 14                  exon     caG/caC       0.870
GCRh37:1:45111119  C           G           missense_variant      ENST00000443020 0         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               11                 14                  exon     caG/caC       0.870
GCRh37:1:45111119  C           G           missense_variant      ENST00000440132 0         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               8                  10                  exon     caG/caC       0.880
GCRh37:1:45111119  C           G           missense_variant      ENST00000335497 0         protein_coding       HGNC_transcript_name    ENSG00000187147         +       12               8                  9                   exon     caG/caC       0.870
GCRh37:1:45111119  C           G           missense_variant      ENST00000372244 0         protein_coding       HGNC_transcript_name    ENSG00000126106         -       12               2                  3                   exon     Ctg/Gtg       --
GRCh37:4:55394539  C           T           --                    --              --        --                   --                      --                      --      --               --                 --                  --       --            --

and the following to stderr:

gcoords2cons.pl WARN: ENOTFOUND(GRCh37:4:55394539): couldn't compute consequences for coordinate ``GRCh37:4:55394539''