Data Integrator
|
Briefly, this tool allows to compute consequence types and impact of amino acid substitutions on protein structure and function for arbitrary genomic variations in the human genome. Its functionality can broadly be divided into two categories: Impact of a mutant allele on a) the genome or protein level and b) on the function of the protein, if the gene (more precisely, the associated transcript) translates into a protein.
This tool utilizes the Ensembl Perl API.
Mutations classified with this program comprise non-reference base-pair substitutions, and small insertions and deletions of human genomic DNA. All of these mutations can be described by a genomic coordinate and a pair of reference/mutation allele bases, which constitute the input set for this tool. The genomic coordinate does not necessarily need to be a valid dbSNP entry, all coordinates may serve as input.
Primary focus has been put on retrieving immediate consequences of a genomic mutation. Due to alternative splicing, different transcripts of the same gene may have frame shifted exons which translate into a completely different protein sequence. Therefore, it is necessary to analyze the impact of a mutation on a per transcript level. Utilizing the Ensembl database has several implications:
Several tools exist to predict the effect of a mutation on the protein itself. For example, if a protein structure is known, the mutation may render an important residue, such as a substrate binding or active site residue, into non-functional amino acid, implying a severe damage to the protein's function. Two tools accomplishing such predections are available within the Ensembl framework: SIFT and PolyPhen2. In addition, a concensus method, Condel was shown to synergize the two methods and improve predictions. All three approaches are available within our Data Integration framework.
Options applicable to more than a single tool are summarized in common command line options.
The gcoords2cons
genomic coordinate specification for mutations, insertions and deletions is different than Ensembl's Variant Effect Predictor. Also, gcoords2cons
automatically detects on which strand a variation is on by using the reference allele/s provided by the variation itself.
For convenience, here is how genomic coordinates can be specified: [BUILD:]START[-END]
, also see genomic coordinate.
BUILD
is optional, as well as END
if it specifies a single basepair variation. END
must always be >= than START
.
As a general rule, the length of the sequence as specified by END-START+1
should always match the length and position of the reference.
Mutations are specified as follows:
START
should be the position of the mutation itself.END-START+1
should match the number of mutated alleles.Examples:
GRCh37:10:72604246 T G
GRCh37:10:72604246-72604247 TC GA
Deletions can be specified as follows:
START
should match the position of the first reference allele.END-START+1
should match the number of alleles specified in the reference.Examples:
GRCh37:10:72604246 T -
GRCh37:10:72604246-72604247 TC -
72604247
: GRCh37:10:72604246-72604247 TC T
Insertions can be specified as follows:
START
should match the position of the first reference allele.END-START+1
should match the number of alleles specified in the reference.START
directly, moving the original sequence forward by the length of the mutated alleles. Without a reference allele, the variation is assumed to be on the forward strand.Examples:
GRCh37:10:72604246 - T
GRCh37:10:72604246 - TC
Mutations, insertions and deletions can be combined:
GRCh37:10:72604246-72604247 TC A
GRCh37:10:72604246 T CA
72604248
): GRCh37:10:72604246-72604247 TC CAT
The reference alleles are also used to detect on which strand the mutation is happening. This is especially important for insertions, since the shortest insertion form: GRCh37:10:72604246 - T
lacks a reference allele entirely. In that case, the forward (+) strand is always assumed.
There's no limit on how long the reference sequence can be, although a mis-matching reference will also cause gcoords2cons
to assume the forward strand. If the variation has not been specified in minimal form gcoords2cons
also complains about redundant alleles and displays the actual minimal variation as being calculated.
For example, the variation GRCh37:10:72604246-72604249 TCGT TCAT
contains only a single basepair mutation which could be rewritten as GRCh37:10:72604248 G A
. Both forms are accepted and calculated correctly (if the reference sequence matches the long form), though the former will cause a warning.
Both types of analyses mentioned above can be run in either normal or most-severe mode. In the former, all transcripts and related scores, consequences types and scores are output line by line. In the latter case, only the consequence type of the gene with the highest impact (according to Ensembl ranking) is output. Some variations may also result in more than a single consequence having the same (highest) score: in this case, all the consequences having the same score are output too.
Be aware that even if the mutation is on an exon of a gene, it may happen that multiple lines will be printed, since the base at this position on the one hand is part of that gene, but on the other hand, it can also be in the UTR region or the up/downstream region of another gene. This becomes clear when outputting data in detail mode. In very specific cases, exons of different genes overlap, one exon is located on the forward strand, the other is on the reverse strand. In this case, also, the highest impacting consequence type is output.
Intergenic regions, even though labeled "INTERGENIC" in Ensembl, are reported by an empty cell. Otherwise, it is guaranteed that each valid genomic coordinate will have a consequence type. The reference allele is checked and used to determine the strandedness of the mutation. If it matches neither the forward nor the reverse strand, a warning is issued.
The following columns are output when running in detail mode:
Consequence type
Human readable consequence type of the mutation's sequence ontology (SO) term, as provided by Ensembl. Human Ensembl Gene ID
Ensembl gene id for the consequence described in the corresponding line. Transcript ID
Ensembl transcript ID for this gene. Canonical
1 if the Transcript is the canonical transcript ID, 0 otherwise. Transcript database
Database of origin for the transcript. See Mutation consequences for more details. Transcript biotype
Class of transcript. See Mutation consequences for more details. Strand
Transcription direction on DNA. The plus sign (+
) stands for the forward (5' to 3'), the minus sign (-
) signifies the reverse strand. Consequence rank
Ranking of consequences, where lower numbers have higher impact. See the list of GRCh37 Ensembl consequence types (r75) and GRCh38 Ensembl consequence types (r99), respectively. Intron/exon number
This is the intron or exon number of the variant in this transcript/gene. It may happen that the variant overlaps multiple introns/exons. In this case the result is written as A-B
where A
is the first overlapping exon and B
is the last one. Total introns/exons
Total number of introns/exons in this transcript/gene. Location
The location (intron
or exon
) of the variation. Codon change
Return the original and resulting codon that this mutation is predicted to result in, separated with a slash. For mutations in protein coding regions, we utilize the SIFT and PolyPhen2 prediction methods for assigning a numeric score describing the (negative) influence of the mutation on protein function. All scores range from 0 to 1, the closer to score to 1, the more probable its deleterious influence. If the mutation is not a coding region, the empty cell is output in the score column(s).
We provide the following columns:
1-SIFT
This is the inverted original SIFT score. This was done, since we want to have all scores follow the same logics. In a publication by Di et al., the classification of scores is described as listed below:
Score range | Description |
---|---|
1.00-0.95 | Deleterious |
0.95-0.90 | Potentially deleterious |
0.90-0.80 | Borderline case |
0.80-0.00 | Benign |
PolyPhen
The score output by PolyPhen. More details can be found in the PolyPhen Wiki, which is referred to by an email from the Ensembl list.
Condel
The consensus prediction method, taking into account both SIFT and PolyPhen2 scores. A score higher than 0.469 is considered to be deleterious. (Based on their web server output, personal communication with Abel Gonzalez, 2012-10-24.) HGVS is a standard for mutation nomenclature at several levels (DNA/RNA and protein level), which is based on the mutation's transcript.
gcoords2cons
is able to generate HGVS identifiers at both the coding and protein level (which is automatically computed internally) for all the available transcripts in a mutation.
By using the --hgvs
command line two additional columns are appended to the output file:
HGVSc
Coding-level HGVS DNA identifier. HGVSp
More informative protein-level HGVS identifier. Given the input file /tmp/cons.tsv
,
Genomic Coordinate Ref. Allele Mut. Allele GCRh37:1:45111119 C G GRCh37:4:55394539 C T
chris-cmd$ ./gcoords2cons.pl -H -c 1,2,3 –details –most-severe –sift </tmp/cons.tsv
will output the following lines:
Genomic Coordinate Ref. Allele Mut. Allele Consequence type Transcript ID Canonical Transcript biotype Transcript database Human Ensembl Gene ID Strand Consequence rank Intron/Exon number Total introns/exons Location Codon change 1-SIFT GCRh37:1:45111119 C G missense_variant ENST00000355387 1 protein_coding HGNC_transcript_name ENSG00000187147 + 12 12 15 exon caG/caC 0.870 GCRh37:1:45111119 C G missense_variant ENST00000361799 0 protein_coding HGNC_transcript_name ENSG00000187147 + 12 12 15 exon caG/caC 0.870 GCRh37:1:45111119 C G missense_variant ENST00000372247 0 protein_coding HGNC_transcript_name ENSG00000187147 + 12 11 14 exon caG/caC 0.870 GCRh37:1:45111119 C G missense_variant ENST00000443020 0 protein_coding HGNC_transcript_name ENSG00000187147 + 12 11 14 exon caG/caC 0.870 GCRh37:1:45111119 C G missense_variant ENST00000440132 0 protein_coding HGNC_transcript_name ENSG00000187147 + 12 8 10 exon caG/caC 0.880 GCRh37:1:45111119 C G missense_variant ENST00000335497 0 protein_coding HGNC_transcript_name ENSG00000187147 + 12 8 9 exon caG/caC 0.870 GCRh37:1:45111119 C G missense_variant ENST00000372244 0 protein_coding HGNC_transcript_name ENSG00000126106 - 12 2 3 exon Ctg/Gtg -- GRCh37:4:55394539 C T -- -- -- -- -- -- -- -- -- -- -- -- --
and the following to stderr:
gcoords2cons.pl WARN: ENOTFOUND(GRCh37:4:55394539): couldn't compute consequences for coordinate ``GRCh37:4:55394539''