Data Integrator
HGMDAnnotator - Retrieve information from HGMD (Human Gene Mutation Database)

This tool annotates individual variations, genes, and chromosomal regions with curated data from a CBM MySQL installation of the commercial Human Gene Mutation Database (HGMD). The current implementation is driven by the use cases that we have identified in our current applications of the HGMD database:

  1. genetic variation data will originate from next generation sequencing (NGS) runs, with mainly genome positions to be queried/compared.
  2. gene prioritization methods will need to find out whether genes are associated with diseases and, if so, which ones.
  3. given a disease, it should be possible to retrieve the set of genes and variations that have been identified so far.

The major focus of HGMD is mutations that can be associated with genes and diseases. Therefore, each of the roughly 134,000 mutation entries present in the HGMD release (2012.04) is associated with both a gene and a disease, with only a few exceptions. Gene information is provided as HGNC symbols and Entrez gene identifiers. Disease terms are provided as they are found in the respective publication, i.e., there can be different entries for the same disease due to different spellings or abbreviations. For example, "Parkinson disease", "Parkinsonism", and "Parkinson's disease" are all used in HGMD. In a later version of the tool, MeSH and OMIM terms will be used for assigning diseases, but this is currently not fully implemented. HGMD is updated quarterly.

Mutations in HGMD are grouped into one of the following categories:

  1. Mutation (M)
  2. Splice (S)
  3. Regulatory (R)
  4. Deletion (D)
  5. Insertion (I)
  6. Indel (X)
  7. Grodel (G)
  8. Grosins (N)
  9. Complex Rearrangement (P)
  10. Amplet/Repeat (E)

HGMD provides chromosomal coordinates only for the first six categories (M, S, R, D, I, and X). The other, more complex mutations (G, N, P, E) are described via free text and can therefore only be mapped to a gene (as identified via its ID), but not to a chromosomal region.

Another categorization of the mutation data in HGMD is done with respect to their evidence level and disease association:

  • DM and DM?: disease-causing mutations. A trailing question mark indicates that the curator had some concerns regarding the association.
  • DP: disease-associated polymorphisms
  • DFP: disease-associated polymorphisms with additional supporting functional evidence
  • FP: in vitro or in vivo functional polymorphisms
  • FTV: polymorphic or rare variants

HGMD is currently based on hg19 coordinates, whereas in the Dinthor world, coordinates are based on GRCh37. As both builds should in theory be identical, we handle them interchangeably. That is, input in the GRCh37 format (the only format accepted by the tool for chromosomal positions and ranges) is queried against the hg19-based database, and the resulting region coordinates, which the tool can optionally report, are given in the GRCh37 build again. From release 2015.02 onwards, HGMD will also provide the new GRCh38 build.

Input

The HGMDAnnotator is a command-line tool that operates on tabular data. The input file has to be specified with the command line option –from-file. As with other tools, it is also possible to read from the standard input via –from-file - <<<$'Something'. Based on the scenarios mentioned above, the tool accepts the following types of input. The input type is selected by the –in flag and the respective columns in the input file by the -c flag:

  • –in gcoords: a unique location on the genome that describes a variation (e.g., a SNP).
  • –in range-rel and –in range-abs: a range spanning a region on a chromosome.
  • –in entrez: an NCBI Entrez gene identifier (e.g. 1212).
  • –in hgmd: a stable HGMD identifier (e.g. CM080001).
  • –in umls: a Unified Medical Language System (UMLS) identifier (e.g. C0030567).

With each of the supported inputs comes a different way to query for mutations associated with genes or other genomic regions:

  • The first case is given by the option –in entrez and a single column via the -c option. Valid Entrez gene identifiers consist only of digits and can be retrieved from the NCBI Entrez Portal. For example, 5071 is the Entrez gene ID of the Park2 gene.
  • The second case is input using the –in range-abs option, with two columns (start and end genomic coordinate) for the -c option. This will search for all mutations between the start (column 1) and end coordinate (column 2), where the end coordinate has to be greater than or equal to the start coordinate.
  • The third case is defined by the –in range-rel option and expects three columns (genomic coordinate, signed distance in bp to the start of the gene, signed distance to the end), where a positive number denotes an advance on the forward strand.
  • The fourth case is given by the option –in umls and a single column via the -c option. This will return all mutations annotated to the disease concept identified by this Unified Medical Language System (UMLS) identifier. Note that this query might also report diseases that are not part of the disease concept: if a mutation is associated with multiple diseases, only one of them has to be contained in the concept. In order to filter out those entries, a disease filter has to be used.
  • The last case is given by the option –in hgmd and a single column via the -c option. This will return only the particular mutation identified by this stable HGMD identifier. Note that multiple diseases can be associated with a single HGMD identifier, as additional evidence from the literature might have been added to HGMD after the initial entry was stored.

Coordinates have to be provided in the genomic coordinate format. Note that the coordinate build is treated in a case-sensitive fashion, that is, GRCh37:1:1-100 is a valid 100bp range on chromosome 1, while GRCH37:1:1-100 or grch37:1:1-100 are not, due to their "invalid" writing of GRCh37.
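The case-sensitive build check can be illustrated with a small sketch. It is not part of the HGMDAnnotator; the function name parse_range and the assumed layout build:chromosome:start-end are inferred solely from the example GRCh37:1:1-100 above, and the full genomic coordinate format may allow more variants than this pattern covers.

```python
import re

# Assumed layout, inferred from the example GRCh37:1:1-100:
# build, chromosome, and a start-end range, separated by colons.
RANGE_PATTERN = re.compile(
    r"^(?P<build>[^:]+):(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$"
)

ACCEPTED_BUILD = "GRCh37"  # compared case-sensitively, as described above


def parse_range(text):
    """Return (chromosome, start, end) or raise ValueError."""
    match = RANGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a build:chromosome:start-end range: {text!r}")
    if match.group("build") != ACCEPTED_BUILD:
        # GRCH37 and grch37 end up here because the comparison is exact.
        raise ValueError(f"unsupported build spelling: {match.group('build')!r}")
    start, end = int(match.group("start")), int(match.group("end"))
    if end < start:
        raise ValueError("end coordinate must not be smaller than start")
    return match.group("chrom"), start, end
```

Running such a check over the coordinate column before calling the tool can catch misspelled builds early, since parse_range("GRCH37:1:1-100") raises a ValueError while parse_range("GRCh37:1:1-100") succeeds.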

For a unique variation position, the tool expects a genomic coordinate and optionally the reference and the variant alleles, indicated by the –in gcoords option and either one or three column numbers specified by the -c option. Based on this information, it can be decided if exactly this mutation is linked to a disease (by setting the option –exact-ref-alt-match) or if the location itself is linked (default option if the flag is omitted).

Note that complex mutations (categories G, N, P, and E described in the introduction), which are not assigned to chromosomal regions in HGMD, can only be retrieved when querying by Entrez gene, HGMD, or UMLS identifiers. Queries for genomic regions will not return them, as they have no valid coordinates associated.

For all input types, we can find out if there is an association with any disease or with a specific disease. Filtering for specific diseases can be done by the flag –filter-disease, which accepts a list of HGMD disease terms, or –panel-file, which accepts a file containing HGMD disease terms. Note that the tool expects the exact HGMD disease terms as they are present in the database. Since terms can change even between HGMD releases, it is advisable to make sure they are correct before using them for filtering.

To determine valid HGMD disease terms, one can perform a wildcard search via the –query-disease-term option. If multiple search terms are used, they are by default used to further restrict the search. For example, –query-disease-term parkin would return all parkin-related HGMD disease terms, such as “Dystonia-parkinsonism, X-linked”, whereas –query-disease-term parkin dement would restrict to those containing both fragments, parkin and dement, such as “Subcortical dementia and parkinsonism”. By adding the –combine-terms option, which accepts the logical operators AND or OR, this default AND behavior can be changed. In our example, adding –combine-terms OR would return all disease terms that contain either parkin or dement.
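The AND/OR combination of search fragments can be sketched as follows. This is an illustration of the described behavior only, not the tool's implementation: the function name match_terms is hypothetical, and case-insensitive substring matching is an assumption based on the examples above.

```python
def match_terms(disease_terms, fragments, combine="AND"):
    """Select disease terms containing all (AND) or any (OR) of the fragments."""
    if combine not in ("AND", "OR"):
        raise ValueError("combine must be 'AND' or 'OR'")
    decide = all if combine == "AND" else any
    # Assumption: fragments match as case-insensitive substrings.
    lowered = [fragment.lower() for fragment in fragments]
    return [
        term
        for term in disease_terms
        if decide(fragment in term.lower() for fragment in lowered)
    ]


# Three terms taken from this page; a real HGMD release contains many more.
terms = [
    "Dystonia-parkinsonism, X-linked",
    "Subcortical dementia and parkinsonism",
    "Parkinson's disease",
]

print(match_terms(terms, ["parkin", "dement"]))        # AND: both fragments required
print(match_terms(terms, ["parkin", "dement"], "OR"))  # OR: one fragment is enough
```

With these three terms, the AND query keeps only the term that contains both fragments, while the OR query keeps all three because each contains parkin.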

The preferred way to retrieve information about diseases stored in HGMD is to query for Unified Medical Language System (UMLS) disease concepts via –query-disease-concept. A disease concept is a collection of multiple names that all refer to the same disease. The basis for this unification is UMLS, which itself combines ontologies like ICD-10 or MeSH. By default, the tool only reports the UMLS ID of the matching concepts, e.g., –query-disease-concept morbus parkinson would return a single UMLS ID. Adding the –hgmd-disease-term option to the query yields another column listing the matching disease names, which would be the single term Morbus Parkinson in our example. To get all available names for a particular disease concept (and not just those matching the query string), –query-disease-concept has to be used with the concept identifier as input: –query-disease-concept C0030567 –hgmd-disease-term would return all 80 names for Parkinson's disease (amongst them Morbus Parkinson).

If the disease names resulting from a disease term or disease concept query are stored in a text file, they can be used as a filter for other queries via the –panel-file option. Note that by default –combine-terms is set to the logical AND, meaning that all the provided terms have to be present in a disease. To limit an HGMD query to only those diseases present in a panel file, –combine-terms thus has to be set to OR.

Another, although no longer recommended, way of disease filtering is to use predefined disease sets via the flag –filter-disease-set. Disease sets are collections of HGMD disease terms that have been defined by human curators. The disease sets that are currently available and the disease terms they contain can be retrieved by the option –list-disease-sets. Note that disease sets are not directly provided by HGMD, but have to be added manually to the database. Changing disease sets therefore currently requires your local HGMD administrator. The recommended way of filtering for certain sets of diseases is to create them via a disease concept query and to use the resulting disease names via a panel file.

A simple example illustrates the difference between filtering by exact disease terms/panel files and by disease sets. When querying for a region or a gene that is known to be associated with Parkinson's disease (PD), the option –filter-disease Parkinson would return no mutations, as HGMD does not use the exact disease term "Parkinson". (Note that the search is case-insensitive, i.e., Parkinson, PARKINSON, or parkinson would all be treated equally and return no result.) Filtering with the option –filter-disease "Parkinson's disease" would return all PD mutations that are annotated with exactly this term, but would, for instance, still filter out those mutations associated with “Dystonia-parkinsonism, X-linked”. Only the option –filter-disease-set Parkinson would retrieve all Parkinson-related mutations, as this disease set contains around 60 HGMD disease terms related to Parkinson. The same result could be obtained by first querying for all Parkinson disease terms via –query-disease-term parkin, storing the resulting list in a panel file, and then performing a second query with the options –panel-file file.tsv –combine-terms OR.

The above-mentioned –combine-terms flag can also be used in combination with the –filter-disease and –filter-disease-set flags in order to change their default combination behavior. Using –filter-disease-set parkinson alzheimer –combine-terms OR, we can filter for mutations that are associated with either Alzheimer's or Parkinson's disease.

Upon omission of the –filter-disease and –filter-disease-set options, any association of the input with a disease term is reported in the output.

The parameters –filter-category and –filter-association can be used to restrict the results to certain mutation categories and disease associations. The query –filter-category M I D –filter-association DM would only retrieve mutations, insertions, and deletions that are marked as disease-causing.

The actual data which should be added to the input file can be selected via additional flags. If none of the following flags is used, an error is issued, as in this case no column would be added to the input file.

  • –pubmed: adds three columns specifying the publication that first identified the mutation. The columns specify the PubMed identifier (present in the vast majority of cases), the last name of the first author, and the publication year.
  • –gene: adds two columns specifying the gene symbol and the Entrez Gene identifier of the gene that is associated with an HGMD entry.
  • –gcoords: adds a column with the genomic coordinates of the mutation in genomic coordinate format.
  • –strand: adds a column with the strand of the mutation. "+" denotes the forward/plus strand, "-" the reverse/minus strand.
  • –ref-alt: adds two columns with the reference and the alternate allele as they are present in the HGMD database, in Dinthor format. Note that HGMD always uses the strand that contains the disease gene associated with the current mutation as reference. This can cause issues with methods that always refer to the forward strand as reference, see the –to-forward-strand option.
  • –to-forward-strand: if the mutation is on the reverse/minus strand, this option inverts the strand, the reference allele, and the alternate allele.
  • –mutation-type: adds a column with further information on the mutation category and the disease association (see Introduction).
  • –omim: adds a column with the OMIM identifier of the gene that is associated with the current entry.
  • –hgmd-disease-term: adds the HGMD disease term as a separate column. If HGMD curators had concerns regarding a disease association, they indicated this by a trailing question mark in the disease name.
  • –hgmd-acc: adds the HGMD accession number of the database entry, which should be stable between individual HGMD releases.
  • –hgmd-desc: adds four columns that describe the HGMD entry in more detail. Two columns contain free-text descriptions of the mutation and details on the publication from which they have been derived, and two columns provide HGVS representations of the mutation entry.
  • –umls: adds the Unified Medical Language System identifier of the disease to the output.
  • –count: adds the number of results under the currently requested flags, as the counts vary depending on the information that is requested. For instance, the query –in entrez -c 1 –from-file - <<<$'5071\n1212' –count would add a column containing a 1 for 5071 and an empty cell for 1212, indicating that there is some information in HGMD for Park2 (5071) but nothing for CLTB (1212). Adding –hgmd-disease-term to this query would increase the count for 5071 to 18, as 18 unique disease terms are reported in HGMD. Adding –pubmed instead would report 110 results, as HGMD contains 110 publications that report some data for Park2.

Like all other tools in the Dinthor world, the HGMDAnnotator outputs information in a normalized form. That is, if a gene has five mutations in HGMD, five lines will be written. This default behavior can be altered by adding the –collapse flag. The collapsed output has the same number of lines as the input, i.e., all mutations found for an input line are reported in one line. This can result in very long lines.

The option –header-suffix can be used to add a particular string to the end of each HGMD column label. This option can be useful if multiple runs of the HGMDAnnotator are performed, for instance, using different disease sets.

Additional options applicable to more than a single tool (such as –help, –data-version, –version, or –permissiveness) are summarized in common command line options.

Output

Based on the selected flags described above, the output will contain columns labeled:

  • Gene symbol, Entrez Gene ID (flag –gene)
  • PMID, Author(s), Publication Year (flag –pubmed)
  • Genomic coordinates (HGMD) (flag –gcoords)
  • Strand (HGMD) (flag –strand)
  • Ref (HGMD), Alt (HGMD) (flag –ref-alt)
  • OMIM ID (HGMD) (flag –omim)
  • HGMD disease term (flag –hgmd-disease-term)
  • HGMD accession (flag –hgmd-acc)
  • HGMD mutation type (flag –mutation-type)
  • HGMD description, HGVS, HGVS all, HGMD comments (flag –hgmd-desc)
  • UMLS ID (flag –umls)
  • HGMD distinct count (flag –count)

If the option –header-suffix is used, the string that was provided to this option will be added to the end of each column label.

Mutation types are written as value pairs a:b, where a is the code specifying the mutation category (e.g. M, I, D) and b is the disease association (e.g. DM, DP). Possible values for both are listed in the Introduction. Examples are M:FP for a functional polymorphism or I:DM? for a potentially disease-causing insertion.
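When post-processing the output, such a value pair can be split back into its two parts. The following is a minimal sketch under the assumption that the only separator is a single colon and that the codes are exactly those listed in the Introduction; the function name parse_mutation_type and both lookup tables are illustrative and not provided by the tool.

```python
# Codes as listed on this page; the tables are illustrative, not read from HGMD.
CATEGORIES = {
    "M": "Mutation", "S": "Splice", "R": "Regulatory", "D": "Deletion",
    "I": "Insertion", "X": "Indel", "G": "Grodel", "N": "Grosins",
    "P": "Complex Rearrangement", "E": "Amplet/Repeat",
}
ASSOCIATIONS = {"DM", "DM?", "DP", "DFP", "FP", "FTV"}


def parse_mutation_type(value):
    """Split 'a:b' into (category code, association code, uncertain flag)."""
    category, separator, association = value.partition(":")
    if not separator:
        raise ValueError(f"expected 'category:association', got {value!r}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown mutation category: {category!r}")
    if association not in ASSOCIATIONS:
        raise ValueError(f"unknown disease association: {association!r}")
    # A trailing question mark marks an association the curator had doubts about.
    uncertain = association.endswith("?")
    return category, association.rstrip("?"), uncertain
```

For example, parse_mutation_type("I:DM?") yields the category I, the association DM, and an uncertain flag set to True.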

If no value is present in the HGMD database for a certain column, an empty cell will be written. The two exceptions to this rule are the Ref and Alt columns, where - will be written if no value is present. This exception only applies if a genomic coordinate and either the reference or the alternate allele are present, e.g., in the case of an insertion or deletion. If no coordinate is present for a (complex) mutation, the Ref and Alt columns will also contain an empty cell.

If the –collapse flag has been set, the output will be condensed into the input line, i.e., no additional lines will be added to the input. If multiple values are present as a result of this, they will be joined by a "|". Note that this can result in very lengthy lines, for instance, if hundreds of disease terms or genomic coordinates are reported.
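The relation between normalized and collapsed output can be sketched in a few lines. This only mimics the described behavior: the function name collapse is hypothetical, the sketch assumes that every input line is unique, and whether the tool removes duplicate values before joining is not stated here, so all values are kept.

```python
def collapse(rows, n_input_columns, separator="|"):
    """Merge normalized rows that share the same input columns into one row."""
    grouped = {}  # dicts preserve insertion order, so the input order is kept
    for row in rows:
        key = tuple(row[:n_input_columns])
        annotations = row[n_input_columns:]
        if key not in grouped:
            grouped[key] = [[] for _ in annotations]
        for values, value in zip(grouped[key], annotations):
            values.append(value)
    return [
        list(key) + [separator.join(values) for values in columns]
        for key, columns in grouped.items()
    ]


# Placeholder annotations ("term A", "term B"), not real HGMD content.
normalized = [
    ["5071", "term A"],
    ["5071", "term B"],
    ["1212", ""],
]
print(collapse(normalized, n_input_columns=1))
```

Here the two normalized lines for 5071 become a single line with the joined cell term A|term B, and the line for 1212 keeps its empty cell.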