Data Integrator
Release history


Integrates Ensembl 105.

  • (data) ClinVar updated to 2021-12-04.
  • (data) DrugBank updated to V5.1.8.
  • (data) HGNC aliases from 2021-12-10.
  • (data) UniProt data from 2021-12-10.
  • (data) FlyBase CGID mapping from FB2021_05.
  • (data) iRefIndex updated to v18.
  • (data) Reactome updated to v78.


Integrates Ensembl 100.

  • (data) ClinVar updated to 2020-06-02.
  • (data) DrugBank updated to V5.1.6.
  • (data) HGNC aliases from 2020-06-04.
  • (data) UniProt data from 2020-06-04.
  • (data) VDRC dataset as of 2020-06-04.
  • (data) FlyBase CGID mapping from FB2020_02.


  • Integrates Ensembl 95.
  • HGMD support dropped (license no longer valid).
  • (data) ClinVar updated to 2019-01-18.
  • (data) DrugBank updated to V5.1.2.
  • (data) HGNC aliases from 2019-01-18.
  • (data) UniProt data from 2019-01-18.
  • (data) VDRC dataset as of 2019-01-18.
  • (data) FlyBase CGID mapping from FB2018_06.


  • Perl API update to post e75 APIs.
  • Integrates Ensembl 84 and 88 data.
  • (data) ClinVar updated to 2016-07-05 (v-00), 2017-01-30 (v-01), 2017-05-01 (v-02).
  • (data) HGMD updated to 201602 (v-00), 201701 (v-02).
  • (data) iRefIndex updated to v14.
  • (data) GO updated to 2016-07 and left at this version, as the MySQL DB is discontinued.
  • (data) DrugBank updated to V5.0.1 (v-00), V5.06 (v-02).
  • (data) Reactome updated to version 57 (v-00), 59 (v-01), 60 (v-02).
  • (data) Update to PharmaADME data file to fit GRCh38.
  • (data) HGNC aliases from 2016-08-02 (v-00), 2017-04-21 (v-02).
  • (data) UniProt data from 2016-08-02 (v-00), 2017-04-21 (v-02).
  • (data) LD haplotype blocks updated to 1000 Genomes phase 3.
  • (data) Functional similarity data in sync with GO 2016-07.


  • GO-based gene enrichment tool.
  • Improved performance of InteractionAnnotator by querying MySQL database.
  • Improved MetaRanker scoring: it now distinguishes between 0 and 1 hits.
  • (data) ClinVar updated to version from 2015-03-03.
  • (data) HGMD updated to 2014.04.
  • (data) UniProt accession to TrEMBL entry name mapping updated to 2015_01.


  • Tool to perform gene prioritization with Endeavour-like functionality.
  • GOFunSim supports pairwise GO term functional similarity computation and information content query for a single GO term.
  • snp2gcoords filters for standard chromosome names.
  • (data) ClinVar updated to 2014.09.29.
  • (data) HGMD updated to 2014.03.
  • (data) Reworked the data version data sheet: highlighting now covers every release from the first one on, with thorough syntax highlighting applied to all releases.


  • Tool to integrate Reactome data.
  • Tool to compute functional similarity for pairs of proteins.
  • Tool to annotate proteins with GO terms.
  • Consistent permissiveness model implemented in all Python command line tools.
  • Semiautomated framework for unit testing in Galaxy.
  • (data) ClinVar updated to 2014.06.04.
  • (data) HGMD updated to 2014.02.
  • (data) GO database as of 2014.07.
  • (data) Update of DrugBank to version 4.1.
  • (data) Functional similarity based on GO annotations from 2012-01 and 2014-07.


  • Tool to convert columns to genomic coordinates and vice versa.
  • Tool to assign LD blocks to genomic coordinates.
  • Tool to map genomic intervals to genes.
  • Tool for Ensembl regulatory regions retrieval by interval queries.
  • Galaxy version 2014.02.10, using Python 2.7.
  • (data) HGMD updated to 2013.04.
  • (data) ClinVar updated to 2014.02.11.
  • (data) iRefIndex v13, 2013.12.


  • Python commands consistently report options for adding columns in the order in which the columns are added to the output.
  • snp2gcoords retrieves validation states for SNPs, indicating where the SNP is known. This helps to find out whether a SNP is recorded in 1000 Genomes or in HapMap, for example.
  • Computes LD for pairs of dbSNP entries.
  • DrugBankAnnotator returns pharmacological information on a protein basis.
  • InteractionAnnotator finds proteins (or their encoding genes) that interact with given proteins.
  • ClinVar functionality refactored, separate data object now.
  • Orthology module produces output ordered by gene ID, like almost all other modules.
  • Variant class refactored, unit test for it.
  • LD block ID is now chromosome:blockID instead of a unique integer.
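LD between two biallelic variants is commonly summarized as r^2 or D'. The following is a minimal sketch of the textbook definitions, not the Data Integrator implementation; the function name and its arguments are hypothetical, and the haplotype and allele frequencies are assumed to be known already (for example, estimated from a reference population).

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab -- frequency of the haplotype carrying allele A and allele B
    p_a  -- frequency of allele A at the first locus
    p_b  -- frequency of allele B at the second locus
    (All names are illustrative assumptions, not Data Integrator API.)
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")

    # Raw disequilibrium: observed minus expected haplotype frequency.
    d = p_ab - p_a * p_b

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))

    d_prime = abs(d) / d_max
    r2 = (d * d) / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2
```

With perfectly coupled alleles, for example `ld_measures(0.5, 0.5, 0.5)`, both D' and r^2 equal 1; with independent loci, for example `ld_measures(0.25, 0.5, 0.5)`, both are 0.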


  • ClinVar integration.
  • PharmaADME integration.
  • Improved functionality of gcoords2snp for insertions and deletions.
  • snp2gcoords supports output of reference/alternative alleles and strand.
  • Chromosome check when converting genomic coordinates.
  • Mendelian filter program allows skipping variant calls that are not completely covered by all samples. New option for checking whether any inheritance model complies with the variant.
  • HGMD updated to 2013.02.
  • HGMDAnnotator permits PubMed ID output optionally.
  • HGMDAnnotator allows collapsing and counting of cells instead of normalized output.
  • Bug fix in Galaxy interface for MendelianFilter autosomal recessive/dominant swapping.
  • Introduced internal sorting for entities with the same rank in gcoords2cons, since Perl started to behave non-deterministically.
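The internal sorting for equally ranked entities can be illustrated with a secondary sort key. This is only a sketch of the general idea in Python (gcoords2cons itself is a Perl tool), and the record fields `rank` and `transcript_id` are hypothetical names chosen for the example.

```python
def order_consequences(records):
    """Sort records by rank, breaking ties by transcript ID.

    Without the secondary key, records with equal rank would keep
    whatever order the input happened to have, which is not
    reproducible if that input order varies between runs.
    """
    return sorted(records, key=lambda rec: (rec["rank"], rec["transcript_id"]))


# Two records share rank 1; the tie is resolved by transcript ID.
example = [
    {"rank": 1, "transcript_id": "T2"},
    {"rank": 2, "transcript_id": "T3"},
    {"rank": 1, "transcript_id": "T1"},
]
ordered_ids = [rec["transcript_id"] for rec in order_consequences(example)]
```

Any total order on the tie-breaking field works; the point is that the output no longer depends on the iteration order of the underlying container.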


  • Extract exon coordinates of a gene and related information: gene2canonexons.
  • MetaRanker method for re-ranking genes based on scores from multiple individual prioritization methods.
  • Characterization of regions with high sequence conservation across genomes.
  • Variation to LD block to gene condensing. (The functionality is provided, but without a command line tool.)
  • Python module for gene related data, used by overlap calculations.
  • Installation of in-house Ensembl web mirror, allowing upload of in-house data to the internal mirrored database.
  • Start of scipy integration and removal of rpy2.
  • Updated 1000 Genomes LD block data file.
  • LD block module operates on static R-tree data structure.
  • SO consequence terms output by gcoords2cons.


  • Improved unit test environment: it can run tests in parallel and uses the latest data version listed in the config file. The nightly unit tests make use of this.
  • Ensembl Gene ID to Ensembl Transcript ID added to HSEnsgProteinMapper.
  • LD blocks (SNPGeneGlobalLDChecker) outputs stable block IDs.
  • Experimental version of Mendelian filter for called variants.
  • Exact mappings for Ensembl transcript IDs to either Ensembl protein IDs or NCBI CCDS IDs and vice versa in HSEnsgProteinMapper.
  • Added gcoords2snp to map genomic coordinates back to dbSNP ids.
  • gcoords2cons --summary has been renamed to --most-severe and now outputs all the variations having the (same) highest rank, not only the first.
  • gcoords2cons --rank has been removed and fused into --details.
  • gcoords2cons emits the 'Transcript ID' column also in non-detailed mode.
  • gcoords2cons adds a new 'Canonical' column in --details mode to indicate which transcript is the canonical transcript.
  • gcoords2cons reports the intron or exon location, number and total along with the variation if it is within a gene.
  • gcoords2cons now checks the variation for consistency more thoroughly, emitting more detailed warnings.
  • gcoords2cons handles insertions correctly.
  • GalaxyInstall now allows referencing external documentation via automatic keyword substitution. The tool help in Galaxy has been stripped to a minimum and now references the full tool documentation generated by doxygen.
  • snp2gcoords now removes failed SNPs by default, and accepts a --inc-failed switch to mimic gcoords2snp behavior, which also appends a new "Failed" column in the output to indicate the current state.
  • GNF Gene Atlas based gene expression module.
  • Tool for converting VCF files to Data Integrator tab separated files.
  • HGMD access module querying a local MySQL database based on variation, gene, or genome location data.
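The behavior described above for the most-severe mode of gcoords2cons, reporting every variation that shares the top rank rather than only the first, can be sketched as follows. This is not the tool's code: the tuple layout is hypothetical, and the sketch assumes that a lower rank number means a more severe consequence.

```python
def most_severe(consequences):
    """Return all entries that share the best rank, in input order.

    consequences -- list of (rank, transcript_id, term) tuples
                    (hypothetical layout; lower rank = more severe
                    is an assumption of this sketch)
    """
    if not consequences:
        return []
    best = min(rank for rank, _, _ in consequences)
    return [entry for entry in consequences if entry[0] == best]


# Hypothetical ranks and transcript IDs, for illustration only.
calls = [
    (2, "T1", "missense_variant"),
    (1, "T2", "stop_gained"),
    (1, "T3", "stop_gained"),
    (5, "T4", "intron_variant"),
]
top = most_severe(calls)
```

Here `top` contains both `stop_gained` entries, whereas a "first match only" strategy would have dropped the entry for `T3`.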


  • LD for pairs of dbSNP/gene based on a HapMap or 1K Genomes population. Output can be r^2 or D' scores.
  • SIFT/PolyPhen/ConDel scores added to gcoords2cons.
  • Module for intelligently joining two tables, no sorting needed like in the shell equivalent (TableJoiner).
  • LD-blocks checker updated with 1000 Genomes data.
  • LD-blocks checker can now be used on different datasets (--list-datasets/--dataset).
  • Drosophila melanogaster gene converter module.
  • Extensive online documentation coming with Galaxy.
  • New --rank option to output a numeric rank for Ensembl consequence types in gcoords2cons.
  • New --rank option to rank genes by distance to a SNP in gcoords2genes.
  • Automated unit test running on our development server with the newest nightly builds.
  • Heavily refactored unit test suite, which is now highly flexible and user-friendly.
  • Preparation for the Data Integrator architecture to be used in other, similar projects.
  • New Transcript ID column added to gcoords2cons.
  • 'Ensembl Gene ID' column renamed to 'Human Ensembl Gene ID' in both gcoords2cons/gcoords2genes.
  • Every night more than 320 unit tests are automatically run on the current snapshot of the SVN repository to ensure code integrity.
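The TableJoiner entry above notes that no sorting is needed, unlike the shell `join` command, which expects both inputs sorted on the join field. One common way to achieve this is a hash join. The sketch below shows that technique under assumptions of its own: the function name, the list-of-lists table layout, and the inner-join semantics are illustrative and not taken from TableJoiner.

```python
def join_tables(left_rows, right_rows, left_key=0, right_key=0):
    """Inner-join two tables (lists of lists) on the given column indexes.

    An in-memory index of the right table is built first, so neither
    input has to be sorted. Output follows the row order of the left table.
    """
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            # Append the right-hand columns, skipping the duplicate key column.
            extra = [cell for i, cell in enumerate(match) if i != right_key]
            joined.append(list(row) + extra)
    return joined


# Unsorted inputs with a repeated key on the right-hand side.
left = [["g2", "x"], ["g1", "y"], ["g3", "z"]]
right = [["g1", "A"], ["g2", "B"], ["g2", "C"]]
result = join_tables(left, right)
```

The trade-off is memory: the right-hand table is held in a dictionary, which is fine for typical annotation tables but may not suit very large inputs.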


  • LD-blocks checker for SNP-gene pairs based on HapMap2 genome data.
  • Added details mode (--details) in gcoords2genes.
  • Only report unique mappings (when requested) in both snp2gcoords and liftgcoords.
  • Genomic-coordinate to variations utility (gcoords2cons).


Second release of the Data Integrator:

  • Fully tested Perl API/framework compatible with Python infrastructure.
  • Split documentation between Perl/Python.
  • Split test-suite between Perl/Python.
  • Makefile for building documentation and running tests for both environments.
  • Whitespace is now ignored in both Perl/Python tools when performing lookup.
  • Implemented snp2gcoords in Perl.
  • Implemented gcoords2genes in Perl.
  • Implemented liftgcoords in Perl.


First release of the Data Integration project. One of the strengths of this framework is the precise documentation of data sources, which allows for reproducibility independent of time. Each release is tied to a set of data files which are used by the Data Integrator tools. Data set files will never be deleted and thus an analysis from the past can be reproduced at any time in the future. Furthermore, analyses may be redone with newer, up-to-date data sets.

  • Human gene converter: Interchangeable conversion between gene IDs from Ensembl, Entrez, HGNC (primary and aliases).
  • Human gene to protein mapper: This tool maps Ensembl gene IDs from human to protein IDs from UniProt (SwissProt and TrEMBL are kept separately) and Ensembl.
  • Ortholog mapper: Finds orthologs between human genes and genes from mouse, worm, and fruit fly.
  • Table conversion tools. Import of csv files with prudent handling of special characters. Includes a tool for splitting cells and merging them again at a later point.
  • The package has been tested under Python 2.6, 2.7, and 3.2. On Python 2.6, the argparse library needs to be installed separately.

This functionality is provided on multiple levels:

  1. As classes in Python.
  2. As command line programs interfacing the Python classes.
  3. As modules in the Galaxy web server framework, interfacing the command line programs.
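The first two levels can be pictured with a small sketch: a class that holds the logic and a thin command line wrapper around it. Everything here is hypothetical (class name, function name, and the one-entry mapping are made up for illustration and are not the Data Integrator API), and the code targets current Python 3 rather than the versions listed above. The third level, a Galaxy tool wrapper that calls the command line program, is not shown.

```python
import argparse


class GeneConverter:
    """Level 1: a class holding the conversion logic (illustrative only)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def convert(self, gene_id):
        # Returns None when the ID is unknown.
        return self.mapping.get(gene_id.strip())


def main(argv=None):
    """Level 2: a command line program interfacing the class."""
    parser = argparse.ArgumentParser(
        description="Convert Ensembl gene IDs to symbols (sketch)."
    )
    parser.add_argument("ids", nargs="+", help="gene IDs to convert")
    args = parser.parse_args(argv)

    # A real tool would load its mapping from a versioned data file.
    converter = GeneConverter({"ENSG00000139618": "BRCA2"})
    results = [converter.convert(gene_id) for gene_id in args.ids]
    for gene_id, symbol in zip(args.ids, results):
        print(f"{gene_id}\t{symbol}")
    return results
```

Calling `main(["ENSG00000139618"])` prints one tab-separated line and returns the converted values; in a script, `main()` would be called under an `if __name__ == "__main__":` guard so that it reads the arguments from the command line. Keeping the logic in the class means the command line program and any web front end share the same code path.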