Data Integrator Tool Suite

Introduction

Welcome to the Center of Biomedicine (CBM) at EURAC research. This is the home page of the Data Integration (Dintor) tool suite with more than thirty modules ready for use by bioinformaticians and biologists working in genomics research.

Data emerging from genome-wide association (GWA) studies and next-generation sequencing (NGS) technologies provide a wealth of information ready to be used by the scientific community. These large data sets form the basis for further analysis tailored to the individual researcher's focus. Large-scale processing by non-bioinformaticians, however, is hampered by the way these data sets are stored. For example, finding the protein-coding gene closest to a location encoded by a dbSNP identifier from a GWA result table can be done for a few SNPs of interest using a genome browser, but the task becomes arduous once more than a handful of such entries have to be queried.

We have therefore developed Dintor, a suite of tools that facilitates working with GWA and NGS data. Beyond this goal, the framework offers modules for high-level functional annotation of genes and gene products, such as gene set prioritization, functional similarity of proteins, or clinical significance of variation data. Each of these tools has been designed to perform a basic task independently. The real power of the tool suite becomes apparent once these tools are combined into a pipeline to accomplish a complex analysis.
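
The closest-gene lookup used as a motivating example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dintor's implementation: the gene coordinates, gene names, and SNP identifiers are hypothetical toy values on a single chromosome, and a simple linear scan stands in for whatever indexing a real tool would use.

```python
# Hypothetical toy annotation: (start, end, gene name) on one chromosome.
# Real coordinates would come from an annotation source such as Ensembl.
GENES = [
    (100, 200, "GENE_A"),
    (500, 650, "GENE_B"),
    (900, 1000, "GENE_C"),
]


def distance(pos, start, end):
    """Distance from pos to the interval [start, end]; 0 if pos lies inside."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def closest_gene(pos, genes):
    """Return (gene name, distance) for the gene closest to pos (linear scan)."""
    start, end, name = min(genes, key=lambda g: distance(pos, g[0], g[1]))
    return name, distance(pos, start, end)


def annotate(snps, genes):
    """Map each SNP identifier to its closest gene and the distance to it."""
    return {snp_id: closest_gene(pos, genes) for snp_id, pos in snps.items()}
```

Calling `annotate` with a dictionary of SNP identifiers and positions handles any number of entries in one pass, which is the step that becomes tedious when done by hand in a genome browser.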

The hallmarks of our approach are:

Modularity. A complex analysis is broken down into small, manageable units, each of which can be processed by one of the available Dintor tools. The tools perform specific tasks and have been designed to cover a wide range of frequently encountered application scenarios in genome research, from simple problems like gene identifier conversions to complex calculations like functional similarity of gene products and gene prioritization.
Data versioning. Genomic and proteomic databases are updated regularly at short intervals. A data set from today may already be out of date a month from now. It is laborious to keep track of the exact dates and versions of all data sets used for a given analysis. We address this issue of constantly changing data sets by systematically tying tools to their underlying data sets and keeping versioned copies of all these data. Thus it is always clear which data were used in a given analysis.
Reproducibility. Data versioning becomes especially valuable when re-running or maintaining a previously developed pipeline. Since tools and data are linked, the outcome will always be the same, even years from now.
Scalability. The tool suite has specifically been designed for large-scale use. The present Galaxy server allows construction of large pipelines and processing of huge data files.
Modes of use. The Galaxy server has been set up to facilitate access to our Dintor tools by biologists with little background in bioinformatics. A second, expert mode of invocation is given by command line access to the tool suite, which can be downloaded as documented below. This allows implementation of very complex pipelines when Galaxy's limits have been reached.
Constant updates. Software and data files are constantly being updated. For certain tools we have a repository of tools and data ranging back to Ensembl 65. A stable Dintor software release can easily be furnished with new data sets by altering the configuration file. The framework accepts data from both human genome releases 37 and 38 (GRCh37 and GRCh38).
Extensibility. The tool suite can be customized by adding new data sets or by implementing new tools. The framework is open to extensions, and a wealth of well-documented library functions is at the developer's disposal. Extending the framework requires knowledge of the Python and Perl programming languages.

Tutorials

On the tutorial page, we have collected several small examples that demonstrate usage of this web server.

Documentation

When opening a Dintor tool from the left pane ("CBM/Dintor"), each of the tool options comes with a small help text next to it. Tools share several standard options, which are described as follows:

Input dataset: This specifies a previously uploaded data file or a data file that resulted from a previous tool invocation in the pipeline and forms the input for the Dintor tool.
CBM/Dintor release: As described in the introduction, data and tools are linked in order to control versioning of the data set internally used by the tool. Each release of the Dintor suite is coupled to data sets used by the tools. Specifying a release here implies use of a data set from a specific date (along with a certain version of the tool itself). The relationship between tools and their data sets is provided in this table.
Mapping data version: A stable release can be equipped with more recent data by defining a new data version. This is convenient when new data files need to be used between software updates. Our data versions usually follow the release cycle of the Ensembl database.
Input has header: Since tools add columns to tab-separated input files, it is in many cases informative to include column names for these added columns. Check this box if the input file has a header in its first line; that line is then treated as the header row, and the names of the added columns are written to it.
Empty cell value: When processing data in an automated fashion, it is very helpful to know when a tool could not compute results. In such cases a special code for "not available" is inserted, which in our tool suite defaults to two dashes, "--".
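
The interplay of the "Input has header" and "Empty cell value" options can be illustrated with a short Python sketch. It is not taken from the Dintor source: the function name `add_column`, its parameters, and the lookup callable are hypothetical, and only the tab-separated layout and the default "--" code come from the description above.

```python
EMPTY = "--"  # default "not available" code described in the text


def add_column(lines, column_name, compute, key_index=0,
               has_header=True, empty=EMPTY):
    """Append one column to tab-separated lines.

    compute(key) returns the new cell value, or None when no result
    could be computed; None is written as the empty-cell code.
    """
    out = []
    for i, line in enumerate(lines):
        cells = line.rstrip("\n").split("\t")
        if i == 0 and has_header:
            # The first line holds column names: add the new column's name.
            cells.append(column_name)
        else:
            value = compute(cells[key_index])
            cells.append(empty if value is None else str(value))
        out.append("\t".join(cells))
    return out
```

Because every row receives a cell, even when no result is available, downstream steps can rely on a fixed number of columns and filter on the empty-cell code.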

At the end of each tool, the section "Full reference" contains a link to an HTML page with extensive online documentation that describes the method, available options, and output in more detail and provides small examples.

Human Genome Data

The Dintor framework was developed at a time when human genome release 37 (GRCh37) was the reference genome in use. In the meantime, release 38 (GRCh38) has become available, and the framework takes this into account. All Dintor releases prior to 2017-05 (2015-04, 2014-12, ...) work on GRCh37 only, as this was the predominantly used reference genome at the time.

Dintor release 2017-05 is a special release that combines both human reference genomes, GRCh37 and GRCh38. Data version v-00 is based on Ensembl version 75, which was the last release that worked with GRCh37 data. Data versions v-01 and v-02 use GRCh38 as a reference and are based on data from Ensembl versions 84 and 88, respectively. This allows users to analyze data for both human genome reference datasets in a single Dintor release. Subsequent releases will incorporate GRCh38 data only. However, tools can be run with any Dintor release and therefore GRCh37 data will remain available in the future.

Dintor release 2020-06 is the first release to integrate updated GRCh37 data from Ensembl. Data version v-00 delivers GRCh37-based results, whereas data version v-01 returns results based on GRCh38. Orthology data are available only for GRCh38 (v-01). Both data versions relate to the same Ensembl release (r100 in this case). This scheme will hold for all future releases.

Dintor release 2021-12 is based on Ensembl 105 and has seen an update for all major databases included in Dintor, except for the GO database. We are unfortunately lacking the resources to align our GO API with the current GO distribution.
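
The release and data version combinations spelled out above can be summarized as a small lookup table. The Python sketch below is only a reading aid, not part of Dintor: the names `DATA_VERSIONS` and `assembly_for` are hypothetical, and release 2021-12 is left out because the text does not state its data versions explicitly.

```python
# (Dintor release, data version) -> (reference assembly, Ensembl version),
# transcribed from the description above.
DATA_VERSIONS = {
    ("2017-05", "v-00"): ("GRCh37", 75),
    ("2017-05", "v-01"): ("GRCh38", 84),
    ("2017-05", "v-02"): ("GRCh38", 88),
    ("2020-06", "v-00"): ("GRCh37", 100),
    ("2020-06", "v-01"): ("GRCh38", 100),
}


def assembly_for(release, data_version):
    """Return the reference assembly for a release and data version.

    Releases are named YYYY-MM, so plain string comparison orders them.
    Releases before 2017-05 work on GRCh37 only; combinations not listed
    in DATA_VERSIONS raise KeyError.
    """
    if release < "2017-05":
        return "GRCh37"
    return DATA_VERSIONS[(release, data_version)][0]
```

For example, `assembly_for("2020-06", "v-00")` returns `"GRCh37"`, while `assembly_for("2020-06", "v-01")` returns `"GRCh38"`.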

Download

You may want to use the Dintor suite on your local system for several reasons:

Privacy. A local installation can be coupled to a local installation of the database backends, so that no data will be queried from public servers on the Internet.
Performance. Without sharing your backends with other users, programs are expected to execute faster.
Developing complex pipelines. When it comes to highly complex pipelines, Galaxy may no longer be the ideal platform for their implementation. As each tool can alternatively be called from the command line, a shell script provides much greater flexibility in pipeline design.
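
As a rough illustration of scripted pipelines, the following Python sketch chains command-line programs so that the output of one stage becomes the input of the next. It uses Python instead of a shell script and rests on assumptions: the stages shown are placeholder commands built from the Python interpreter, not Dintor tools, and it assumes that each stage reads from standard input and writes to standard output. Please consult the tool documentation for the actual Dintor command names and options.

```python
import subprocess
import sys


def run_pipeline(stages, input_text):
    """Pass input_text through each command in turn and return the result.

    Each stage is an argument list; a non-zero exit status raises
    subprocess.CalledProcessError and stops the pipeline.
    """
    data = input_text
    for argv in stages:
        result = subprocess.run(argv, input=data, capture_output=True,
                                text=True, check=True)
        data = result.stdout
    return data


# Placeholder stages (hypothetical stand-ins for real command-line tools).
FIRST_COLUMN = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin: print(line.split('\\t')[0])",
]
UPPERCASE = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Keeping each stage as a plain argument list makes it easy to reorder, add, or swap steps, which is the kind of flexibility a scripted pipeline offers over a graphical workflow.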

The Dintor framework is packaged in three logically separated files for download:

Source code, dintor-src.tgz - Contains all releases of the Dintor tool suite. Please see the INSTALL file in the latest release after unpacking this tarball.
Data, dintor-data.tgz - Contains all versions of data files used by the tools. This file is well over 500 MB in size, so be prepared for a longer download time.
Ensembl API, ensembl-api.tgz - Contains all necessary Ensembl APIs needed for the Perl modules to access Ensembl data.

We keep a record of release information and associated data files used by the tools. This information is available in the source distribution as an OpenOffice/LibreOffice file, but can also be viewed directly.

Citing Dintor

If you find Dintor useful for your work, please cite the following publication:

Weichenberger CX, Blankenburg H, Palermo A, D'Elia Y, König E, Bernstein E, Domingues FS (2015) Dintor: functional annotation of genomic and proteomic data. BMC Genomics, 16:1080. [PubMed] [DOI]

EU General Data Protection Regulation

For information on the EU Regulation No. 2016/679 - General Data Protection Regulation (GDPR) with respect to this web site, we refer to the privacy policy statement on our main web site.

Galaxy is an open, web-based platform for data intensive biomedical research. The Galaxy team is a part of BX at Penn State, and the Biology and Mathematics and Computer Science departments at Emory University. The Galaxy Project is supported in part by NHGRI, NSF, The Huck Institutes of the Life Sciences, The Institute for CyberScience at Penn State, and Emory University.