Data Integrator Tool Suite
Introduction
Welcome to the Center of Biomedicine (CBM) at EURAC research. This is the
home page of the Data Integration (Dintor) tool suite with more than thirty
modules ready for use by bioinformaticians and biologists working in genomics
research.
Data emerging from genome-wide association (GWA) studies and next-generation
sequencing (NGS) technologies provide a wealth of information ready to be used
by the scientific community. These large data sets form the basis for further
analysis based on the individual researcher's focus. Large-scale processing
for non-bioinformaticians, however, is hampered by the way these data sets are
stored. For example, finding the closest protein coding gene next to a
location encoded by a dbSNP identifier from a GWA result table may be done for
a few SNPs of interest using the genome browsers, but the task becomes arduous
once more than a handful of such entries have to be queried.
We have therefore developed Dintor, a suite of tools that
facilitate working with GWA and NGS data. Beyond this goal, the framework
offers modules for high level functional annotation of genes and gene products
such as gene set prioritization, functional similarity of proteins, or
clinical significance of variation data. Each of these tools has been
designed to perform a basic task independently. The real power of the tool
suite becomes apparent once these tools are combined into a pipeline to
accomplish a complex analysis.
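To make the closest-gene example from above concrete, here is a minimal Python sketch of such a basic task applied to a batch of positions. It is not Dintor's implementation: the `Gene` record, the `closest_gene` function, and the toy gene names and coordinates are assumptions made up for illustration, and real input would come from versioned annotation data.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    chrom: str
    start: int
    end: int


def distance(gene: Gene, pos: int) -> int:
    """Distance in base pairs from pos to the gene; 0 if pos lies inside it."""
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def closest_gene(genes: List[Gene], chrom: str, pos: int) -> Optional[Gene]:
    """Return the gene on chrom closest to pos, or None if chrom has no genes."""
    candidates = [g for g in genes if g.chrom == chrom]
    if not candidates:
        return None
    return min(candidates, key=lambda g: distance(g, pos))


# Toy annotation and SNP positions (made-up values, not real genome data).
TOY_GENES = [
    Gene("GENE_A", "1", 100, 200),
    Gene("GENE_B", "1", 500, 800),
    Gene("GENE_C", "2", 50, 90),
]
TOY_SNPS = [("1", 260), ("1", 600), ("3", 10)]

# One lookup per SNP; the loop is what replaces manual genome browser queries.
TOY_RESULTS = [closest_gene(TOY_GENES, chrom, pos) for chrom, pos in TOY_SNPS]
```

The linear scan keeps the sketch short. For genome-scale tables, an index sorted by position would be the more natural choice.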
The hallmarks of our approach are:
- Modularity. A complex analysis is broken down into small,
manageable units, each of which can be processed by one of the available
Dintor tools. The tools perform specific tasks and have
been designed to cover a wide range of frequently encountered application
scenarios in genome research, from simple problems like gene
identifier conversions to complex calculations like functional similarity
of gene products and gene prioritization.
- Data versioning. Genomic and proteomic databases are updated
regularly at short intervals. A data set from today may already be out of
date a month from now. It is a laborious task to keep track of the exact
dates and versions of all data sets that are used for a certain
analysis. We address this issue of constantly changing data sets by
systematically tying tools to their underlying data sets and keeping
versioned copies of all these data. Thus it is always clear which data
have been used during a certain analysis.
- Reproducibility. Data versioning becomes especially valuable
when re-running or maintaining a previously developed pipeline. Since
tools and data are linked, the outcome will always be the same, even
years from now.
- Scalability. The tool suite has specifically been designed for
large-scale use. The present Galaxy server allows construction of large
pipelines and processing of huge data files.
- Modes of use. The Galaxy server has been set up to facilitate
access to our Dintor tools by biologists with little
background in bioinformatics. A second, expert mode of invocation is given
by command line access to the tool suite, which can be downloaded as
documented below. This allows implementation of very complex pipelines
when Galaxy's limits have been reached.
- Constant updates. Software and data files are constantly being
updated. For certain tools we have a repository of tools and data ranging
back to Ensembl 65. A stable Dintor software release can
easily be furnished with new data sets by altering the configuration
file. The framework accepts data from both human genome release 37 and 38
(GRCh37 and GRCh38).
- Extensibility. The tool suite can be customized with either new
data sets or by implementing new tools. The framework is very open to
extensions and a wealth of well-documented library functions is at the
developer's disposal. This requires knowledge of the Python and Perl
programming languages.
Tutorials
On the tutorial page, we have collected several
small examples that demonstrate usage of this web server.
Documentation
When opening a Dintor tool from the left pane ("CBM/Dintor"),
each of the tool options comes with a small help text next to it. Tools share
several standard options, which are described as follows:
- Input dataset: This specifies a previously uploaded data file
or a data file that resulted from a previous tool invocation in the
pipeline and forms the input for the Dintor tool.
- CBM/Dintor release: As described in the introduction, data and
tools are linked in order to control versioning of the data set internally
used by the tool. Each release of the Dintor suite is
coupled to data sets used by the tools. Specifying a release here implies
use of a data set from a specific date (along with a certain version of
the tool itself). The relationship between tools and their data sets is
provided in this table.
- Mapping data version: A stable release can be equipped with more
recent data by defining a new data version. This is convenient when
new data files need to be used between software updates. Our data versions
usually follow the release cycle of the Ensembl database.
- Input has header: Since tools add columns to tab-separated input
files, in many cases it is informative to include column names for these
added columns. Check this box if the input file has a header in the
first line; the names of the added columns will then be written to that
line.
- Empty cell value: When processing data in an automated fashion,
it is very helpful to know when a tool could not compute results. In such
cases a special code for "not available" is inserted, which in our tool
suite defaults to two dashes, "--".
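The interplay of the "Input has header" and "Empty cell value" options can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual Dintor code: the function names `add_column` and `annotate_tsv`, their parameters, and the toy lookup in the usage example are all hypothetical.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, List, Optional

NOT_AVAILABLE = "--"  # default "empty cell value" described above


def add_column(
    rows: Iterable[List[str]],
    compute: Callable[[List[str]], Optional[str]],
    column_name: str,
    has_header: bool,
    empty_value: str = NOT_AVAILABLE,
) -> Iterator[List[str]]:
    """Append one computed column to each row of a table.

    If has_header is true, the first row is treated as the header and
    receives column_name instead of a computed value. Rows for which
    compute returns None receive empty_value.
    """
    rows = iter(rows)
    if has_header:
        header = next(rows, None)
        if header is None:  # empty input: nothing to emit
            return
        yield header + [column_name]
    for row in rows:
        value = compute(row)
        yield row + [empty_value if value is None else value]


def annotate_tsv(
    text: str,
    compute: Callable[[List[str]], Optional[str]],
    column_name: str,
    has_header: bool,
) -> str:
    """Apply add_column to tab-separated text and return tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(add_column(reader, compute, column_name, has_header))
    return out.getvalue()


# Usage with a toy lookup (hypothetical identifiers and labels).
TOY_LOOKUP = {"id1": "x"}
TOY_INPUT = "id\tscore\nid1\t0.5\nid2\t0.7\n"
TOY_OUTPUT = annotate_tsv(
    TOY_INPUT, lambda row: TOY_LOOKUP.get(row[0]), "label", has_header=True
)
```

Because `id2` is missing from the toy lookup, its new cell holds `--`, which a downstream step can test for explicitly instead of guessing why a value is absent.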
At the end of each tool, in the section "Full reference", there is a
link to an HTML page with extensive online documentation, describing in more
detail the method, available options, and output and providing small examples.
Human Genome Data
The Dintor framework was developed at a time when
human genome release 37 (GRCh37) was used as the reference genome. In the
meantime, release 38 (GRCh38) has become available, and the framework takes this into
account. All Dintor releases prior to 2017-05 (2015-04,
2014-12, ...) work on GRCh37 only, as during that time this was the
predominantly used reference genome.
Dintor release 2017-05 is a special release that combines both
human reference genomes, GRCh37 and GRCh38. Data version v-00 is based on
Ensembl version 75, which was the last release that worked with GRCh37 data.
Data versions v-01 and v-02 use GRCh38 as a reference and are based on data
from Ensembl versions 84 and 88, respectively. This allows users to analyze
data for both human genome reference datasets in a single Dintor
release. Subsequent releases will incorporate GRCh38 data only. However,
tools can be run with any Dintor release and therefore GRCh37
data will remain available in the future.
Dintor release 2020-06 is the first release to integrate updated
GRCh37 data from Ensembl. Data version v-00 delivers GRCh37-based results, whereas
data version v-01 returns results based on GRCh38. Orthology data are available
only for GRCh38 (v-01). Both data versions relate to the same Ensembl release
(r100 in this case). This scheme will hold for all future releases.
Dintor release 2021-12 is based on Ensembl 105 and has seen an
update for all major databases included in Dintor, except for
the GO database. We are unfortunately lacking the resources to align our GO API
with the current GO distribution.
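The release and data version combinations named in this section can be summarized as a small lookup table. The Python sketch below is for illustration only: the `DataSource` type and the `resolve` function are hypothetical names, not part of the Dintor configuration format. Only combinations whose Ensembl version is spelled out explicitly above are included, so releases prior to 2017-05 and release 2021-12 are deliberately left out.

```python
from typing import Dict, NamedTuple, Tuple


class DataSource(NamedTuple):
    """Reference assembly and Ensembl version behind a data version."""
    assembly: str
    ensembl_release: int


# (Dintor release, data version) -> data source, as stated in the text above.
DATA_VERSIONS: Dict[Tuple[str, str], DataSource] = {
    ("2017-05", "v-00"): DataSource("GRCh37", 75),
    ("2017-05", "v-01"): DataSource("GRCh38", 84),
    ("2017-05", "v-02"): DataSource("GRCh38", 88),
    ("2020-06", "v-00"): DataSource("GRCh37", 100),
    ("2020-06", "v-01"): DataSource("GRCh38", 100),
}


def resolve(release: str, data_version: str) -> DataSource:
    """Look up the data source for a release and data version.

    Raises KeyError for combinations not listed in DATA_VERSIONS.
    """
    try:
        return DATA_VERSIONS[(release, data_version)]
    except KeyError:
        raise KeyError(
            f"unknown combination: release {release}, data version {data_version}"
        ) from None


# Example: which assembly does release 2020-06 with data version v-00 use?
EXAMPLE = resolve("2020-06", "v-00")
```

Recording the pair of release and data version alongside each analysis result is enough to recover the assembly and Ensembl version later.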
Download
You may want to use the Dintor suite on your local system for
several reasons:
- Privacy. A local installation can be coupled to a local
installation of the database backends, so that no data will be queried
from the public servers on the Internet.
- Performance. Without sharing your backends with other users,
programs are expected to execute faster.
- Developing complex pipelines. When it comes to highly complex
pipelines, Galaxy may no longer be the ideal platform for their
implementation. As each tool can alternatively be called from the command
line, a shell script provides much greater flexibility in pipeline design.
The Dintor framework is packaged in three logically separated
files for download:
- Source code, dintor-src.tgz - Contains
all releases of the Dintor tool suite. Please see the
INSTALL file in the latest release after unpacking this tarball.
- Data, dintor-data.tgz - Contains
all versions of data files used by the tools. This file is well over 500 MB
in size, so be prepared for a longer download.
- Ensembl API, ensembl-api.tgz - Contains
all Ensembl APIs needed for the Perl modules to access Ensembl
data.
We keep a record of release information and associated data files used by the
tools. This information is available in the source distribution as an
OpenOffice/LibreOffice file, but can also be viewed directly.
Citing Dintor
If you find Dintor useful for your work, please cite the following publication:
Weichenberger CX, Blankenburg H, Palermo A, D'Elia Y, König E, Bernstein E,
Domingues FS (2015) Dintor: functional annotation of genomic and proteomic data.
BMC Genomics, 16:1080.
[PubMed]
[DOI]
EU General Data Protection Regulation
For information on the EU Regulation No. 2016/679 - General Data Protection Regulation
(GDPR) with respect to this web site, we refer to the
privacy policy statement on our main web site.
Galaxy is an open, web-based platform for data intensive biomedical research.
The Galaxy team is a part of BX at Penn State, and the Biology and
Mathematics and Computer Science departments at Emory University.
The Galaxy Project is supported in part by NHGRI, NSF, The Huck Institutes of
the Life Sciences, The Institute for CyberScience at Penn State, and
Emory University.