Tutorials

In this tutorial section we present four example usages of the Dintor framework, introducing the suite with a basic example and then gradually advance in task complexity.

Basic example

In this introductory example, we find the genomic coordinates of a small set of SNPs identified in a recent GWA study on Parkinson's Disease (Nalls et al, Nat Genet. 2014 Sep;46(9):989-93, PMID: 25064009, doi: 10.1038/ng.3043), where Table 1 lists representative dbSNP entries for loci found to be associated with Parkinson's Disease. The first few entries of the table are shown below:

dbSNP ID
rs35749011
rs823118
rs10797576
rs6430538
...

We import the data file into the active Galaxy dataset history by choosing the entry pd-gwas-nalls2014.tsv from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data.

The Dintor tool to accomplish this task is called snp2gcoords. In the Tools pane to the left we therefore select "snp2gcoords" from the "CBM/Dintor" group and execute the following steps.

snp2gcoords image
  1. Choose the Parkinson's disease data set from the GWA study performed by Nalls et al, see PMID: 25064009, doi: 10.1038/ng.3043.
  2. From the list of available Dintor releases, choose the latest version. Release naming is by year and month, YYYY-MM. In our example, the chosen release is "2014-12".
  3. Our example data file has a header line, the single present column is termed "dbSNP ID". We therefore inform the tool to treat the first line in the input data set as a header. For the appended columns, column names are thus written to this header line.
  4. Galaxy has found one column in the tab-separated input file. This is also the column number where our dbSNP IDs are located. We therefore select column number 1, "c1".
  5. Some Dintor releases span multiple versions of Ensembl. As our data updates follow the Ensembl release cycles, but we also want to process data originating from each Ensembl release, a data version number specifies which data we want to choose. For example, Dintor release 2013-12 includes both Ensembl 73 and 74, which can be selected by data version number v-00, and v-01, respectively.
  6. We are not interested in ambiguous dbSNP entries, which map to multiple locations in the human genome. Therefore, we check this box.
  7. We check this box, as we are interested in the reference and alternate alleles, as well as the strand this SNP referes to. This will add three columns to the output, "Ref", "Alt" and "Strand".
  8. Even though we have indicated above that we do wish to exclude ambiguously mapped SNP locations, Ensembl defines special regions such as the major histocompatibility complex (MHC), or stable locus reference genomic (LRG). These are excluded by marking the checkbox.
  9. Press the Execute button to start the computation.

After successful completion of the snp2gcoords tool, we will find an output dataset with the following lines (genomic coordinates refer to GRCh37):

dbSNP ID    Genomic coordinates  Ref  Alt  Strand
rs35749011  GRCh37:1:155135036   G    A    +
rs823118    GRCh37:1:205723572   C    T    +
rs10797576  GRCh37:1:232664611   C    T    +
rs6430538   GRCh37:2:135539967   C    T    +
...

If the Dintor framework has been installed locally, the same result can be obtained by running the following command in the command line:

snp2gcoords.pl --np --data-version v-00 -c 1 -H -u --inc-ref-alt
  --only-std-chrom <pd-gwas-nalls2014.tsv

This completes the first example on Dintor usage.

A gene prioritization example

In this tutorial we will perform an exemplary disease gene prioritization. By integrating various heterogenous genomic and proteomic data sets into our analysis, we will try to identify the most promising candidatates from an input list of candidate disease genes. Such an input gene set may, for instance, result from a functional proteomics experiment (e.g. an affinity purification) or may be defined by a genomic range (e.g. all genes within a certain distance of a specific SNP). In order to identify the candidate genes that are most likely involved in a particular phenotype like a disease of interest or a general aspect of human health, the prioritization tool requires a second set of genes that are known to be relevant, in the following named training or seed genes. For more general information on integrative disease gene prioritization, see for example the publication by Aerts et al. (Nat Biotech., 2006 May;24(5):537-544. PMID: 16680138, doi: 10.1038/nbt1203).

In our example we will try to replicate the finding that the KCNH1 gene is important for epilepsy, which was recently reported by Simons et al. (Nat Genet. 2014 Nov 24. [Epub ahead of print] PMID: 25420144, doi: 10.1038/ng.3153).

Workflow preparation

Our candidate input genes are taken from a genomic region encompassing seven megabases on either side of the KCNH1 gene. The 106 genes located in this region are listed with their Ensembl Gene identifiers in the file ensg_candidates.tsv. We obtain our seed genes from a publication on the genetics of epilepsy by Busch et al.. (Epilepsy Behav. 2014 Dec;41C:297-306. PMID: 24973143, doi: 10.1016/j.yebeh.2014.05.026). Those 18 genes, also identified by their Ensembl gene identifier, are given in the file ensg_seeds.tsv. Both files can be imported into the active Galaxy history by choosing them from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data.

The Dintor tool that we will use in this tutorial is named "Prioritizer", thus we select the tool in the Tools panel within the "CBM/Dintor" group. After we have imported the two example data files into our active Galaxy history as described above, we will need to perform the following steps:

  1. Select the file ensg_candidates.tsv as input dataset.
  2. Choose CBM/Dintor release "2014-12" to fix the external databases and reproduce the results below.
  3. Choose the file ensg_seeds.tsv in the gene panel file name selection.
  4. Select "ensg" as the input data type, as both our files use Ensembl gene identifiers. Other supported options would be UniProtKB accession numbers or Entrez gene identifiers.

After hitting the execute button we will have to wait for several minutes while the computation is taking place.

Result interpretation

The tool will generate a number of output files. The first, "Prioritizer output file", contains the brief summary of the prioritization, listing only the candidates and their scores (which are also used for sorting the results, with the most promising candidates on the top). If we look at the first lines of this output file we see that two genes share the first rank: KCNK2 (ENSG00000082482) and KCNH1 (ENSG00000143473), the gene that was reported to be relevant for epilepsy by Simons et al.:

Human Ensembl Gene ID	Meta Rank Score (Human Ensembl Gene ID)
ENSG00000082482	        -7.995643604
ENSG00000143473	        -7.995643604
ENSG00000163531	        -7.667139537
ENSG00000162889	        -6.455198563
ENSG00000184144	        -5.693058511
...

One potential problem of disease gene prioritization is the bias towards prior knowledge, as published results will eventually end up in biological databases like the Gene Ontology (GO) or UniProtKB, and will thus be used in the prioritiation process. However, as the disease association between KCNH1 and epilepsy has only been reported in this advanced access publication, we can safely exclude any bias in our example.

If we want to investigate, why a certain gene like KCNH1 achieved a particular score/rank, we can do so by looking at the two other output files. The "Prioritizer detail file" reports additional gene identifiers (columns 2 and 3) and a summary of all the individual data sources used in the prioritization. For KCNH1 (ENSG00000143473), we can see that no protein complex or protein interaction to any of the seed genes is reported in the integrative interaction database iRefIndex (columns 4-10). However, KCNH1 is reported to be involved in a biochemical reaction with one seed gene (columns 11 and 12) and, more broadly, share a Reactome pathway with five other seeds (columns 13 and 14). We can also see that the functional annotations, in particular the GO biological process annotation, of KCNH1 are very similar to 13 seed genes (columns 15 and 16). Even more detail is provided in the "Prioritizer full file". This file additionally lists information like the original database identifiers, allowing to backtrace any of the annotations to its original source.

If the Dintor framework has been installed locally, the same result can be obtained by running the following command in the command line:

python Prioritizer.py --in ensg --from-file ensg_candidates.tsv
  --panel-file-gene ensg_seeds.tsv --output-file-details output_details.tsv
  --output-file-full output_full.tsv --data-version v-00

A GWA analysis workflow*

We take as an example the Parkinson's Disease GWA study of Nalls et al. (PMID: 25064009, doi: 10.1038/ng.3043) and examine genes of loci associated with Parkinson's Disease presented in Table 1 of the publication.

Workflow preparation

The data file is the same as in the initial basic example. We import it into the active Galaxy dataset history by choosing the entry pd-gwas-nalls2014.tsv from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data. The workflow* can be imported accordingly through Shared Data -> Published Workflows and by selecting Import in the dropdown box next to the name PD GWAS Tutorial.

The workflow is imported as follows. From the dropdown menu Shared Data -> Published Workflows select PD GWAS Tutorial -> Import. After successful import you click on the link start using this workflow. Access the newly imported workflow imported: PD GWAS Tutorial -> Run. The previously imported input file pd-gwas-nalls2014.tsv is supplied in the first line, dbSNP data. Clicking the button Run workflow will start the computation, which should be done within a minute.

Result interpretation

The processing steps in the workflow are summarized graphically as follows:

PD GWAS workflow

The input data set is provided as the first step, which is the only step requiring user interaction. After this, dbSNP IDs are converted to human GRCh37 coordinates by the snp2gcoords tool, which in turn are used to find the encompassing LD blocks by the Pos2LDBlock module (see PMID: 24423111, doi: 10.1186/1471-2105-15-10 for further information on computation of LD blocks). We then find the protein coding genes that are contained in the LD blocks (Interval2Genes) and get their HGNC symbols (HSGeneIdConverter). The next step determines fly orthologs and is carried out by the HSGeneOrthologyMapper tool. The resulting flybase gene identifiers are converted to both CG identifiers and Vienna Drosophila Resource Center (VDRC) transformant IDs by using the DMGeneIdConverter tool.

If you have a local installation of the Dintor tools, you can also run the workflow, which has also been implemented as the shell script Dintor_GWAS.sh.

An NGS analysis workflow*

This workflow implements selection and annotation of variants from a next-generation sequencing experiment. First, variants are filtered so that the sample genotypes match an expected pattern of inheritance in samples affected and unaffected by a Mendelian disease. Then the variants are are annotated with information from the Ensembl and ClinVar databases.

Workflow preparation

This workflow requires two input files. We import them into the active Galaxy history by choosing the entries cohort.vcf and cohort.ped from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data. The workflow can be imported accordingly through Shared Data -> Published Workflows and by selecting Import in the dropdown box next to the name NGS filter annotate. A new page loads where we select start using this workflow. On the dropdown menu of the NGS filter annotate workflow we select Run. The workflow is displayed and now we need to select the input files. Under Step 1: VCF file we select the file cohort.vcf and under Step 2: PED file we select the file cohort.ped. Then we start the computation by clicking on Run workflow at the bottom of the page. Execution of the workflow takes roughly a minute.

Result interpretation

The processing steps in the workflow are shown below:

PD GWAS workflow

In this example, the cohort.vcf file has variant calls of six samples, of which two are affected by an autosomal recessive disease. The four remaining samples are healthy controls. The file cohort.ped specifies the gender and phenotype of all six samples.

The first tool MendelianFilter selects variants in cohort.vcf that have a homozygous alternative genotype in the two affected samples and homozygous reference or heterozygous alternative genotypes in the unaffected samples. (In a use case where no variant selection based on case-control comparisons is desired, we can copy this workflow to create a new one and then remove the first tool.) In this case, no ped file is neccessary. The second tool VCF2Dint converts the VCF file into the genomic coordinate format used by the Dintor tools. The remaining three tools gcoords2cons, gcoordsconservation, and ClinVarAnnotator annotate the variants with various information from the Ensembl database (e.g. variant consequence, gene ids, HGVS ids, SIFT score, GERP score), and from the ClinVar database (e.g. clinical significance).

If you have a local installation of the Dintor tools, you can also run the workflow, which has also been implemented as the shell script Dintor_NGS.sh.


* These tutorial examples make use of workflows and require a login for Galaxy session management. Registration is straightforward through the dropdown menu User -> Register and requires only an email address and a password. The account is immediately available without any further confirmation emails or alike.