Tutorials |
In this introductory example, we find the genomic coordinates of a small set of SNPs identified in a recent GWA study on Parkinson's Disease (Nalls et al, Nat Genet. 2014 Sep;46(9):989-93, PMID: 25064009, doi: 10.1038/ng.3043), where Table 1 lists representative dbSNP entries for loci found to be associated with Parkinson's Disease. The first few entries of the table are shown below:
dbSNP ID rs35749011 rs823118 rs10797576 rs6430538 ...
We import the data file into the active Galaxy dataset history by choosing the entry pd-gwas-nalls2014.tsv from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data.
The Dintor tool to accomplish this task is called snp2gcoords. In the Tools pane to the left we therefore select "snp2gcoords" from the "CBM/Dintor" group and execute the following steps.
|
After successful completion of the snp2gcoords tool, we will find an output dataset with the following lines (genomic coordinates refer to GRCh37):
dbSNP ID Genomic coordinates Ref Alt Strand rs35749011 GRCh37:1:155135036 G A + rs823118 GRCh37:1:205723572 C T + rs10797576 GRCh37:1:232664611 C T + rs6430538 GRCh37:2:135539967 C T + ...
If the Dintor framework has been installed locally, the same result can be obtained by running the following command in the command line:
snp2gcoords.pl --np --data-version v-00 -c 1 -H -u --inc-ref-alt --only-std-chrom <pd-gwas-nalls2014.tsvThis completes the first example on Dintor usage.
In this tutorial we will perform an exemplary disease gene prioritization. By integrating various heterogenous genomic and proteomic data sets into our analysis, we will try to identify the most promising candidatates from an input list of candidate disease genes. Such an input gene set may, for instance, result from a functional proteomics experiment (e.g. an affinity purification) or may be defined by a genomic range (e.g. all genes within a certain distance of a specific SNP). In order to identify the candidate genes that are most likely involved in a particular phenotype like a disease of interest or a general aspect of human health, the prioritization tool requires a second set of genes that are known to be relevant, in the following named training or seed genes. For more general information on integrative disease gene prioritization, see for example the publication by Aerts et al. (Nat Biotech., 2006 May;24(5):537-544. PMID: 16680138, doi: 10.1038/nbt1203).
In our example we will try to replicate the finding that the KCNH1 gene is important for epilepsy, which was recently reported by Simons et al. (Nat Genet. 2014 Nov 24. [Epub ahead of print] PMID: 25420144, doi: 10.1038/ng.3153).
Our candidate input genes are taken from a genomic region encompassing seven megabases on either side of the KCNH1 gene. The 106 genes located in this region are listed with their Ensembl Gene identifiers in the file ensg_candidates.tsv. We obtain our seed genes from a publication on the genetics of epilepsy by Busch et al.. (Epilepsy Behav. 2014 Dec;41C:297-306. PMID: 24973143, doi: 10.1016/j.yebeh.2014.05.026). Those 18 genes, also identified by their Ensembl gene identifier, are given in the file ensg_seeds.tsv. Both files can be imported into the active Galaxy history by choosing them from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data.
The Dintor tool that we will use in this tutorial is named "Prioritizer", thus we select the tool in the Tools panel within the "CBM/Dintor" group. After we have imported the two example data files into our active Galaxy history as described above, we will need to perform the following steps:
After hitting the execute button we will have to wait for several minutes while the computation is taking place.
The tool will generate a number of output files. The first, "Prioritizer output file", contains the brief summary of the prioritization, listing only the candidates and their scores (which are also used for sorting the results, with the most promising candidates on the top). If we look at the first lines of this output file we see that two genes share the first rank: KCNK2 (ENSG00000082482) and KCNH1 (ENSG00000143473), the gene that was reported to be relevant for epilepsy by Simons et al.:
Human Ensembl Gene ID Meta Rank Score (Human Ensembl Gene ID) ENSG00000082482 -7.995643604 ENSG00000143473 -7.995643604 ENSG00000163531 -7.667139537 ENSG00000162889 -6.455198563 ENSG00000184144 -5.693058511 ...
One potential problem of disease gene prioritization is the bias towards prior knowledge, as published results will eventually end up in biological databases like the Gene Ontology (GO) or UniProtKB, and will thus be used in the prioritiation process. However, as the disease association between KCNH1 and epilepsy has only been reported in this advanced access publication, we can safely exclude any bias in our example.
If we want to investigate, why a certain gene like KCNH1 achieved a particular score/rank, we can do so by looking at the two other output files. The "Prioritizer detail file" reports additional gene identifiers (columns 2 and 3) and a summary of all the individual data sources used in the prioritization. For KCNH1 (ENSG00000143473), we can see that no protein complex or protein interaction to any of the seed genes is reported in the integrative interaction database iRefIndex (columns 4-10). However, KCNH1 is reported to be involved in a biochemical reaction with one seed gene (columns 11 and 12) and, more broadly, share a Reactome pathway with five other seeds (columns 13 and 14). We can also see that the functional annotations, in particular the GO biological process annotation, of KCNH1 are very similar to 13 seed genes (columns 15 and 16). Even more detail is provided in the "Prioritizer full file". This file additionally lists information like the original database identifiers, allowing to backtrace any of the annotations to its original source.
If the Dintor framework has been installed locally, the same result can be obtained by running the following command in the command line:
python Prioritizer.py --in ensg --from-file ensg_candidates.tsv --panel-file-gene ensg_seeds.tsv --output-file-details output_details.tsv --output-file-full output_full.tsv --data-version v-00
We take as an example the Parkinson's Disease GWA study of Nalls et al. (PMID: 25064009, doi: 10.1038/ng.3043) and examine genes of loci associated with Parkinson's Disease presented in Table 1 of the publication.
The data file is the same as in the initial basic example. We import it into the active Galaxy dataset history by choosing the entry pd-gwas-nalls2014.tsv from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data. The workflow* can be imported accordingly through Shared Data -> Published Workflows and by selecting Import in the dropdown box next to the name PD GWAS Tutorial.
The workflow is imported as follows. From the dropdown menu Shared Data -> Published Workflows select PD GWAS Tutorial -> Import. After successful import you click on the link start using this workflow. Access the newly imported workflow imported: PD GWAS Tutorial -> Run. The previously imported input file pd-gwas-nalls2014.tsv is supplied in the first line, dbSNP data. Clicking the button Run workflow will start the computation, which should be done within a minute.
The processing steps in the workflow are summarized graphically as follows:
The input data set is provided as the first step, which is the only step requiring user interaction. After this, dbSNP IDs are converted to human GRCh37 coordinates by the snp2gcoords tool, which in turn are used to find the encompassing LD blocks by the Pos2LDBlock module (see PMID: 24423111, doi: 10.1186/1471-2105-15-10 for further information on computation of LD blocks). We then find the protein coding genes that are contained in the LD blocks (Interval2Genes) and get their HGNC symbols (HSGeneIdConverter). The next step determines fly orthologs and is carried out by the HSGeneOrthologyMapper tool. The resulting flybase gene identifiers are converted to both CG identifiers and Vienna Drosophila Resource Center (VDRC) transformant IDs by using the DMGeneIdConverter tool.
If you have a local installation of the Dintor tools, you can also run the workflow, which has also been implemented as the shell script Dintor_GWAS.sh.
This workflow implements selection and annotation of variants from a next-generation sequencing experiment. First, variants are filtered so that the sample genotypes match an expected pattern of inheritance in samples affected and unaffected by a Mendelian disease. Then the variants are are annotated with information from the Ensembl and ClinVar databases.
This workflow requires two input files. We import them into the active Galaxy history by choosing the entries cohort.vcf and cohort.ped from the data library via the dropdown menu Shared Data -> Data Libraries, library name Tutorial data. The workflow can be imported accordingly through Shared Data -> Published Workflows and by selecting Import in the dropdown box next to the name NGS filter annotate. A new page loads where we select start using this workflow. On the dropdown menu of the NGS filter annotate workflow we select Run. The workflow is displayed and now we need to select the input files. Under Step 1: VCF file we select the file cohort.vcf and under Step 2: PED file we select the file cohort.ped. Then we start the computation by clicking on Run workflow at the bottom of the page. Execution of the workflow takes roughly a minute.
The processing steps in the workflow are shown below:
In this example, the cohort.vcf file has variant calls of six samples, of which two are affected by an autosomal recessive disease. The four remaining samples are healthy controls. The file cohort.ped specifies the gender and phenotype of all six samples.
The first tool MendelianFilter selects variants in cohort.vcf that have a homozygous alternative genotype in the two affected samples and homozygous reference or heterozygous alternative genotypes in the unaffected samples. (In a use case where no variant selection based on case-control comparisons is desired, we can copy this workflow to create a new one and then remove the first tool.) In this case, no ped file is neccessary. The second tool VCF2Dint converts the VCF file into the genomic coordinate format used by the Dintor tools. The remaining three tools gcoords2cons, gcoordsconservation, and ClinVarAnnotator annotate the variants with various information from the Ensembl database (e.g. variant consequence, gene ids, HGVS ids, SIFT score, GERP score), and from the ClinVar database (e.g. clinical significance).
If you have a local installation of the Dintor tools, you can also run the workflow, which has also been implemented as the shell script Dintor_NGS.sh.
* These tutorial examples make use of workflows and require a login for Galaxy session management. Registration is straightforward through the dropdown menu User -> Register and requires only an email address and a password. The account is immediately available without any further confirmation emails or alike.