Data Integrator
The Prioritizer is a Dintor module for the prioritization of candidate genes or gene products, i.e., for identifying the most promising candidates from a (potentially long) input list. Of particular importance for this prioritization process is information about the respective context, i.e., genes and/or processes known to be involved in a certain disease or an aspect of human health. These knowledge or seed panels provide the basis for a guilt-by-association approach to determine connections between candidates and disease genes/processes. The Prioritizer combines a number of Dintor tools to incorporate information about protein interactions and protein complexes, biochemical pathways, and functional gene annotations. The different data types are ultimately combined into a single ranked list of candidates using a MetaRanker approach (see MetaRanker - Combining multiple score columns into a single score).
The main input for the Prioritizer is a file containing a list of gene or protein identifiers describing the candidates (defined on the command line via the --from-file parameter). As different bioinformatics resources use different identifier systems (e.g., disease mutations are annotated to Entrez genes, annotations or biochemical pathways to UniProt/SwissProt proteins), the Prioritizer supports a range of identifiers as input (defined via the --in command-line parameter), in particular Entrez gene identifiers (--in entrez), UniProt accession numbers (--in uniprot), and Ensembl gene identifiers (--in ensg). As part of the prioritization process, input provided in one of these systems is converted to the other identifier systems using Ensembl-based mapping tables. Note that because the identifier mappings are not always one-to-one (e.g., multiple Ensembl gene records can exist for a single Entrez gene), different results might be obtained depending on the initial identifier system.
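The effect of non-one-to-one mappings can be illustrated with a small sketch. It is not Dintor code: the mapping rows, the identifiers, and the helper names build_mapping and convert are made up for illustration, and the real Ensembl-based tables are assumed only to behave like a many-to-many lookup.

```python
from collections import defaultdict

# Hypothetical mapping rows (Entrez gene ID, Ensembl gene ID); IDs are made up.
MAPPING_ROWS = [
    ("1001", "ENSG_A"),
    ("1001", "ENSG_B"),  # one Entrez gene with two Ensembl gene records
    ("1002", "ENSG_C"),
]


def build_mapping(rows):
    """Build lookup tables in both directions from (entrez, ensg) pairs."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for entrez, ensg in rows:
        forward[entrez].add(ensg)
        backward[ensg].add(entrez)
    return forward, backward


def convert(ids, table):
    """Convert identifiers with a lookup table; unmapped IDs are dropped."""
    converted = set()
    for identifier in ids:
        converted |= table.get(identifier, set())
    return converted


entrez_to_ensg, ensg_to_entrez = build_mapping(MAPPING_ROWS)

# Starting from the Entrez gene yields both Ensembl records.
from_entrez = convert(["1001"], entrez_to_ensg)

# Starting from one Ensembl record and mapping via Entrez back to Ensembl
# expands the candidate set, so the effective input differs.
round_trip = convert(convert(["ENSG_A"], ensg_to_entrez), entrez_to_ensg)

print(sorted(from_entrez))
print(sorted(round_trip))
```

In this toy table, a candidate list given as a single Ensembl identifier ends up covering two Ensembl records once it has passed through the Entrez system, which is one way the choice of input identifier system can change the result.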
A second list of identifiers defines the gene knowledge panel (command-line parameter --panel-file-gene). This list, which has to use the same identifier system as the list of candidates, is usually manually curated by experts in the respective field, for example by incorporating information from disease-centered resources such as OMIM or HGMD.
A third list can be provided to the Prioritizer to specify seed processes (command-line parameter --panel-file-go). In contrast to the candidate and gene panel lists, this list has to contain Gene Ontology (GO) terms and is used to screen the candidates for matching functional annotations.
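As a rough illustration of screening candidates against seed processes, the following sketch reports candidates whose annotations include at least one seed GO term. It is a simplified assumption, not the Prioritizer's implementation: the function name screen_by_go and the candidate annotations are hypothetical, and only exact term matches are considered (the GO hierarchy is ignored).

```python
def screen_by_go(annotations, seed_terms):
    """Return candidates annotated with at least one seed GO term.

    annotations: dict mapping candidate ID -> iterable of GO term IDs
    seed_terms:  iterable of GO term IDs from the seed process panel
    """
    seeds = set(seed_terms)
    hits = {}
    for candidate, terms in annotations.items():
        shared = seeds & set(terms)
        if shared:
            hits[candidate] = sorted(shared)
    return hits


# Hypothetical candidate annotations; the GO identifiers themselves are real
# (GO:0006915 apoptotic process, GO:0006281 DNA repair).
annotations = {
    "cand1": {"GO:0006915", "GO:0006281"},
    "cand2": {"GO:0006281"},
    "cand3": set(),
}
matches = screen_by_go(annotations, ["GO:0006915"])
print(matches)
```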
An important design aspect of the Prioritizer module is to keep the whole prioritization process as simple as possible for the user. For example, we try to provide useful default values for most of the tool input parameters, so that the tool can be run without setting numerous options. Advanced users can still adjust the behavior of the module by changing those values, e.g., by adjusting individual ranking weights to their preference. Accordingly, all input files are by default assumed to be flat text files containing only a single identifier column and no header. If a user's files have a different layout, e.g., if the identifier is in a column other than the first or if a header line is present, this can be specified via the respective parameters (using either the advanced tool options in the Galaxy interface or the respective command-line options -H, -c, --panel-file-gene-h, --panel-file-gene-c, --panel-file-go-h, and --panel-file-go-c).
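The default file layout and the header/column variations described above can be mimicked with a short reader. This is a sketch under stated assumptions, not the Prioritizer's parser: read_identifiers, its parameters, the tab separator, and the 0-based column index are all hypothetical choices made for the example.

```python
import io


def read_identifiers(handle, column=0, has_header=False, sep="\t"):
    """Read one identifier per line from a flat text file.

    column is 0-based here; has_header skips the first line.
    """
    identifiers = []
    for line_no, line in enumerate(handle):
        if has_header and line_no == 0:
            continue  # skip the header line
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split(sep)
        identifiers.append(fields[column].strip())
    return identifiers


# Default layout: a single identifier column, no header.
default_layout = io.StringIO("P04637\nP38398\n")

# Alternative layout: header line present, identifier in the second column.
custom_layout = io.StringIO("symbol\taccession\nTP53\tP04637\nBRCA1\tP38398\n")

ids_default = read_identifiers(default_layout)
ids_custom = read_identifiers(custom_layout, column=1, has_header=True)
print(ids_default)
print(ids_custom)
```

Both calls yield the same two UniProt accession numbers, which is the point of the header and column options: differently laid out files can describe the same candidate list.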
The main output of the Prioritizer, an ordered list of candidates with their MetaRank score, is printed to the standard output (or, if you are using the Galaxy interface, is redirected to the main result file). This list is sorted and the candidates with the strongest associations to the input panels are on top. Only candidates for which some association to the knowledge panels has been found will be reported. For more details on the MetaRank score, see the documentation of the MetaRanker - Combining multiple score columns into a single score.
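To make the idea of a single ranked list more concrete, the sketch below combines per-evidence scores into one ordering using a weighted average of ranks. This is deliberately a simple stand-in and should not be read as the MetaRanker algorithm: the function names, the evidence names, the weights, and the convention that a lower combined value is better are all assumptions for this example.

```python
def rank_scores(scores):
    """Dense ranks per candidate (1 = best); higher raw score is better."""
    ordered = sorted(set(scores.values()), reverse=True)
    rank_of_value = {value: pos + 1 for pos, value in enumerate(ordered)}
    return {cand: rank_of_value[score] for cand, score in scores.items()}


def combine(evidence, weights):
    """Weighted mean rank over all evidence types, best candidate first.

    Only candidates with a non-zero score in at least one evidence type
    are reported; a candidate missing from an evidence type gets a rank
    one worse than the worst observed rank.
    """
    reported = set()
    for scores in evidence.values():
        reported.update(c for c, s in scores.items() if s > 0)

    totals = {cand: 0.0 for cand in reported}
    for name, scores in evidence.items():
        ranks = rank_scores(scores)
        worst = max(ranks.values(), default=0) + 1
        for cand in reported:
            totals[cand] += weights[name] * ranks.get(cand, worst)

    weight_sum = sum(weights.values())
    combined = [(cand, total / weight_sum) for cand, total in totals.items()]
    return sorted(combined, key=lambda item: (item[1], item[0]))


# Hypothetical association counts per evidence type.
evidence = {
    "interactions": {"geneA": 3, "geneB": 1, "geneC": 0},
    "pathways": {"geneA": 1, "geneB": 2, "geneC": 0},
}
weights = {"interactions": 2.0, "pathways": 1.0}

ranking = combine(evidence, weights)
for candidate, mean_rank in ranking:
    print(candidate, round(mean_rank, 3))
```

In this toy input, geneC has no association in any evidence type and is therefore omitted, mirroring the behavior that only candidates with some association to the knowledge panels are reported.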
Two additional files can be generated to list further details on the prioritization process and the associations found (if you are using the Galaxy interface, these files are generated automatically). The "details" file (requested with the --output-file-details parameter) reports not only the genes and their overall scores, but also the number of protein interactions between the candidate and a panel/seed gene, the number of shared protein complexes, etc. As for the main output, only candidates with at least one association to the knowledge panels are reported.
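The kind of count listed in the "details" file can be sketched as follows: for each candidate, count the distinct panel genes it interacts with. This is an illustrative guilt-by-association sketch, not the InteractionAnnotator; the function name, the identifiers, and the undirected pair representation of interactions are assumptions.

```python
def count_panel_interactions(candidates, panel, interactions):
    """Count distinct panel genes interacting with each candidate.

    interactions is an iterable of undirected (gene_a, gene_b) pairs.
    Candidates without any panel interaction are not reported.
    """
    panel_genes = set(panel)

    # Build an undirected neighbour lookup; duplicate pairs collapse.
    neighbours = {}
    for gene_a, gene_b in interactions:
        neighbours.setdefault(gene_a, set()).add(gene_b)
        neighbours.setdefault(gene_b, set()).add(gene_a)

    counts = {}
    for candidate in candidates:
        partners = neighbours.get(candidate, set())
        # Exclude the candidate itself in case it is also a panel gene.
        shared = partners & (panel_genes - {candidate})
        if shared:
            counts[candidate] = len(shared)
    return counts


# Hypothetical identifiers and interactions.
interactions = [
    ("cand1", "seedA"),
    ("cand1", "seedB"),
    ("seedA", "cand1"),  # same interaction reported in reverse order
    ("cand2", "other"),
]
counts = count_panel_interactions(
    ["cand1", "cand2"], ["seedA", "seedB"], interactions
)
print(counts)
```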
The even more comprehensive "full" file (requested with the --output-file-full parameter) makes it possible to trace each association back to its origin, e.g., by also listing the source database of an interaction and the PubMed identifier of the publication in which the interaction was reported. The "full" file also contains the individual ranks and weights that were used for computing the overall scores, and it lists results for all candidates.
The "details" and "full" files contain a number of additional columns. For details on these columns, please see the documentation of the respective modules, i.e., InteractionAnnotator - Retrieve protein-protein interactions and co-complex data (for protein interactions and complexes), ReactomeAnnotator - Retrieve curated Reactome data (for Reactome biochemical reactions), GOAnnotator - Access Gene Ontology Annotation (for functional annotation from the Gene Ontology), and GOFunSim - Computes similarity between pairs of proteins (for functional similarities).
[1] Aerts, S. et al. (2006) Gene prioritization through genomic data fusion. Nat Biotechnol. 24(5):537-44. [PMID 16680138]