Gene set enrichment analysis is an analytical method that aims to give a biological interpretation to a list of genes returned by an experimental method. In particular, the general question that researchers want to answer is whether the genes in the list share any biological significance (e.g. biological function, chromosomal location, or regulation). The most commonly used database for enrichment information is the Gene Ontology (GO) [1]. In this case, the features checked for enrichment are GO terms associated with proteins (the most prevalent type of gene product).

The goal of the GOEnricher module is to perform gene set enrichment analysis on a set of human (or other species) proteins specified by UniProt/Swiss-Prot accession numbers, using GO terms and statistics that account for multiple testing.

Input

Two mandatory input parameters are needed to perform the analysis in the GOEnricher module:

Gene set to be enriched: this is given as a tabular file (file name) and the column that holds the protein identifiers.
Type of enrichment analysis to be done: Gene Ontology provides three ontologies, BP, CC and MF, for biological process, cellular component, and molecular function, respectively. One of these must be chosen for enrichment analysis. The biological process (BP) ontology is a good starting point.

To obtain optimal results, a file should be supplied that contains the gene products that form the basis of the experiment (the background) [2]. Again, the file may contain a header, and a column number specifies the column in which the background identifiers are stored. If this option is not used, the background set consists of the annotated gene products available in the respective organism's annotation file/database.

Gene set enrichment is carried out by performing a Fisher's exact test. Three possible analysis types are available:

enrichment: considers the right tail of the hypergeometric distribution as its critical region ("greater") and tests for over-representation of terms in the gene set.
depletion: tests for under-representation of terms considering the left tail of the distribution ("less").
two-sided: tests whether terms are either over- or under-represented in the gene set.

(The default option is two-sided.)
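
To make the three analysis types concrete, the following is a minimal sketch of a Fisher's exact test on hypergeometric tails in plain Python. It is not GOEnricher's own implementation; the function names, argument names, and toy counts are assumptions made for illustration. The two-sided p-value here sums the probabilities of all outcomes that are no more likely than the observed one, which is one common convention and may differ from what the tool does.

```python
from math import comb


def hypergeom_pmf(k, population, annotated, drawn):
    """Probability of exactly k annotated proteins among `drawn` proteins."""
    return (comb(annotated, k) * comb(population - annotated, drawn - k)
            / comb(population, drawn))


def fisher_p_value(k, population, annotated, drawn, analysis="two-sided"):
    """p-value for seeing k annotated proteins in a gene set of size `drawn`.

    population: number of proteins in the background (hypothetical name)
    annotated:  background proteins annotated with the GO term
    analysis:   "enrichment", "depletion" or "two-sided"
    """
    lowest = max(0, drawn - (population - annotated))
    highest = min(drawn, annotated)
    if not lowest <= k <= highest:
        raise ValueError("k is outside the possible range")
    pmf = {i: hypergeom_pmf(i, population, annotated, drawn)
           for i in range(lowest, highest + 1)}
    if analysis == "enrichment":      # right tail ("greater")
        p = sum(prob for i, prob in pmf.items() if i >= k)
    elif analysis == "depletion":     # left tail ("less")
        p = sum(prob for i, prob in pmf.items() if i <= k)
    elif analysis == "two-sided":     # outcomes no more likely than observed
        cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
        p = sum(prob for prob in pmf.values() if prob <= cutoff)
    else:
        raise ValueError("unknown analysis type: " + analysis)
    return min(1.0, p)


# Toy numbers: 20 background proteins, 5 annotated, gene set of 4 with 3 hits.
for analysis in ("enrichment", "depletion", "two-sided"):
    print(analysis, round(fisher_p_value(3, 20, 5, 4, analysis), 4))
```

With these toy counts the observed overlap lies in the right tail, so the enrichment and two-sided p-values coincide while the depletion p-value is close to 1.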

It is possible to prune non-informative terms by keeping only terms whose corrected p-value is less than or equal to a selected threshold, which is set to 0.5 by default.

Fisher's exact test is applied to each GO term associated with any protein in the set. Running this many tests increases the odds that some term appears significant purely by chance. For this reason, Bonferroni and Benjamini-Hochberg [3] corrections are offered.
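
As an illustration of what the two corrections do to a list of raw p-values, here is a small sketch in plain Python. The function names and the example p-values are assumptions chosen for illustration and are not taken from GOEnricher; the number of tests is simply the length of the list.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the reported terms and therefore usually retains more terms.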

Since GO attaches evidence codes to its protein annotations, the tool allows filtering by these codes. Any combination of evidence codes can be chosen.

Clustering by information content or by distance from the root is also possible. In this case, any GO term with an information content/distance greater than or equal to the specified threshold is put into a cluster, and the least informative GO term (or, for distance clustering, the GO term closest to the root) is chosen as the cluster representative.
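
The distance criterion can be pictured with a short sketch. It assumes that "distance from root" means the smallest number of edges between the root and a term, which the text above does not specify; the toy ontology, its term names, and the function name are hypothetical. The sketch only finds the terms at or beyond a threshold and does not reproduce how the tool then groups them into clusters.

```python
from collections import deque


def distances_from_root(children, root):
    """Shortest number of edges from the root to every reachable term."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in distance:  # first visit is the shortest path
                distance[child] = distance[term] + 1
                queue.append(child)
    return distance


# Hypothetical toy ontology (parent -> children), not real GO terms.
toy_children = {
    "ROOT": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["E"],
}

depth = distances_from_root(toy_children, "ROOT")
threshold = 2
deep_terms = sorted(t for t, d in depth.items() if d >= threshold)
print(depth)
print(deep_terms)
```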

Options applicable to more than a single tool are summarized in common command line options.

Output

By default, for each protein in the input table the tool outputs all its associated terms with their corrected p-value (taking into account any p-value threshold set).

The tool reports either corrected p-values only or both uncorrected and corrected p-values. For archiving and reproducibility, the contingency table can be output for each GO term; each cell of the table is added as a separate column in the result table. It is also possible to collapse the output to a single line per input protein identifier, reporting only counts. For further information, the distance from the GO term to the root of the ontology can also be output. Finally, the tool's output can be detached from the input identifier system so that only GO terms and their p-values are output.
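
For readers unfamiliar with the contingency table behind Fisher's exact test, the sketch below builds the four cells for a single GO term from plain Python sets. The cell order, the function name, and the placeholder identifiers are assumptions for illustration; the column layout that GOEnricher actually writes is not shown here. The sketch also assumes that gene set proteins missing from the background are ignored.

```python
def contingency_table(gene_set, background, annotated):
    """Return the 2x2 table for one GO term as four counts.

    gene_set:   identifiers under study
    background: all identifiers that could have been observed
    annotated:  identifiers annotated with the GO term
    """
    background = set(background)
    gene_set = set(gene_set) & background    # assumption: ignore unknown IDs
    annotated = set(annotated) & background
    rest = background - gene_set
    return (
        len(gene_set & annotated),   # in gene set, annotated
        len(gene_set - annotated),   # in gene set, not annotated
        len(rest & annotated),       # rest of background, annotated
        len(rest - annotated),       # rest of background, not annotated
    )


# Placeholder identifiers, not real accession numbers.
background = {"P1", "P2", "P3", "P4", "P5", "P6"}
table = contingency_table({"P1", "P2"}, background, {"P1", "P3", "P4"})
print(table)
```

The four counts always sum to the size of the background, which is a quick sanity check when archiving result tables.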

For example, given the input file /tmp/simple.tsv,

UniProtAccNr
Q9BZZ5
Q9NQS1

the command for a two-sided test (enrichment or depletion) is

$ python GOEnricher.py -H -c1 --from-file /tmp/simple.tsv --enrichment-type BP

and will output the following lines:

UniProtAccNr      GO Term        p-value
Q9BZZ5            GO:2000270     8.25e-04
Q9BZZ5            GO:0043066     2.20e-03
Q9BZZ5            GO:0006915     1.09e-02
Q9NQS1            GO:0043066     2.20e-03
Q9NQS1            GO:0006915     1.09e-02

To test for enrichment only (a one-sided test, compared to the two-sided test above), the command

$ python GOEnricher.py -H -c1 --from-file /tmp/simple.tsv --enrichment-type BP --testing-hypothesis enrichment --p-value-adjustment bonferroni --nonadjusted-p-value

will output the following lines with the specified additional information:

UniProtAccNr      GO Term        p-value (bonferroni)    Raw p-value
Q9BZZ5            GO:2000270     1.73e-02                8.25e-04
Q9BZZ5            GO:0043066     4.62e-02                2.20e-03
Q9NQS1            GO:0043066     4.62e-02                2.20e-03

The alternative output focuses on the terms rather than on the gene products they are associated with. As mentioned above, the option --independent-output restricts the output to the terms and their p-values from the enrichment analysis.

The command

$ python GOEnricher.py -H -c1 --from-file /tmp/simple.tsv --enrichment-type BP --testing-hypothesis enrichment --p-value-adjustment bonferroni --nonadjusted-p-value --independent-output

this time will output the following lines:

GO Term        p-value (bonferroni)     Raw p-value
GO:0043066     4.62e-02                 2.20e-03
GO:2000270     1.73e-02                 8.25e-04

References

[1] Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA (2003) Global functional profiling of gene expression. Genomics 81:98-104. [PMID 12620386]
[2] Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37:1-13. [PMID 19033363]
[3] Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57:289-300. [JSTOR 2346101]