Data Integrator
GOAnnotator - Access Gene Ontology Annotation

The GOAnnotator provides access to information stored in the official Gene Ontology (GO) MySQL database. The GO consortium was initiated around the year 2000 with the goal of providing a structured vocabulary for the functional annotation of genes and their products. GO is divided into three categories, describing different aspects of gene function:

  • biological process (BP)
  • molecular function (MF)
  • cellular component (CC)

Each ontology is implemented as a directed acyclic graph (DAG) in which terms may have multiple parents and multiple descendants. The individual ontologies are largely disjoint, although some terms exist that are descendants of both BP and MF. The root node of each category has the name of the category (i.e., BP, MF, or CC). GO is constantly updated: new terms may be added and old terms removed between releases. The GO consortium provides not only the ontology terms and their hierarchy but also annotations of genes and their products with GO terms. The full GO database (terms plus protein sequences and their annotations) is released monthly. As of early 2014, the full release requires around 100 GB of MySQL storage space. However, the GOAnnotator also supports access to the public MySQL database hosted at the EBI. More information can be found at http://www.geneontology.org/

Input

The GOAnnotator is a command-line tool that operates on tabular flat files. It was developed for two use cases in particular: the retrieval of GO terms (e.g., terms that contain a certain query string, or descendants of query terms) and the annotation of gene products with GO terms.

The particular task is defined by its input (option --in) and can be further refined by the respective command-line parameters, which are described in more detail below:

  1. --query-term: Input is a string, retrieve GO terms that contain the query string.
  2. --export-graph: No input, export the full or partial GO graph as a GraphML XML document.
  3. --export-annotations: No input, export all annotations of UniProtKB proteins with GO terms in GO Annotation File (GAF) 2.0 format.
  4. --in go: Input consists of GO terms, retrieve descendants of these terms and/or proteins annotated to them.
  5. --in uniprot: Input consists of UniProtKB accession numbers, retrieve GO annotations for these proteins.

The tabular input for the last two options is given via the --from-file parameter, which names the input file (or the standard input via --from-file - <<<'Something'), and the -c parameter, which defines the column in the input file/stream (numbering starts at 1). If the input has a header line, this can be indicated via the -H option.
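
As a rough illustration of how -c and -H relate to the input, the following Python sketch picks a 1-based column out of a tabular stream and optionally skips a header line. It is not the GOAnnotator's own code: the function name read_column is hypothetical, and the sketch assumes that fields are separated by tabs.

```python
import csv
import io


def read_column(stream, column, has_header=False):
    """Return the values of a 1-based column from a tab-separated stream.

    Hypothetical helper for illustration; assumes tab-separated fields.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line, as -H would indicate
    # Column numbering starts at 1, so subtract 1 for Python indexing.
    return [row[column - 1] for row in reader if len(row) >= column]


example = io.StringIO(
    "GO ID\tGO name\n"
    "GO:0061083\tregulation of protein refolding\n"
    "GO:0071443\ttDNA binding\n"
)
ids = read_column(example, column=1, has_header=True)
```

Here ids holds the two GO identifiers from the first column, with the header line left out.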

The option --query-term can be used to retrieve GO terms that contain a certain query string, e.g., "heart". If multiple search terms are provided, separated by spaces (e.g., --query-term heart rate), they are by default combined to further restrict the search (logical "AND", i.e., retrieve all GO terms that contain both "heart" and "rate"). This default behaviour can be changed by adding the --combine-terms option with the value "OR", which returns all terms that contain either "heart" or "rate". The search space can be restricted further with the --limit-term-category option and any combination of the values "BP", "CC", or "MF", which limits the search to the categories provided.
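
The AND/OR combination described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the tool's implementation: the function name matches is hypothetical, and case-insensitive substring matching is an assumption, since the matching rules are not specified here.

```python
def matches(term_name, query_words, combine="AND"):
    """Check whether a GO term name satisfies a multi-word query.

    Hypothetical helper; assumes case-insensitive substring matching.
    """
    name = term_name.lower()
    hits = [word.lower() in name for word in query_words]
    # "AND" requires every word to occur, "OR" requires at least one.
    return all(hits) if combine == "AND" else any(hits)


both = matches("regulation of heart rate", ["heart", "rate"])
only_one_and = matches("heart development", ["heart", "rate"])
only_one_or = matches("heart development", ["heart", "rate"], combine="OR")
```

With these inputs, the first name satisfies the AND query, while "heart development" is accepted only under OR.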

The option --export-graph can be used to export the full GO graph or parts thereof as a GraphML document (see http://graphml.graphdrawing.org/). The options --limit-term-relationship and --limit-term-category can be added to limit the exported graph to certain relationship types (is_a, part_of, regulates, positively_regulates, negatively_regulates) or GO categories (BP, CC, or MF). If no further parameter is provided, the complete graph comprising all categories will be exported.
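
To give an idea of what a GraphML document with typed edges looks like, the sketch below builds one with Python's standard library. The exact layout of the tool's export is not documented here, so several things are assumptions: the key id "rel", the attribute name "relationship", the direction of edges (child to parent), and the placeholder node names GO:A, GO:B, and GO:C. The function name to_graphml is hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(edges):
    """Serialize (child, parent, relationship) triples as a GraphML string."""
    edges = list(edges)
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare an edge attribute that carries the relationship type.
    ET.SubElement(root, "key", {
        "id": "rel",
        "for": "edge",
        "attr.name": "relationship",
        "attr.type": "string",
    })
    graph = ET.SubElement(root, "graph", id="GO", edgedefault="directed")
    nodes = sorted({n for child, parent, _ in edges for n in (child, parent)})
    for node in nodes:
        ET.SubElement(graph, "node", id=node)
    for child, parent, rel in edges:
        edge = ET.SubElement(graph, "edge", source=child, target=parent)
        ET.SubElement(edge, "data", key="rel").text = rel
    return ET.tostring(root, encoding="unicode")


# Placeholder terms, not real GO identifiers.
doc = to_graphml([("GO:B", "GO:A", "is_a"), ("GO:C", "GO:A", "part_of")])
```

Filtering the edge list by relationship type before serializing would mimic the effect of limiting the export to certain relationships.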

The option --export-annotations will export all GO annotations of UniProtKB proteins in GO Annotation File (GAF) 2.0 format (see http://geneontology.org/page/go-annotation-file-gaf-format-20). The output can be limited to certain categories by using the --limit-term-category option. If no parameter is provided, the full annotation file will be exported.
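
A GAF 2.0 export can be post-processed with a small parser. The sketch below relies on the GAF 2.0 column layout (17 tab-separated columns; column 2 is the object ID, column 5 the GO ID, column 7 the evidence code, column 9 the aspect P, F, or C; comment lines start with "!"). The function name parse_gaf and the sample record are made up for illustration; the record does not describe a real annotation.

```python
# GAF aspect codes mapped to the category abbreviations used in this document.
ASPECT_TO_CATEGORY = {"P": "BP", "F": "MF", "C": "CC"}


def parse_gaf(lines, categories=None):
    """Yield one dict per GAF 2.0 annotation line, optionally filtered by category."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        category = ASPECT_TO_CATEGORY.get(cols[8])
        if categories and category not in categories:
            continue
        yield {
            "accession": cols[1],
            "go_id": cols[4],
            "evidence": cols[6],
            "category": category,
        }


# Made-up record with 17 columns, for illustration only.
sample = "\t".join([
    "UniProtKB", "P12345", "SYMB", "", "GO:0005747", "PMID:0000000", "IDA",
    "", "C", "", "", "protein", "taxon:9606", "20140101", "UniProt", "", "",
])
records = list(parse_gaf(["!gaf-version: 2.0\n", sample + "\n"]))
bp_only = list(parse_gaf([sample + "\n"], categories={"BP"}))
```

The categories argument plays the role that --limit-term-category plays for the export itself.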

If the input consists of GO identifiers (option --in go), the tool will, depending on the additional parameters, determine descendant terms and/or the proteins annotated with them.

By default, the GOAnnotator used with GO input will always include the query term, as this is a descendant of itself with the graph distance 0. The options --min-distance and --max-distance, both requiring a non-negative integer value, allow for limiting the search depth in the GO graph. --min-distance x will require at least x relations between the query node and the output. For instance, --min-distance 1 will not report the query node and will thus only report proper descendants. --max-distance can be used to limit the maximum graph distance of a term from the query node. A value of 0 would report only the query GO term, a value of 1 would add only its direct descendants, and so on.
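
The effect of the two distance limits can be illustrated with a breadth-first search over a toy graph. This is a sketch under assumptions, not the tool's algorithm: the function name descendants_within is hypothetical, the graph is an invented parent-to-children mapping, and the shortest path is used as the distance (how the tool treats terms reachable over paths of different lengths is not specified here).

```python
from collections import deque


def descendants_within(children, query, min_distance=0, max_distance=None):
    """Return {term: shortest distance} for descendants within the given limits."""
    dist = {query: 0}  # the query term is its own descendant at distance 0
    queue = deque([query])
    while queue:
        term = queue.popleft()
        if max_distance is not None and dist[term] >= max_distance:
            continue  # do not expand beyond the maximum distance
        for child in children.get(term, ()):
            if child not in dist:
                dist[child] = dist[term] + 1
                queue.append(child)
    return {t: d for t, d in dist.items() if d >= min_distance}


# Invented toy DAG: "c" has two parents, as GO terms may have.
toy = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
everything = descendants_within(toy, "root")
query_only = descendants_within(toy, "root", max_distance=0)
children_only = descendants_within(toy, "root", min_distance=1, max_distance=1)
```

In this toy graph, a maximum distance of 0 yields only the query term, and combining a minimum and maximum distance of 1 yields exactly the direct children.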

For each of the reported descendant GO terms, different information can be reported by the tool. The option --term-id will add a column with the GO identifier, --term-name will add a column with the term name, and --term-category will add a column with the GO category of the term.

The option --relationship will add a column listing the relationship type to the query term, e.g., "is_a" or "part_of". For more details on GO relationships, see http://www.geneontology.org/GO.ontology.relations.shtml. Note that the tool only lists the last relationship in the graph path from query to output term, not the complete path.

The option --distance will add two columns giving the shortest and the longest path between two terms in the GO graph (different paths may exist due to the DAG structure of the GO graph and the different relationship types between terms). If the input consists of GO identifiers for which descendant terms are requested (--in go), the distance specifies the path from the query term to the output term. In the case of UniProtKB input (--in uniprot), for which GO annotations are retrieved, the distance represents the path from the GO term to the respective root node, e.g., the biological process root in the BP category.
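
Why a shortest and a longest path can differ is easiest to see on a small DAG. The following sketch computes both values by memoized recursion; it terminates because the graph is acyclic. The function name path_lengths and the toy graph are invented for illustration and do not reflect how the tool computes its distances.

```python
from functools import lru_cache


def path_lengths(children, source, target):
    """Return (shortest, longest) number of edges from source down to target.

    Returns None if target is not reachable from source. Requires a DAG.
    """
    @lru_cache(maxsize=None)
    def walk(term):
        if term == target:
            return (0, 0)
        found = [walk(child) for child in children.get(term, ())]
        found = [lengths for lengths in found if lengths is not None]
        if not found:
            return None
        # One extra edge on top of the best and worst continuation.
        return (1 + min(f[0] for f in found), 1 + max(f[1] for f in found))

    return walk(source)


# Invented toy DAG: "c" is reachable directly and via "a".
toy = {"root": ["a", "c"], "a": ["c"], "c": []}
lengths = path_lengths(toy, "root", "c")
```

Here "c" is one edge away from "root" on the direct path and two edges away via "a", so the two reported distances would differ.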

The option --ic will add the information content (IC) of the GO terms. The IC indicates how informative a GO term is, given the number of proteins that are annotated to it. Terms very high in the GO hierarchy are annotated to many proteins (directly or via their children) and have a low IC (e.g., 'regulation of biological process' with an IC < 1), whereas very specific terms are annotated to few proteins and have a high IC (e.g., 'regulation of carbohydrate utilization' with an IC > 7). The IC can also be used to restrict the results returned to the user via the option --threshold. If this option is used, the number provided will be used as a cutoff, and only terms with an IC greater than or equal to that number will be returned. The information content can only be used if exactly one GO category is provided via the --limit-term-category option described below.
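
The exact IC formula used by the tool is not given here. A common definition, assumed in the sketch below, is the negative natural logarithm of the fraction of proteins annotated to a term (directly or via its descendants). The function name information_content, the term labels, and all counts are invented for illustration.

```python
import math


def information_content(annotated, total):
    """IC under the assumed definition -ln(annotated / total)."""
    if annotated <= 0 or total <= 0:
        raise ValueError("counts must be positive")
    return -math.log(annotated / total)


# Invented counts: a broad term covers half of all proteins, a specific one very few.
ics = {
    "broad term": information_content(5000, 10000),
    "specific term": information_content(5, 10000),
}
# Keep terms whose IC is greater than or equal to the cutoff, as --threshold does.
threshold = 7.0
kept = {term: ic for term, ic in ics.items() if ic >= threshold}
```

Under this assumed definition, the broad term ends up with an IC below 1 and the specific term with an IC above 7, which is the same order of magnitude as the examples in the text.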

By using the option --uniprot, a column with the UniProtKB accession numbers of the proteins annotated with a GO term will be added. By default, the GOAnnotator will only report manually curated Swiss-Prot proteins and will omit unreviewed TrEMBL entries. If the option --trembl is added, the tool will report all UniProtKB proteins. Generally, GO only uses UniProtKB accession numbers when referring to gene products. Gene or protein identifiers in other identifier systems like Ensembl or RefSeq have to be converted before/after using the GOAnnotator.

If the input for the GOAnnotator consists of UniProtKB accession numbers, these will be annotated with GO terms. Additional information on the GO terms can be requested with the options --term-id (adds a column with the GO identifier), --term-name (adds a column with the term name), --term-category (adds a column with the GO category of the term), and --distance (adds columns with the distance of the term to the ontology root).

The associations between GO terms and UniProtKB entries belong to different evidence classes, broadly divided into computational and manual annotation (see http://www.geneontology.org/GO.evidence.shtml for more detail). With the option --manual-only, term-protein associations that have been determined using computational inference (evidence code IEA) will not be considered. If the option is not used, all associations will be reported.
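
The filtering by evidence class amounts to dropping associations with the evidence code IEA. A minimal Python sketch, with a hypothetical function name and made-up association tuples of the form (accession, GO id, evidence code):

```python
# Evidence codes treated as computational inference; the text names IEA.
COMPUTATIONAL_CODES = {"IEA"}


def filter_evidence(associations, manual_only=False):
    """Drop computationally inferred associations if manual_only is set."""
    if not manual_only:
        return list(associations)
    return [a for a in associations if a[2] not in COMPUTATIONAL_CODES]


# Made-up associations for illustration only.
associations = [
    ("P12345", "GO:0005747", "IDA"),
    ("P12345", "GO:0071443", "IEA"),
]
curated = filter_evidence(associations, manual_only=True)
```

Only the association carrying a manually assigned evidence code remains in curated.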

The option --limit-term-category, followed by any combination of the three categories BP, MF, and CC, can be used to limit the annotation of gene products to terms originating from the respective categories.

The parameter --evidence will add an additional column with the GO evidence code for the association of gene product and GO term. See http://www.geneontology.org/GO.evidence.shtml for more detail.

Both input options, GO identifiers and UniProtKB accession numbers, support the use of panel files to restrict the annotation to certain terms or proteins. If the input consists of GO ids for which descendant terms and/or annotated proteins are requested, the panel file may contain UniProtKB accession numbers. In this case, only associations between terms and those proteins contained in the panel file will be reported. If the input consists of UniProtKB accession numbers that should be annotated with GO terms, the panel file may contain GO term identifiers. In this case, only those terms or descendants thereof will be reported, provided they are annotated to any of the input proteins. In both cases, the panel file is defined via the --panel-file parameter. The parameter --panel-file-H can be added to indicate that the panel file has a header line that should be skipped. If the panel file contains multiple columns, the column containing the UniProtKB accession numbers or the GO identifiers is indicated using the --panel-file-c parameter (as with -c, numbering starts at 1).
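
The two panel modes can be sketched as two small filters over (term, accession) pairs. The function names, the descendants mapping, and all identifiers below are placeholders invented for this example; the sketch only mirrors the behaviour described above and is not taken from the tool.

```python
def filter_by_protein_panel(associations, panel_accessions):
    """GO input with a protein panel: keep pairs whose protein is in the panel."""
    panel = set(panel_accessions)
    return [(term, acc) for term, acc in associations if acc in panel]


def filter_by_term_panel(associations, panel_terms, descendants):
    """UniProtKB input with a GO panel: keep panel terms and their descendants."""
    allowed = set()
    for term in panel_terms:
        allowed.add(term)
        allowed.update(descendants.get(term, ()))
    return [(term, acc) for term, acc in associations if term in allowed]


# Placeholder data: GO:B is a descendant of GO:A, GO:X is unrelated.
associations = [("GO:A", "P1"), ("GO:B", "P2"), ("GO:X", "P1")]
descendants = {"GO:A": ["GO:B"]}
by_protein = filter_by_protein_panel(associations, ["P1"])
by_term = filter_by_term_panel(associations, ["GO:A"], descendants)
```

With this data, the protein panel keeps both pairs involving P1, and the term panel keeps GO:A together with its descendant GO:B.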

By using the --count parameter, a column summarizing the number of distinct results will be added.

The option --header-suffix can be used to append a particular string to the header of each new column. This can be useful if the tool is run multiple times, for instance, with different panel files.

Additional options applicable to more than a single tool (such as --help/-h, --data-version, --version, --empty-cell, --permissiveness) are summarized in common command line options.

Output

If the --query-term option has been used, the GOAnnotator will report each matching GO term with its id, name, and category. A header line will be reported whenever results have been found. If no matching terms have been found, the output will be empty.

The options --export-graph and --export-annotations will return output formatted as GraphML XML or GO Annotation File (GAF) 2.0, respectively.

If the tool is requested to return GO terms or UniProtKB accession numbers, each result will be returned on a new line, where the number of results depends on the particular options that have been chosen. By default, no new columns will be added to the input file; the user has to request them explicitly via parameters.

Based on the selection, the output will contain additional columns labeled:

  • UniProtKB Acc (option --uniprot)
  • GO term ID (option --term-id)
  • GO term name (option --term-name)
  • GO category (option --term-category)
  • Graph distance min, Graph distance max (option --distance)
  • Relationship type (option --relationship)
  • Evidence code (option --evidence)
  • Information content (option --ic)
  • Count (option --count)

If the option --header-suffix is used, the string provided to this option will be appended to the header of each new column.

If the --collapse option has been set, the output will be condensed into the input line, i.e., no additional lines will be added to the input. If this results in multiple values, they will be joined by a "|". Note that this can potentially result in very lengthy lines.

If the option --clean is used, the output will not be appended to the input lines, but will be reported individually. This can be useful, for instance, if all the non-redundant annotations of a set of UniProtKB proteins should be retrieved without reporting GO terms multiple times.
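
The difference between the default output, --collapse, and --clean can be made concrete with a short Python sketch. The three function names are hypothetical, and the handling of input lines without results (filled with "--", as in the example output further below) is an assumption for the collapsed case.

```python
def expand(rows, results, empty_cell="--"):
    """Default layout: one output line per result, appended to the input line."""
    out = []
    for row in rows:
        values = results.get(row[0]) or [empty_cell]
        out.extend(row + [value] for value in values)
    return out


def collapse(rows, results, empty_cell="--"):
    """Collapsed layout: one output line per input line, values joined by '|'."""
    return [row + ["|".join(results.get(row[0]) or [empty_cell])] for row in rows]


def clean(rows, results):
    """Clean layout: distinct results only, without the input lines."""
    found = [value for row in rows for value in results.get(row[0], [])]
    return list(dict.fromkeys(found))  # de-duplicate, keeping first-seen order


rows = [["GO:0061083"], ["GO:0000000"]]
results = {"GO:0061083": ["GO:0061083", "GO:0061084"]}
expanded = expand(rows, results)
collapsed = collapse(rows, results)
distinct = clean(rows, results)
```

The expanded form repeats the input line once per result, the collapsed form keeps one line per input and joins the values, and the clean form returns only the distinct results.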

Examples

The following examples illustrate different use cases and application possibilities for the GOAnnotator.

  • python GOAnnotator.py --query-term DNA complex --combine-terms AND --limit-term-category CC: Find all GO terms of the cellular component (CC) category that contain both the terms DNA and complex.

  • python GOAnnotator.py --in uniprot -c 3 -H --from-file filein.tsv --term-id --term-name --panel-file panelfile.tsv --panel-file-c 1 --panel-file-H: For all the UniProtKB accession numbers in column 3 of the input file "filein.tsv", get all annotated GO terms, limiting the output to those terms that are listed, directly or indirectly (via their parent terms), in column 1 of the panel file "panelfile.tsv". Both the panel and the input file have a header line.

  • python GOAnnotator.py --query-term heart rate receptor signaling --limit-term-category BP | python GOAnnotator.py -H -c 1 --in go --uniprot --manual-only --evidence --from-file -: First use the tool to determine all GO biological process terms that contain all four terms heart, rate, receptor, and signaling. Pipe the results into a second run of the tool, this time adding the UniProtKB Swiss-Prot accession numbers of all proteins annotated to these terms and their descendants. Only report associations that have been confirmed by a human curator, and report the evidence code.

  • Given the file /tmp/go-in.tsv,

    GO ID      GO name                                               GO category
    GO:0061083 regulation of protein refolding                       BP
    GO:0051085 chaperone mediated protein folding requiring cofactor BP
    GO:0005747 mitochondrial respiratory chain complex I             CC
    GO:0071443 tDNA binding                                          MF
    GO:0000000 invalid                                               invalid
    GO:asasasa --                                                    --
    

    the following command line

    python GOAnnotator.py --term-id --in go -c 1 -H --from-file /tmp/go-in.tsv --header-suffix " (out)"
    

    will result in this output:

    GO ID      GO name                                               GO category GO term ID (out)
    GO:0061083 regulation of protein refolding                       BP          GO:0061083
    GO:0061083 regulation of protein refolding                       BP          GO:0061084
    GO:0051085 chaperone mediated protein folding requiring cofactor BP          GO:0051085
    GO:0005747 mitochondrial respiratory chain complex I             CC          GO:0005747
    GO:0005747 mitochondrial respiratory chain complex I             CC          GO:0042652
    GO:0005747 mitochondrial respiratory chain complex I             CC          GO:0042653
    GO:0071443 tDNA binding                                          MF          GO:0071443
    GO:0000000 invalid                                               invalid     --
    GO:asasasa --                                                    --          --
    
