Data Integrator
GOGraphBuilder - Generation of graphs furnished with annotation data

The GOGraphBuilder tool generates a graph (a data structure consisting of nodes and edges) furnished with annotation data and writes it in GraphML, an XML-based file format for graph representation that is easy to read. Node and edge information is retrieved either from an original GraphML file that contains only the Gene Ontology (GO) term ID and name, or from an OBO file. Both the GraphML and the OBO file contain terms that describe gene product properties under the three Gene Ontologies (also known as domains):

  • Cellular component (CC): the parts of a cell or its extracellular environment
  • Molecular function (MF): the elemental activities of a gene product at the molecular level
  • Biological process (BP): operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units

The OBO file can be downloaded directly from the Gene Ontology web site. OBO is the text file format used to view and edit gene ontologies. The file consists of a header and a series of stanzas. There are three types of stanza, each indicated in square brackets on the first line of the stanza:

  • Term
  • Typedef
  • Instance

The header and the stanzas contain fields of the form tag: value. The Term stanza contains the GO terms with their descriptions and relations. It is the only stanza type considered by this tool.
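The stanza layout described above can be read with a few lines of Python. This is a minimal sketch, not the tool's own parser: the function name parse_obo_terms and the returned dictionary layout are assumptions made for illustration. It collects only [Term] stanzas and keeps the id, name, is_a and relationship tags.

```python
def parse_obo_terms(text):
    """Return one dict per [Term] stanza found in OBO-formatted text."""
    terms = []
    current = None  # dict of the [Term] stanza being read, otherwise None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment line
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts here; only [Term] stanzas are collected.
            if line == "[Term]":
                current = {"id": None, "name": None, "edges": []}
                terms.append(current)
            else:
                current = None
            continue
        if current is None:
            continue  # header field, or field of a [Typedef]/[Instance] stanza
        tag, _, value = line.partition(":")
        tag = tag.strip()
        value = value.split(" ! ")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            current["id"] = value
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["edges"].append(("is_a", value))
        elif tag == "relationship":
            # The value has the form "<relationship type> <target term id>".
            rel_type, target = value.split()[:2]
            current["edges"].append((rel_type, target))
    return terms
```

Each returned entry holds the term ID, the term name, and a list of (relationship type, target term ID) pairs, which is the information needed to create the nodes and edges of the graph.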

The GraphML file, which can also be used as input, is an internally used format obtained by running the –export-graph option of the GOAnnotator - Access Gene Ontology Annotation module. This original file consists only of nodes representing the GO terms, with the attributes

  • GO Term identifier called t_id
  • GO Term name called name

and edges representing the relations between GO terms, each of which carries the relationship attribute. Five types of relationship link GO terms:

  • is_a
  • part_of
  • regulates
  • positively_regulates
  • negatively_regulates
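To make the layout of such a file concrete, the following sketch writes a small GraphML document with the t_id and name node attributes and the relationship edge attribute, using only the Python standard library. The key identifiers (k_tid, k_name, k_rel), the graph id and the function name build_graphml are assumptions chosen for illustration. The files actually exported by GOAnnotator may declare their keys differently.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"  # standard GraphML namespace


def build_graphml(nodes, edges):
    """nodes: {node_id: (t_id, name)}; edges: [(source, target, relationship)]."""
    ET.register_namespace("", NS)  # serialize GraphML as the default namespace
    root = ET.Element(f"{{{NS}}}graphml")
    # Declare the attribute keys before they are referenced (ids are arbitrary).
    for key_id, domain, attr in [("k_tid", "node", "t_id"),
                                 ("k_name", "node", "name"),
                                 ("k_rel", "edge", "relationship")]:
        ET.SubElement(root, f"{{{NS}}}key",
                      {"id": key_id, "for": domain,
                       "attr.name": attr, "attr.type": "string"})
    graph = ET.SubElement(root, f"{{{NS}}}graph",
                          {"id": "G", "edgedefault": "directed"})
    for node_id, (t_id, name) in nodes.items():
        node = ET.SubElement(graph, f"{{{NS}}}node", {"id": node_id})
        ET.SubElement(node, f"{{{NS}}}data", {"key": "k_tid"}).text = t_id
        ET.SubElement(node, f"{{{NS}}}data", {"key": "k_name"}).text = name
    for source, target, relationship in edges:
        edge = ET.SubElement(graph, f"{{{NS}}}edge",
                             {"source": source, "target": target})
        ET.SubElement(edge, f"{{{NS}}}data", {"key": "k_rel"}).text = relationship
    return ET.tostring(root, encoding="unicode")
```

The edge direction is carried by the source and target attributes, which refer to node IDs rather than to GO term IDs.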

When the tool runs, two values are calculated for each node from the annotation file provided (a GO annotation mapping available through the GO web site): count, the number of gene products annotated with the term in the database, and frequency, the term count including the contribution of more specialized terms (i.e. child terms).
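One way to compute the two values is sketched below. It is an illustration under stated assumptions, not the tool's implementation: the function name count_and_frequency and its input layout are hypothetical, "child terms" is taken to mean all descendants, and a gene product annotated with several descendants is counted once. If the tool sums counts instead, its numbers will differ on graphs where a term has more than one parent.

```python
from collections import defaultdict


def count_and_frequency(parents, annotations):
    """
    parents:     {term: set of direct parent terms}
    annotations: iterable of (gene_product, term) pairs
    Returns two dicts keyed by term: (cnt, freq).
    """
    # Gene products annotated directly with each term.
    direct = defaultdict(set)
    for product, term in annotations:
        direct[term].add(product)

    # Invert the parent map to obtain the direct children of each term.
    children = defaultdict(set)
    terms = set(parents) | set(direct)
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(term)
            terms.add(parent)

    collected = {}  # term -> products of the term and of all its descendants

    def collect(term):
        if term not in collected:
            products = set(direct.get(term, ()))
            for child in children.get(term, ()):
                products |= collect(child)
            collected[term] = products
        return collected[term]

    cnt = {term: len(direct.get(term, ())) for term in terms}
    freq = {term: len(collect(term)) for term in terms}
    return cnt, freq
```

For a term without children, cnt and freq are equal. For any other term, freq is at least as large as cnt.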

Input

GOGraphBuilder is a command line tool that takes two inputs. The first is the Gene Ontology (GO) file containing the GO terms and related information, given by option –ontology-graph-file. Its type, given by option –ontology-graph-type, can be either a GraphML file or a GO OBO file. In the GraphML file, nodes store the GO ID in the t_id field and the term name in the name field. Edges encode direction by referencing the node IDs of their source and target, and the relationship field gives the type of relationship between the two nodes. If the option –remove-edges is supplied, edges of the specified relationship type are removed. The option –ontology specifies the GO ontology (CC, MF or BP). This option also implies extraction of data items from the second input, the GO annotation file, which must be in GAF format and is specified by option –annotation-file. Annotation filtering (–annotation-filter-set) currently offers a single choice, which is to exclude electronically inferred annotations (IEA). More specialized filters can only be added by altering the source code.
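The effect of selecting an ontology and excluding IEA annotations can be illustrated on the GAF format itself. The sketch below is an assumption-laden example, not the tool's code: the function name read_gaf and the mapping from CC, MF and BP to the GAF aspect codes C, F and P are introduced here for illustration. It relies on the GAF column order, in which column 2 is the gene product ID, column 5 the GO ID, column 7 the evidence code and column 9 the aspect.

```python
# GAF stores the ontology as a one-letter aspect code in column 9.
ASPECT_CODES = {"BP": "P", "MF": "F", "CC": "C"}


def read_gaf(lines, ontology, exclude_iea=True):
    """Return (gene_product_id, go_id) pairs for one ontology."""
    aspect = ASPECT_CODES[ontology]
    pairs = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # comment/header line or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed row, skipped in this sketch
        product, go_id, evidence, row_aspect = cols[1], cols[4], cols[6], cols[8]
        if row_aspect != aspect:
            continue  # annotation belongs to another ontology
        if exclude_iea and evidence == "IEA":
            continue  # drop electronically inferred annotations
        pairs.append((product, go_id))
    return pairs
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example read_gaf(open(path), "BP") where path is the location of the annotation file.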

Options that apply to more than one tool are summarized in common command line options.

Output

The tool returns a GraphML file whose nodes represent the GO terms. Every node contains the following attributes:

  • t_id: GO term identifier
  • name: GO term name
  • cnt: Number of gene products annotated with the term in the database
  • freq: Number of gene products annotated in the database with the term or with any of its children

Every edge represents the relationship between two GO terms, as given by the edge attribute relationship.
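A file of this shape can be read back with the Python standard library. The following sketch assumes that the output declares the standard GraphML namespace and declares its attributes through key elements with an attr.name. The function name read_node_attributes is hypothetical. All values are returned as strings, so cnt and freq still need converting with int().

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_node_attributes(graphml_text):
    """Return {node_id: {attribute name: value}} for every node."""
    root = ET.fromstring(graphml_text)
    # Map each declared key id to its attribute name.
    key_names = {key.get("id"): key.get("attr.name")
                 for key in root.findall("g:key", GRAPHML_NS)}
    nodes = {}
    for node in root.findall(".//g:node", GRAPHML_NS):
        attributes = {}
        for data in node.findall("g:data", GRAPHML_NS):
            key_id = data.get("key")
            # Fall back to the raw key id if the key was not declared.
            attributes[key_names.get(key_id, key_id)] = data.text
        nodes[node.get("id")] = attributes
    return nodes
```

Resolving attribute names through the key declarations avoids hard-coding key identifiers, which are arbitrary in GraphML.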