Data Integrator
GOFunSim - Computes similarity between pairs of proteins.

The functional similarity module, GOFunSim, builds upon the populated gene ontology (GO) graph, which consists of nodes with the attributes t_id (GO term identifier), name (GO term name), cnt and freq (computed values for the node's term count and the term count contributed by more specialized terms, i.e. child terms), and edges representing the GO term relationships.

This tool computes various information-content-based measures, such as semantic similarity measures, for pairs of GO terms (or proteins) and combines them with functional similarity measures. The semantic similarity measures always depend on an ontology, which can be one of:

  • cellular component,
  • molecular function,
  • biological process.

Semantic similarity measures quantify the similarity between two or more terms within an ontology [1]. Most of them are based on the MICA (maximum informative common ancestor) approach, which selects the common ancestor of two terms with the highest information content I(c),

\[I(c) := - \log p(c)\]

where p(c) is the probability that term c occurs.
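As a minimal sketch of this definition: the text does not state how p(c) is estimated or which logarithm base is used, so the snippet below assumes p(c) is the term's cumulative annotation frequency divided by that of the ontology root, and uses the natural logarithm. The function name and the counts are hypothetical.

```python
import math

def information_content(freq, root_freq):
    """I(c) = -log p(c), with p(c) assumed to be freq / root_freq."""
    if freq <= 0 or root_freq <= 0 or freq > root_freq:
        raise ValueError("expected 0 < freq <= root_freq")
    # -log(freq / root_freq) written as log(root_freq / freq)
    return math.log(root_freq / freq)

# Hypothetical cumulative counts; the root covers every annotation.
root_freq = 1000
print(information_content(1000, root_freq))           # 0.0 (the root is uninformative)
print(round(information_content(10, root_freq), 2))   # 4.61 (rare terms score high)
```

Rare, specific terms therefore receive a high information content, while terms close to the root receive a value near zero.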

Resnik [2] applies this concept directly in his formula for the semantic similarity between two terms c1 and c2,

\[sim_\mathrm{Resnik}(c_1,c_2) = \max_{c \in S(c_1,c_2)} I(c),\]

where S(c1, c2) is the set of all common ancestors of the two terms.
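The MICA selection can be illustrated on a toy graph. This is only a sketch under stated assumptions: the term names, parent edges and information content values are invented, and a term is treated as a member of its own ancestor set.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent edges."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def resnik(c1, c2, parents, ic):
    """Highest information content among the common ancestors of c1 and c2."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max(ic[c] for c in common)

# Hypothetical DAG (child -> parents) and hypothetical I(c) values.
parents = {"root": [], "a": ["root"], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
ic = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 2.5, "d": 4.0}

print(resnik("b", "c", parents, ic))  # 1.0, the MICA is "a"
print(resnik("d", "c", parents, ic))  # 2.5, the MICA is "c" itself
```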

Lin's measure [3] also considers the information content and the common ancestors of the terms in order to describe the commonality between them:

\[sim_\mathrm{Lin}(c_1,c_2) = \max_{c \in S(c_1,c_2)} \left( \frac{2 \cdot I(c)}{I(c_1) + I(c_2)} \right) \]

The IC measure [4] (semantic similarity based on information coefficient) integrates information content and structural relationship:

\[sim_\mathrm{IC}(c_1,c_2) = \frac{2 \cdot \max_{c \in S(c_1,c_2)}I(c)} {I(c_1) + I(c_2)} \cdot \left( 1 - \frac{1}{1 + \max_{c \in S(c_1,c_2)}I(c)} \right) \]

The semantic similarity measure based on graph information content, GIC [5], is defined for two sets of terms, one containing the ancestors of protein A (GO(A)) and the other containing the ancestors of protein B (GO(B)). It is the sum of the information content of each term in the intersection of the two sets, divided by the sum of the information content of each term in their union:

\[sim_\mathrm{GIC}(A,B) = \frac {\sum_{c \in \{GO(A) \cap GO(B)\}} I(c)} {\sum_{c \in \{GO(A) \cup GO(B)\}} I(c)} \]
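A minimal sketch of this ratio with Python sets follows. The term names and information content values are hypothetical, and returning 0.0 for an empty or all-zero union is an assumption made here to avoid a division by zero.

```python
def sim_gic(terms_a, terms_b, ic):
    """Summed I(c) over the intersection divided by summed I(c) over the union."""
    denominator = sum(ic[c] for c in terms_a | terms_b)
    if denominator == 0:
        return 0.0  # assumption: no informative terms means no similarity
    return sum(ic[c] for c in terms_a & terms_b) / denominator

# Hypothetical I(c) values and ancestor sets GO(A), GO(B).
ic = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 3.0}
go_a = {"root", "a", "b"}
go_b = {"root", "a", "c"}

print(round(sim_gic(go_a, go_b, ic), 3))  # 0.167, i.e. 1.0 / (1.0 + 2.0 + 3.0)
print(sim_gic(go_a, go_a, ic))            # 1.0 for identical annotation sets
```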

The semantic similarity according to Schlicker et al. [6] combines Lin's and Resnik's similarity measures in order to take relevance information into account:

\[sim_\mathrm{Rel}(c_1,c_2) = \max_{c \in S(c_1,c_2)} \left( \frac{ 2 \cdot I(c)} {I(c_1) + I(c_2)} \cdot \left( 1 - p(c) \right) \right) \]

The last semantic similarity measure available in GOFunSim is Jiang and Conrath's measure [7]. Similar to Lin's measure, it considers the information content of the maximum informative common ancestor and of the input terms:

\[sim_\mathrm{JC}(c_1,c_2) = \frac {1} {1 + I(c_1) + I(c_2) - 2 \cdot \max_{c \in S(c_1,c_2)} I(c)} \]
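Once I(c1), I(c2) and the information content of the MICA are known, the Lin, Schlicker (rel) and Jiang & Conrath formulas reduce to simple arithmetic. The sketch below is illustrative only: the function names and input values are hypothetical, p(c) is recovered as exp(-I(c)) under the assumption of a natural logarithm, and returning 0.0 when both terms have zero information content is a choice made here. For rel, evaluating at the MICA is sufficient because both factors grow with I(c).

```python
import math

def lin(i1, i2, i_mica):
    """Lin: 2 * I(MICA) / (I(c1) + I(c2))."""
    total = i1 + i2
    return 2.0 * i_mica / total if total > 0 else 0.0

def rel(i1, i2, i_mica):
    """Schlicker: Lin's value weighted by 1 - p(MICA), with p = exp(-I)."""
    return lin(i1, i2, i_mica) * (1.0 - math.exp(-i_mica))

def jc(i1, i2, i_mica):
    """Jiang & Conrath: 1 / (1 + I(c1) + I(c2) - 2 * I(MICA))."""
    return 1.0 / (1.0 + i1 + i2 - 2.0 * i_mica)

# Hypothetical information content values.
i1, i2, i_mica = 4.0, 2.0, 2.0
print(round(lin(i1, i2, i_mica), 3))  # 0.667
print(round(rel(i1, i2, i_mica), 3))  # 0.576
print(round(jc(i1, i2, i_mica), 3))   # 0.333
```

For two identical terms the MICA is the term itself, so lin and jc both evaluate to 1.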

As mentioned above, these measures can be combined with functional similarity measures, which compute aggregates such as avg, max and bma over the semantic similarity matrix produced by the chosen semantic similarity measure. The semantic similarity matrix S(A,B) is composed of the pairwise semantic similarities s_{A(i),B(j)} of the m GO annotations of protein A (rows) versus the n GO annotations of protein B (columns),

\[ S(A,B) := \left( \begin{array}{llll} s_{A(1),B(1)} & s_{A(1),B(2)} & \ldots & s_{A(1),B(n)} \\ s_{A(2),B(1)} & s_{A(2),B(2)} & \ldots & s_{A(2),B(n)} \\ \vdots & \vdots & \ddots & \vdots \\ s_{A(m),B(1)} & s_{A(m),B(2)} & \ldots & s_{A(m),B(n)} \\ \end{array} \right). \]

The functional similarity measures considered in this tool are the following:

  • avg: Computes average over all matrix entries.
  • max: Computes the maximum value of the semantic similarity matrix.
  • rcavgmax: Computes the maximum of the averaged row and column maxima of the semantic similarity matrix.
  • bma (best match average): Computes the sum of the row maxima and the sum of the column maxima and normalizes this total by dividing it by the number of rows plus the number of columns of the semantic similarity matrix.
  • bma2 (best match average averaged): Computes the average of the row maxima and the average of the column maxima of the semantic similarity matrix and divides the sum of these two values by two.
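The five aggregations can be sketched in a few lines of Python. This is an illustration of the descriptions above rather than the tool's own code; the function name and the example matrix are hypothetical, and the matrix is assumed to be a non-empty list of equally long rows.

```python
def functional_similarity(matrix):
    """Aggregate an m x n semantic similarity matrix (list of rows)."""
    m, n = len(matrix), len(matrix[0])
    row_max = [max(row) for row in matrix]
    col_max = [max(col) for col in zip(*matrix)]
    entries = [value for row in matrix for value in row]
    row_avg = sum(row_max) / m   # average of the row maxima
    col_avg = sum(col_max) / n   # average of the column maxima
    return {
        "avg": sum(entries) / (m * n),
        "max": max(entries),
        "rcavgmax": max(row_avg, col_avg),
        "bma": (sum(row_max) + sum(col_max)) / (m + n),
        "bma2": (row_avg + col_avg) / 2,
    }

# Hypothetical matrix: 2 annotations of protein A versus 3 of protein B.
S = [[1.0, 0.2, 0.0],
     [0.4, 0.6, 0.2]]
scores = functional_similarity(S)
print({name: round(value, 2) for name, value in scores.items()})
# {'avg': 0.4, 'max': 1.0, 'rcavgmax': 0.8, 'bma': 0.68, 'bma2': 0.7}
```

Note that bma and bma2 differ whenever m and n differ: bma weights every row and column maximum equally, whereas bma2 weights the two proteins equally.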

Input

GOFunSim is a command line tool. Given a suitable tab-separated file, it can

  • compute semantic similarity between pairs of proteins,
  • compute semantic similarity between pairs of GO terms, and
  • retrieve the information content of specific GO terms.

The --in option indicates the input type and thereby selects the operation to perform. The possible input types are:

  • UniProt, for computing semantic similarity for pairs of proteins specified by UniProt/Swiss-Prot accession numbers;
  • GOTerm, for computing semantic similarity for pairs of GO term IDs;
  • GOTerm_IC, for computing the information content of a GO term ID.

The number of input columns depends on this choice: we supply one column to retrieve the information content of GO terms or to compare the input file against an additional panel file, and two columns otherwise. The ontology namespace has to be provided with the --ontology option, and a populated graph can be supplied with the --ontology-graph-file option. If no graph is provided, it is taken from the internal collection of precomputed graphs.

Functional similarity is based on semantic similarity measures, which are chosen via the --semantic-similarity option with the possible arguments resnik, lin, ic, gic, rel and jc for Resnik's, Lin's, information coefficient, graph information content, Schlicker's and Jiang & Conrath's measure, respectively. Functional similarity measures are defined for pairs of proteins: a protein is usually annotated with multiple GO identifiers, and the resulting term similarities can be combined in different ways using the --functional-similarity option with the arguments described above. Both options accept multiple measures, in which case every combination is computed. So for n different semantic similarity measures and m different functional similarity measures, m * n values will be computed.
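The m * n combinations can be enumerated as in the following sketch. The helper name is hypothetical, and the "ontology functional semantic" header layout is taken from the protein/protein example in the Output section, not from the tool's source.

```python
from itertools import product

def output_headers(ontology, functional, semantic):
    """One header per (functional, semantic) pair, functional measure first."""
    return [f"{ontology} {f} {s}" for f, s in product(functional, semantic)]

headers = output_headers("BP", ["avg", "max"], ["lin", "resnik", "ic"])
print(len(headers))  # 6, i.e. m * n with m = 2 and n = 3
print(headers)
# ['BP avg lin', 'BP avg resnik', 'BP avg ic', 'BP max lin', 'BP max resnik', 'BP max ic']
```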

Functional similarity measures cannot be computed for pairs of GO terms, but we can choose multiple semantic similarity measures.

As mentioned before, it is also possible to compute the similarity between one protein and a set of proteins. This is done with the --panel-file option, which specifies the file containing the proteins to compare against. With --panel-file we can also collapse the UniProt/Swiss-Prot accession numbers of the similar proteins from the panel file into a single line and report the number of functionally similar proteins. These operations are enabled with the --collapse and --count options; in this case, however, only a single semantic and a single functional similarity measure can be selected. Lastly, the --threshold option restricts the output to functional or semantic similarities higher than the given threshold.
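The combined effect of thresholding, collapsing and counting can be pictured with the sketch below. It is not the tool's implementation: the function name, the accession numbers, the scores and the comma separator are all hypothetical, and the comparison is taken to be strictly greater than the threshold, following the wording above.

```python
def collapse_panel_hits(scores, threshold):
    """Join the accessions scoring above the threshold and count them.

    scores maps each panel accession to its similarity with the query protein.
    """
    hits = [accession for accession, score in scores.items() if score > threshold]
    return ",".join(hits), len(hits)

# Hypothetical panel accessions and similarity scores for one query protein.
scores = {"P00001": 0.91, "P00002": 0.35, "P00003": 0.78}
collapsed, count = collapse_panel_hits(scores, threshold=0.5)
print(collapsed)  # P00001,P00003
print(count)      # 2
```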

Options applicable to more than a single tool are summarized in common command line options.

Output

The output depends on the input type. The basic idea is to append the computed similarity measures to the input table as new columns. For the protein/protein comparison, all chosen semantic similarity measures are added for each functional similarity measure. If there is a column header, each new column header contains the ontology namespace (BP, MF, or CC), the name of the measure computed and the threshold value, if present. When the information content is requested, the output has a single new column with the IC score.

For example, given the input file /tmp/GoFunSim-in.tsv for a protein/protein comparison:

Prot1    Prot2
Q07837   Q07837

If we compute average and maximum functional similarity for semantic similarity measures according to Lin, Resnik, and information coefficient:

$ python GOFunSim.py -H -c 1 2 --in UniProt --ontology BP --functional-similarity avg max --semantic-similarity lin resnik ic --from-file /tmp/GoFunSim-in.tsv

the output table is expected to look like this:

Prot1    Prot2    BP avg lin    BP avg resnik    BP avg ic    BP max lin    BP max resnik    BP max ic
Q07837   Q07837   0.45          1.14             0.26         1.00          6.15             0.86

Given another input file, /tmp/GoFunSim-go-in.tsv, for a pairwise GO term comparison:

GO_Term1    GO_Term2
GO:0000001  GO:0007005

If we compute semantic similarity measures according to Schlicker, Lin, Resnik and Jiang & Conrath:

$ python GOFunSim.py -H -c 1 2 --in GOTerm --ontology BP --semantic-similarity rel lin resnik jc --from-file /tmp/GoFunSim-go-in.tsv

the output table is the following:

GO_Term1      GO_Term2      BP rel    BP lin    BP resnik    BP jc
GO:0000001    GO:0007005    0.74      0.74      3.57         0.28

If we are interested in retrieving the information content for a specific GO term, given the following input file /tmp/GOFunSim-go-ic-in.tsv:

GO_Term
GO:0007005
GO:0000001

with this command line specification:

$ python GOFunSim.py -H -c 1 --in GOTerm --ontology BP --information-content --from-file /tmp/GOFunSim-go-ic-in.tsv

we will obtain the following table:

GO_Term       Information Content
GO:0007005    3.57
GO:0000001    6.10
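
As a consistency check, the Lin and Jiang & Conrath values reported for the pair GO:0000001 / GO:0007005 can be reproduced from these information content values. The calculation assumes that the maximum informative common ancestor of the pair is GO:0007005 itself, which is suggested by the reported Resnik similarity (3.57) being equal to that term's information content:

\[sim_\mathrm{Lin} = \frac{2 \cdot 3.57}{6.10 + 3.57} = \frac{7.14}{9.67} \approx 0.74\]

\[sim_\mathrm{JC} = \frac{1}{1 + 6.10 + 3.57 - 2 \cdot 3.57} = \frac{1}{3.53} \approx 0.28\]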

References

[1] Guzzi PH, Marano M, Guerra C, Cannataro M. (2012) Semantic similarity analysis of protein data: assessment with biological features and issues. Briefings in Bioinformatics. 13, 569-585. [PMID 22138322]

[2] Resnik P. (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI). 1, 448-453. [arxiv 9511007]

[3] Lin D. (1998) An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning. 1, 296–304.

[4] Li B, Wang JZ, Feltus FA, et al. (2010) Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. Arxiv preprint arXiv:1001.0958. 1–54. [arxiv 1001.0958]

[5] Pesquita C, Faria D, Bastos H, et al. (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 9(Suppl 5):S4. [PMID 18460186]

[6] Schlicker A, Domingues FS, Rahnenfuhrer J, et al. (2006) A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 7:302. [PMID 16776819]

[7] Jiang JJ, Conrath DW. (1997) Semantic similarity based on corpus statistics and lexical taxonomy. International Conference Research on Computational Linguistics (ROCLING X). Arxiv preprint cmp-lg/9709008. [arxiv 9709008]