Data Integrator
The functional similarity module, GOFunSim, builds upon the populated gene ontology (GO) graph, which consists of nodes with the attributes t_id (GO term identifier), name (GO term name), cnt and freq (computed values for the node's term count and its term count contributed by more specialized terms, i.e. child terms), and of edges with the GO term relationships.
This tool computes various information-content-based measures, such as semantic similarity measures, for pairs of GO terms (or proteins) and combines them with functional similarity measures. The semantic similarity measures always depend on an ontology, which can be chosen among biological process (BP), molecular function (MF), and cellular component (CC).
Semantic similarity measures quantify the similarity between two or more terms within an ontology [1]. Most of them are based on the MICA (maximum informative common ancestor) approach, which selects the common ancestor of two terms with the highest information content I(c),
$$I(c) = -\log p(c),$$
where p(c) is the probability of occurrence of the term c.
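As a minimal sketch of how I(c) could be obtained from annotation counts, the snippet below assumes that p(c) is estimated as the number of annotations to c or any of its descendants divided by the same quantity at the ontology root. That estimate, the natural logarithm, the toy counts, and the function name information_content are illustrative assumptions, not GOFunSim's actual code.

```python
import math

def information_content(term_count, root_count):
    """Return I(c) = -log p(c), with p(c) estimated as term_count / root_count.

    term_count: annotations to the term or any of its descendants (assumed).
    root_count: the same quantity for the ontology root (assumed).
    """
    if term_count <= 0 or term_count > root_count:
        raise ValueError("need 0 < term_count <= root_count")
    # log(root/term) equals -log(term/root) and avoids returning -0.0 at the root.
    return math.log(root_count / term_count)

# Hypothetical cumulative counts for a chain root -> parent -> child.
counts = {"root": 1000, "parent": 100, "child": 10}
ic = {term: information_content(n, counts["root"]) for term, n in counts.items()}

for term, value in ic.items():
    print(term, round(value, 2))
```

Under these assumptions the root carries no information (I = 0), and more specialized terms, which occur less often, receive a higher information content.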
Resnik [2] adopts this concept directly in his formula for the semantic similarity between two terms c1 and c2,
$$sim_{Resnik}(c_1, c_2) = \max_{c \in S(c_1, c_2)} I(c),$$
where S(c1, c2) is the set of all common ancestors of the two terms.
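Lin's measure [3] also considers the information content and the common ancestors of the terms, and relates what the terms share to the information content of the terms themselves:
$$sim_{Lin}(c_1, c_2) = \frac{2 \, I(c_{MICA})}{I(c_1) + I(c_2)},$$
where c_MICA denotes the maximum informative common ancestor of c1 and c2.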
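The IC measure [4] (semantic similarity based on information coefficient) integrates information content and structural relationship:
$$sim_{IC}(c_1, c_2) = \frac{2 \, I(c_{MICA})}{I(c_1) + I(c_2)} \cdot \left(1 - \frac{1}{1 + I(c_{MICA})}\right),$$
where c_MICA denotes the maximum informative common ancestor of c1 and c2.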
The semantic similarity measure based on graph information content, GIC [5], is the sum of the information content of each term in the intersection of two sets, one containing the GO terms of protein A with their ancestors (GO(A)) and the other containing the GO terms of protein B with their ancestors (GO(B)), divided by the sum of the information content of each term in the union of the same sets:
$$sim_{GIC}(A, B) = \frac{\sum_{c \in GO(A) \cap GO(B)} I(c)}{\sum_{c \in GO(A) \cup GO(B)} I(c)}$$
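The set-based description of GIC translates almost literally into code. The following sketch assumes that the term sets already include all ancestors and that the information content of every term is available in a dictionary; the function name sim_gic, the term labels, and the IC values are made up for illustration.

```python
def sim_gic(terms_a, terms_b, ic):
    """Ratio of summed information content: intersection over union.

    terms_a, terms_b: sets of GO term IDs, assumed to include ancestors.
    ic: dict mapping each term ID to its information content.
    """
    union_ic = sum(ic[t] for t in terms_a | terms_b)
    if union_ic == 0:
        return 0.0  # no informative terms at all
    return sum(ic[t] for t in terms_a & terms_b) / union_ic

# Hypothetical term sets and information content values.
ic = {"t1": 1.0, "t2": 2.0, "t3": 3.0}
go_a = {"t1", "t2"}
go_b = {"t1", "t3"}

print(sim_gic(go_a, go_b, ic))  # shared: t1 (1.0); union: t1, t2, t3 (6.0)
```

Because both sums run over the same information content values, the result always lies between 0 (no shared terms) and 1 (identical term sets).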
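The semantic similarity according to Schlicker et al. [6] combines Lin's and Resnik's similarity measures in order to take relevance information into account:
$$sim_{Rel}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left( \frac{2 \, I(c)}{I(c_1) + I(c_2)} \cdot \bigl(1 - p(c)\bigr) \right)$$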
The last semantic similarity measure available in GOFunSim is Jiang and Conrath's measure [7]. Like Lin's measure, it considers the maximum informative common ancestor c_MICA and the input terms:
$$sim_{JC}(c_1, c_2) = \frac{1}{1 + I(c_1) + I(c_2) - 2 \, I(c_{MICA})}$$
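Once the information content of both terms and of their maximum informative common ancestor (MICA) is known, the term-pair measures described above reduce to a few lines each. The functions below use the commonly published forms of the measures. Their names, the argument layout, and the use of 1/(1 + distance) to turn the Jiang and Conrath distance into a similarity are assumptions made for illustration, not GOFunSim's source code.

```python
def resnik(ic_mica):
    """Resnik: information content of the MICA."""
    return ic_mica

def lin(ic1, ic2, ic_mica):
    """Lin: twice the MICA's information content over the sum of both terms'."""
    total = ic1 + ic2
    return 2.0 * ic_mica / total if total > 0 else 0.0

def ic_measure(ic1, ic2, ic_mica):
    """Information coefficient: Lin's value damped for uninformative MICAs."""
    return lin(ic1, ic2, ic_mica) * (1.0 - 1.0 / (1.0 + ic_mica))

def rel(ic1, ic2, ic_mica, p_mica):
    """Schlicker: Lin's value weighted by the relevance 1 - p(MICA)."""
    return lin(ic1, ic2, ic_mica) * (1.0 - p_mica)

def jc(ic1, ic2, ic_mica):
    """Jiang & Conrath: distance I(c1) + I(c2) - 2 I(MICA), mapped to (0, 1]."""
    return 1.0 / (1.0 + ic1 + ic2 - 2.0 * ic_mica)

# Information content values from the GO term examples on this page;
# the MICA's value is taken to be the reported Resnik similarity (3.57).
ic_a, ic_b, ic_mica = 6.10, 3.57, 3.57
print(round(lin(ic_a, ic_b, ic_mica), 2), round(jc(ic_a, ic_b, ic_mica), 2))
```

With these inputs the sketch gives 0.74 for lin and 0.28 for jc, which agrees with the GOTerm example output further down this page. Likewise, ic_measure(6.15, 6.15, 6.15) rounds to 0.86, in line with the BP max ic value of the protein example if the maxima there stem from a pair of identical terms. The rel function takes p(MICA) as an explicit argument because the base of the logarithm used for I(c) is not stated here.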
As mentioned above, these measures can be combined with functional similarity measures, which compute aggregates such as avg, max, and bma over the semantic similarity matrix produced by the chosen semantic similarity measure. The semantic similarity matrix S(A,B) is composed of the pairwise semantic similarities s_{A(i),B(j)} of the m GO annotations of protein A (rows) versus the n GO annotations of protein B (columns):
$$S(A,B) = \begin{pmatrix} s_{A(1),B(1)} & \cdots & s_{A(1),B(n)} \\ \vdots & \ddots & \vdots \\ s_{A(m),B(1)} & \cdots & s_{A(m),B(n)} \end{pmatrix}$$
The functional similarity measures considered in this tool are the following:
avg: Computes the average over all matrix entries.
max: Computes the maximum value of the semantic similarity matrix.
rcavgmax: Computes the maximum of the averaged row maxima and the averaged column maxima of the semantic similarity matrix.
bma (best match average): Computes the sum of the row maxima and the sum of the column maxima and normalizes their total by dividing by the number of rows plus the number of columns of the semantic similarity matrix.
bma2 (best match average averaged): Computes the average of the row maxima and the average of the column maxima of the semantic similarity matrix and divides the sum of these two values by two.
GOFunSim is a command line tool that operates on a suitable tab-separated file. The --in option indicates the input type, which determines the operation to perform. The possible input types are:
UniProt: for computing semantic similarity for pairs of proteins specified by UniProt/Swiss-Prot accession numbers;
GOTerm: for computing semantic similarity for pairs of GO term IDs;
GOTerm_IC: for computing the information content of a GO term ID.
Depending on this choice, we supply one column if we are interested in the information content of a GO term or if we want to compare our input file with an additional panel file; otherwise we supply two columns. The ontology namespace has to be provided with the --ontology option, and a populated graph can be supplied with the --ontology-graph-file option. If no graph is provided, it is taken from the internal collection of precomputed graphs.
Functional similarity is based on semantic similarity measures, which are chosen via the --semantic-similarity option with the possible arguments resnik, lin, ic, gic, rel, and jc for Resnik's, Lin's, information coefficient, graph information content, Schlicker's, and Jiang & Conrath's measure, respectively. Functional similarity measures are defined for pairs of proteins. A protein is usually annotated with multiple GO identifiers, and the resulting pairwise similarities can be combined in different ways using the --functional-similarity option with the arguments described above. Both options accept multiple measures, and each combination is then computed. So for n different semantic similarity measures and m different functional similarity measures, m * n values are computed.
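The way functional similarity measures aggregate a semantic similarity matrix, and how m functional measures and n semantic measures yield m * n values, can be sketched as follows. The matrices hold made-up similarity values, one matrix per semantic measure, with rows for the GO annotations of protein A and columns for those of protein B. The dictionary layout and helper names are assumptions for illustration, and bma is read here as dividing by the number of rows plus the number of columns.

```python
from statistics import mean

def row_maxima(matrix):
    return [max(row) for row in matrix]

def col_maxima(matrix):
    return [max(col) for col in zip(*matrix)]

# One aggregation function per functional similarity measure.
FUNCTIONAL = {
    "avg": lambda m: mean(v for row in m for v in row),
    "max": lambda m: max(v for row in m for v in row),
    "rcavgmax": lambda m: max(mean(row_maxima(m)), mean(col_maxima(m))),
    "bma": lambda m: (sum(row_maxima(m)) + sum(col_maxima(m)))
                     / (len(m) + len(m[0])),
    "bma2": lambda m: (mean(row_maxima(m)) + mean(col_maxima(m))) / 2,
}

# Hypothetical 2 x 3 semantic similarity matrices (protein A has two
# annotations, protein B has three), one per semantic measure.
matrices = {
    "lin": [[1.0, 0.2, 0.4], [0.5, 0.8, 0.1]],
    "resnik": [[6.0, 1.0, 2.0], [3.0, 4.0, 0.5]],
}

# Every functional measure is applied to every semantic matrix: m * n values.
results = {
    (func_name, sem_name): func(matrix)
    for func_name, func in FUNCTIONAL.items()
    for sem_name, matrix in matrices.items()
}

for (func_name, sem_name), value in results.items():
    print(func_name, sem_name, round(value, 2))
```

With five functional and two semantic measures the sketch produces ten values, one per combination, mirroring the one-column-per-combination layout of the output table.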
Functional similarity measures cannot be computed for pairs of GO terms, but we can choose multiple semantic similarity measures.
As mentioned before, it is also possible to compute the similarity between one protein and a set of proteins. This is done by supplying the --panel-file option, which specifies the file containing the proteins to consider in the comparison. With --panel-file we can also choose to collapse the UniProt/Swiss-Prot accession numbers of the similar proteins from the panel file into a single line and to report the number of functionally similar proteins. These operations are enabled with the --collapse and --count options; in this case, however, only a single semantic and a single functional similarity measure can be selected. Lastly, the --threshold option restricts the output to functional or semantic similarities higher than the given threshold.
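The combined effect of a threshold, collapsing, and counting can be pictured with a small sketch. It assumes the similarities between one query protein and each panel protein are already available as a dictionary; the function name collapse_panel_hits, the comma separator, and the placeholder accession labels are hypothetical and do not describe GOFunSim's actual output format.

```python
def collapse_panel_hits(scores, threshold):
    """Collapse panel proteins whose similarity exceeds the threshold.

    scores: dict mapping a panel accession to its similarity with the query.
    Returns the accessions joined into one string and the number of hits.
    """
    # "Higher than this threshold" is read as a strict comparison.
    hits = [accession for accession, score in scores.items() if score > threshold]
    return ",".join(hits), len(hits)

# Placeholder accessions and made-up similarity values.
panel_scores = {"PANEL_A": 0.9, "PANEL_B": 0.3, "PANEL_C": 0.7}
collapsed, count = collapse_panel_hits(panel_scores, threshold=0.5)
print(collapsed, count)
```

A single similarity value per panel protein is needed for this filtering, which fits the restriction to one semantic and one functional similarity measure in this mode.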
Options applicable to more than a single tool are summarized in common command line options.
Since there are several input types, the output differs depending on the type chosen. In every case, the basic idea is to append the computed measures to the input table as new columns. For a protein/protein comparison, all chosen semantic similarity measures are added for each functional similarity measure. If there is a column header, the header of each new column contains the ontology namespace (BP, MF, or CC), the name of the measure computed, and the threshold value, if present. When the information content is requested, the output has a single new column with the IC score.
For example, given the input file /tmp/GoFunSim-in.tsv, we want to perform a protein/protein comparison:
Prot1   Prot2
Q07837  Q07837
If we compute average and maximum functional similarity for semantic similarity measures according to Lin, Resnik, and information coefficient:
$ python GOFunSim.py -H -c 1 2 --in UniProt --ontology BP --functional-similarity avg max --semantic-similarity lin resnik ic --from-file /tmp/GoFunSim-in.tsv
the output table is expected to look like this:
Prot1   Prot2   BP avg lin  BP avg resnik  BP avg ic  BP max lin  BP max resnik  BP max ic
Q07837  Q07837  0.45        1.14           0.26       1.00        6.15           0.86
Given another input file, /tmp/GoFunSim-go-in.tsv, we want to perform a pairwise GO term comparison:
GO_Term1    GO_Term2
GO:0000001  GO:0007005
If we compute semantic similarity measures according to Schlicker, Lin, Resnik, and Jiang & Conrath:
$ python GOFunSim.py -H -c 1 2 --in GOTerm --ontology BP --semantic-similarity rel lin resnik jc --from-file /tmp/GoFunSim-go-in.tsv
the output table is the following:
GO_Term1    GO_Term2    BP rel  BP lin  BP resnik  BP jc
GO:0000001  GO:0007005  0.74    0.74    3.57       0.28
If we are interested in retrieving the information content of specific GO terms, given the following input file /tmp/GOFunSim-go-ic-in.tsv:
GO_Term
GO:0007005
GO:0000001
with this command line specification:
$ python GOFunSim.py -H -c 1 --in GOTerm --ontology BP --information-content --from-file /tmp/GOFunSim-go-ic-in.tsv
we will obtain the following table:
GO_Term     Information Content
GO:0007005  3.57
GO:0000001  6.10
[1] Guzzi PH, Marano M, Guerra C, Cannataro M. (2012) Semantic similarity analysis of protein data: assessment with biological features and issues. Briefings in Bioinformatics. 13, 569-585. [PMID 22138322]
[2] Resnik P. (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI). 1, 448-453. [arxiv 9511007]
[3] Lin D. (1998) An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning. 1, 296–304.
[4] Li B, Wang JZ, Feltus FA, et al. (2010) Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. arXiv preprint arXiv:1001.0958. 1-54. [arxiv 1001.0958]
[5] Pesquita C, Faria D, Bastos H, et al. (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 9(Suppl 5):S4. [PMID 18460186]
[6] Schlicker A, Domingues FS, Rahnenfuhrer J, et al. (2006) A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 7:302. [PMID 16776819]
[7] Jiang JJ, Conrath DW. (1997) Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X). arXiv preprint cmp-lg/9709008. [arxiv 9709008]