Data Integrator
InteractionAnnotator - Retrieve protein-protein interactions and co-complex data

The InteractionAnnotator provides information about binary protein-protein interactions and protein co-complex relationships. Currently, it is based on iRefIndex, an integrative database of protein interactions and protein complexes. It is integrative in that it combines the data of various primary interaction databases (e.g. MINT, IntAct, BioGRID, DIP) into one repository. Given the huge diversity of interaction databases, this is a valuable effort, as it performs critical steps such as identifier mapping and interaction unification.

Like most of today’s interaction databases, iRefIndex provides its data in the HUPO-PSI MITAB format. MITAB is a tab-delimited flat-file format, the simple little brother of the very complex HUPO-PSI MI XML format. Its original specification, version 2.5, defines 15 columns. Versions 2.6 and 2.7 also exist, but are not widely supported yet. As we currently do not need the information in the additional columns (though some of it would be helpful), we simply ignore everything beyond the 15th column.
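The column handling described above can be sketched as follows. This is a minimal illustration, not the InteractionAnnotator's actual parser: the function name and the Python keys are hypothetical labels chosen here for the 15 columns of MITAB 2.5, and the handling of header and short lines is an assumption.

```python
# Hypothetical short names for the 15 MITAB 2.5 columns, in file order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_ids",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_ids", "confidence",
]


def parse_mitab_line(line):
    """Return the first 15 columns of a MITAB line as a dict.

    Returns None for blank lines, header lines starting with '#',
    and lines with fewer than 15 tab-separated fields (assumed policy).
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        return None
    # Columns beyond the 15th (MITAB 2.6/2.7 or database-specific) are ignored.
    return dict(zip(MITAB25_COLUMNS, fields[:len(MITAB25_COLUMNS)]))
```

Because only the first 15 fields are read, the same function accepts MITAB 2.5, 2.6, and 2.7 files alike.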

MITAB is also the basis for the PSICQUIC system (PSI Common QUery InterfaCe). The idea of this system, providing always up-to-date data from multiple resources with a single query, is great, but despite growing support in the community it is still not really useful for actual data integration. The main reason is that, while the data format and data access are well defined, MITAB files allow for a certain flexibility in the choice of identifiers. For instance, some databases use only UniProtKB accession numbers to specify the interacting molecules, others use only RefSeq, and yet others only Entrez Gene IDs. Since PSICQUIC simply indexes what it finds in MITAB files, it can only serve what is present in the files. Using the PSICQUICview client, certain interactions will therefore be missed, depending on the query identifier system.

One big question mark behind iRefIndex is its uncertain future. Until spring 2013, the database went without updates for almost one and a half years. Given that the responsible Donaldson group no longer exists, such a period without updates can happen again at any time. For this reason, the InteractionAnnotator is designed and implemented in a flexible manner, such that it can be used with any other MITAB file, e.g., any file retrieved from the PSICQUIC service mentioned above. In the current implementation, the MITAB file is loaded and parsed at each startup, which allows a great deal of flexibility, but results in a significant delay of at least one minute. To speed up the tool, a later version might make use of an (embedded) database or other caching solutions for temporarily storing the data.

Input

The InteractionAnnotator is a command-line tool that operates on tabular flat files. The primary use case for interaction data in our previous applications was detecting genes that interact with certain other genes, optionally limited to a particular subset of genes. Therefore, the only input that we support in this initial implementation is the query by gene or protein identifiers, specified via the --in parameter. Valid options for this parameter are entrez and uniprot, indicating that the input is given as Entrez gene identifiers or UniProtKB accession numbers, respectively.

The tabular input file is specified with the --from-file parameter; the column in the input file containing the identifier is indicated by the -c parameter (note that the first column in the file has the number one, not a programming-style zero). It can also be specified that the standard input should be used instead of a file.

If the input has a header line, this can be specified via the -H option (not to be confused with the lowercase -h, which prints the help).
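The 1-based column convention and the optional header line can be illustrated with a short sketch. The function name and its arguments are hypothetical and only mirror the roles of the -c and -H options; this is not the tool's own reader.

```python
import csv
import io


def read_query_ids(handle, column, has_header=False):
    """Yield identifiers from a 1-based column of a tab-delimited stream."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line, as requested via -H
    for row in reader:
        if len(row) >= column:
            # Convert the user-facing 1-based number to Python's 0-based index.
            yield row[column - 1]


# Usage with an in-memory file; sys.stdin could be passed in the same way.
example = io.StringIO("EntrezID\n22861\n")
ids = list(read_query_ids(example, column=1, has_header=True))
```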

Given that UniProtKB accession numbers are more widely supported in the interaction community than Entrez Gene identifiers, some MITAB interaction records are described using UniProtKB accession numbers only. In order to make these accessible for queries by Entrez gene identifiers, the InteractionAnnotator can be started with identifier mapping support by using the --map-ids option. If this option is used, the InteractionAnnotator will try to map all UniProtKB accession numbers to Entrez gene identifiers using the Dinthor Ensembl gene and protein mappers. The mapping is not restricted to this direction, i.e., it can also be used if the interaction file uses primarily Entrez gene identifiers and the input file is in the UniProtKB identifier system.

The parameter --data-type specifies what sort of interaction data should be incorporated into the InteractionAnnotator. The three options self (both participating interactors are the same entity), ppi (binary protein-protein interactions involving exactly two different interacting entities), and complex (assemblies of three or more interacting entities) can be combined to select multiple types at once. For instance, if ppi and complex are used together, all protein complexes will be expanded and the resulting binary interactions will be treated like true binary interactions. In the case of complexes, matrix expansion will be used to create "false" binary interactions between each pair of entities in the complex. For example, a complex with four proteins A, B, C, and D will result in six binary interactions A-B, A-C, A-D, B-C, B-D, and C-D. If the parameter is not used, ppi will be used by default.
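The matrix expansion of a complex described above amounts to enumerating all unordered pairs of its members. The following minimal sketch shows the idea; the function name is hypothetical, and sorting and de-duplicating the members is an assumption made here to obtain a stable output.

```python
from itertools import combinations


def matrix_expand(members):
    """Return all unordered pairs among the distinct members of a complex."""
    unique = sorted(set(members))
    return list(combinations(unique, 2))


pairs = matrix_expand(["A", "B", "C", "D"])
# pairs holds A-B, A-C, A-D, B-C, B-D, and C-D
```

A complex with n distinct members yields n(n-1)/2 pairs, so large complexes contribute many "false" binary interactions.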

The parameter --include-predictions specifies that interactions supported only by computational predictions should be included in the results. If the parameter is not used, computational predictions are excluded by default. In the current release, this parameter is only relevant for binary protein-protein interactions, as iRefIndex does not include complex predictions.

If the interacting genes should be filtered for those belonging to a particular set or panel of genes, this can be requested via the parameter --panel-file, which specifies the tabular text file that contains the panel genes. The parameter --panel-file-H can be added to indicate that the panel file has a header line that should be skipped. If the panel file contains multiple columns, the column containing the IDs of the panel genes is indicated using the --panel-file-c parameter. The IDs in the panel file have to be from the same identifier system that is used in the input file (defined via the --in parameter).

If interactions are filtered using a gene panel, the option --p-value can be used to add information about the significance of the interactions found. P-values are computed using Fisher’s exact test, based on the total number of interactions, interactions involving all panel genes, and interactions involving the particular query gene.
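How the three counts mentioned above enter the test is not spelled out here, so the following is only a sketch under stated assumptions: the counts are arranged as a 2x2 contingency table (panel vs. non-panel interactions, query gene vs. all others), and a one-sided test for over-representation is computed from the hypergeometric distribution. The function and argument names are hypothetical, and the tool itself may use a different table or a two-sided test.

```python
from math import comb


def fisher_right_tail(total, panel_total, query_total, query_in_panel):
    """One-sided Fisher's exact test p-value for over-representation.

    total          -- all interactions considered (assumed population)
    panel_total    -- interactions that involve a panel gene
    query_total    -- interactions that involve the query gene
    query_in_panel -- interactions of the query gene with a panel gene
    """
    upper = min(panel_total, query_total)
    denominator = comb(total, query_total)
    # Sum the hypergeometric probabilities of observing at least
    # query_in_panel panel interactions among the query gene's interactions.
    numerator = sum(
        comb(panel_total, i) * comb(total - panel_total, query_total - i)
        for i in range(query_in_panel, upper + 1)
    )
    return numerator / denominator


# Toy numbers, not taken from any real data set.
p = fisher_right_tail(total=10, panel_total=3, query_total=4, query_in_panel=2)
```

With these toy counts the sum is (3*21 + 1*7) / 210, i.e. one third.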

The parameter --interactor will add the identifiers of the interacting entities, i.e., depending on the chosen input type, either Entrez gene IDs or UniProtKB accession numbers.

If the parameter --detection-method is added, an additional column with the experimental detection methods will be added.

The parameter --interaction-type will add a column containing the type of the interaction that is reported in the input. This is not to be confused with the --data-type parameter.

By using the --count parameter, a column summarizing the number of distinct interactions will be added.

The option --header-suffix can be used to add a particular string to the end of each column header. This option can be practical if multiple runs of the InteractionAnnotator are performed, for instance, using different gene panels or different interaction types.

Additional options applicable to more than a single tool (such as --help, --data-version, --version, or --permissiveness) are summarized in common command line options.

Output

If the --print-stat parameter is used, the module will print some background information on the input file, such as the total number of protein-protein interactions or co-complexes.

In the normal use case, where interaction information is added to a tabular input file via the --from-file, --in, and -c parameters, each distinct interaction will be reported on a new line. The number of distinct interactions depends on the requested parameters and additional columns. As an example, take a gene that has only one known interaction partner, but where two independent publications have reported this interaction. If only the interactor column is requested (via the --interactor parameter), one interaction will be reported in the output. If additional literature information is requested via the --pubmed parameter, the tool will report two interactions, and the number of lines in the output will be greater than the number of lines in the input.
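The dependence of the line count on the requested columns can be sketched as follows. The records, identifiers, and function name are hypothetical placeholders that mirror the one-partner, two-publication example above; they are not output of the tool.

```python
def distinct_rows(records, columns):
    """Return distinct tuples of the requested columns, in first-seen order."""
    seen = set()
    rows = []
    for record in records:
        row = tuple(record[c] for c in columns)
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


# One interaction partner, reported by two publications (placeholder values).
records = [
    {"interactor": "GENE_B", "pubmed": "PMID_1"},
    {"interactor": "GENE_B", "pubmed": "PMID_2"},
]

only_interactor = distinct_rows(records, ["interactor"])
with_pubmed = distinct_rows(records, ["interactor", "pubmed"])
```

Here only_interactor contains a single row, while with_pubmed contains two, matching the behaviour described in the paragraph above.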

By default, no new columns will be added to the input file; the user has to specify via parameters which columns should be added. Based on the selection, the output will contain additional columns labeled:

  • Interactor (flag --interactor)
  • P-value (flag --p-value)
  • Pubmed ID (flag --pubmed)
  • Detection method (flag --detection-method)
  • Interaction type (flag --interaction-type)
  • Confidence score (flag --confidence)
  • Source (flag --source)
  • Count (flag --count)

If the option --header-suffix is used, the string that was provided to this option will be added to the end of each column header.

If the --collapse flag has been set, the output will be condensed into the input line, i.e., no additional lines will be added to the input. If this results in multiple values for a column, they will be joined by a "|". Note that this can potentially result in very lengthy lines if hundreds of interactions are to be joined.
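The collapsing step can be pictured with a small sketch. The function name is hypothetical, and joining the values column by column is an assumption about how multiple added columns are handled; only the "|" separator is taken from the description above.

```python
def collapse_rows(input_row, added_rows, separator="|"):
    """Merge several rows of added values into one row appended to input_row."""
    if not added_rows:
        return list(input_row)
    # zip(*added_rows) regroups the values by column before joining them.
    joined = [separator.join(column) for column in zip(*added_rows)]
    return list(input_row) + joined


# Three interactor values for one input line become a single cell.
collapsed = collapse_rows(["22861"], [["10392"], ["152138"], ["29108"]])
```

In this sketch, collapsed is a single row whose second cell reads 10392|152138|29108.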

For example, if the input file is stored in /tmp/interaction.tsv as

EntrezID
22861

then the following command finds interacting proteins for the NLRP1 protein, which in this case is specified by an NCBI Entrez gene ID.

$ python InteractionAnnotator.py --in entrez -c 1 -H --interactor --from-file /tmp/interaction.tsv

The result file looks like this; note that the interactors are also reported as NCBI gene IDs.

EntrezID Interactor
22861    10392
22861    152138
22861    29108
22861    317
22861    335
22861    58484
22861    596
22861    598
22861    64127
22861    834
22861    835
22861    838
22861    842

References

[1] Razick, S. et al. (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 9:405. doi: 10.1186/1471-2105-9-405. [PMID 18823568]

[2] Kerrien, S. et al. (2007) Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 5:44. [PMID 17925023]