Data Integrator
ReactomeAnnotator - Retrieve curated Reactome data

The ReactomeAnnotator is a command-line tool to access Reactome data. The following text is copied from the Reactome website, as it nicely describes the resource: "Reactome is a curated database of pathways and reactions (i.e. pathway steps) in human biology. The Reactome definition of a reaction includes many events in biology that are changes in state, such as binding, activation, translocation and degradation, in addition to classical biochemical reactions. Information in the database is authored by expert biologist researchers, maintained by Reactome editorial staff." A detailed description of Reactome, though largely focussed on its website, can be found here.

Pathways in Reactome are hierarchical, that is, one pathway will usually contain several other pathways. For example, the highest-level pathways are currently some twenty entries like "Disease", "Apoptosis", or "Metabolism". Apoptosis is split into a number of sub-pathways like intrinsic apoptosis, which in turn are split into multiple, more specific pathways. The finest steps in the pathway hierarchy are the biochemical reactions, for example "FasL:Fas binds FADD". These reactions are defined in such a way that they can participate in multiple pathways. Every item in Reactome is identified via a stable id of the form REACT_2191, followed by an optional version (e.g. REACT_2191.3). The id itself does not reveal whether it describes a protein, small molecule, pathway, etc.; this is defined via the respective category. Reactome categories are, for example, Pathway, BiochemicalReaction, Protein, Complex, EntitySet, or Catalysis.

In addition to proteins (identified via their UniProtKB accession numbers), Reactome also contains information about molecules (category SmallMolecule, identified by their ChEBI identifiers), DNA/RNA (categories DNA and RNA), and other biological entities.

UniProt proteins may be represented by multiple entries in Reactome, as the same protein present in different cellular components receives different Reactome ids (e.g. REACT_3853 for A2M in the extracellular region and REACT_3710 for A2M in platelet alpha granule lumen). In addition, a Reactome "protein" may not only represent an individual protein (as defined by UniProt), but may group multiple similar proteins (e.g. VAMP as a protein that describes VAMP2, VAMP7, and VAMP8). Normally, however, these sets would be defined via EntitySets, and not via Reactome protein entries.

Input

The ReactomeAnnotator is a command-line tool. Most options operate on tabular plain text files that are specified via the --from-file parameter (which can refer to a file or to standard input) and the -c parameter, which defines the column in the input file/stream (numbering starts at 1). If the input has a header line, this can be specified via the -H option.
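
The column convention can be illustrated with a minimal Python sketch. It is not taken from the tool's source: the function name read_column is hypothetical, and a tab-separated input is an assumption. It only shows how a 1-based column number and an optional header line translate into reading a stream.

```python
import csv
import io


def read_column(stream, column, has_header=False, delimiter="\t"):
    """Return the values of a 1-based column from a tabular text stream.

    Hypothetical helper for illustration; tab separation is an assumption.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header line, as -H would indicate
    # Column numbering starts at 1, so subtract 1 for Python indexing.
    return [row[column - 1] for row in reader if row]


sample = io.StringIO(
    "Reactome Id\tComment\n"
    "REACT_25.2\tA reaction\n"
    "REACT_0\tAn invalid ID\n"
)
ids = read_column(sample, column=1, has_header=True)
print(ids)
```

The same function would work on sys.stdin, which corresponds to reading from standard input instead of a file.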

The option --query-term can be used to retrieve Reactome entries that contain certain query string(s). If multiple search strings are provided, separated by spaces (e.g. --query-term cancer signalling), they are by default combined to further restrict the search (logical "AND", i.e., retrieve all terms which contain both "cancer" and "signalling"). This default behavior can be changed by adding the --combine-terms option with the value "OR", which returns all terms that contain either "cancer" or "signalling". The search can be restricted further with the --limit-category option, which limits the results to the categories listed, for example, proteins ("Protein"), pathways ("Pathway"), or reactions ("BiochemicalReaction"). All possible categories are listed in the help text (--help).
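
The difference between the "AND" and "OR" combination can be sketched in a few lines of Python. This is an illustration only: the function matches and the entry names are hypothetical, and case-insensitive substring matching is an assumption that the text above does not confirm.

```python
def matches(name, terms, combine="AND"):
    """Check whether an entry name contains the search terms.

    Assumption: matching is a case-insensitive substring test.
    """
    name_lower = name.lower()
    hits = [term.lower() in name_lower for term in terms]
    if combine == "OR":
        return any(hits)  # at least one term must occur
    return all(hits)      # default "AND": every term must occur


# Hypothetical entry names, used only to show the two modes.
names = [
    "Signalling in cancer",
    "Cancer overview",
    "Signalling by a hypothetical receptor",
]
terms = ["cancer", "signalling"]

and_hits = [n for n in names if matches(n, terms)]
or_hits = [n for n in names if matches(n, terms, combine="OR")]
print(and_hits)
print(or_hits)
```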

--limit-category can also be used in the other tool use cases, where the input is defined via UniProt (--in uniprot) or Reactome identifiers (--in reactome). Depending on the additional parameters, the tool will try to find the requested identifiers in the Reactome data and optionally extend the results with additional items higher or lower in the pathway hierarchy. For each of the resulting Reactome items, the Reactome identifier (--id), name (--name), category (--category), or external database identifier (--xref) can be requested.

If the input is in the Reactome identifier system (--in reactome), the tool will always report the respective Reactome entry in the output, given that it is a valid entry present in the database. The Reactome identifier can be given either with or without version, but if the version is provided, it has to match the version in the database, i.e., REACT_1.1 as a query will not return REACT_1.2, while the query REACT_1 would return it.
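
The version rule can be written down as a small Python sketch. The helper names split_id and id_matches are hypothetical and the code is not taken from the tool; it only restates the rule that a query without version matches any version, while a query with version must match exactly.

```python
def split_id(reactome_id):
    """Split 'REACT_1.2' into ('REACT_1', '2'); the version may be absent."""
    stable, _, version = reactome_id.partition(".")
    return stable, version or None


def id_matches(query, stored):
    """Return True if a queried id selects the id stored in the database."""
    query_stable, query_version = split_id(query)
    stored_stable, stored_version = split_id(stored)
    if query_stable != stored_stable:
        return False
    # No version in the query: any stored version is accepted.
    return query_version is None or query_version == stored_version


print(id_matches("REACT_1.1", "REACT_1.2"))  # versions differ
print(id_matches("REACT_1", "REACT_1.2"))    # query without version
```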

The results can be extended by adding the --parents or --children options. The additional entries reported by those options depend on the category of the Reactome entry that is used for querying. A protein or small molecule only has parents (e.g., reactions, complexes, or pathways) but no children, as it is the most specific level in the Reactome hierarchy. A pathway, on the other hand, will always have children (e.g. sub-pathways, reactions, proteins, or complexes) but might have no parents, if it is the most generic one (e.g. "Disease" or "Metabolism").
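
How parents and children relate to the category of the queried entry can be seen on a toy hierarchy. The sketch below is an assumption-laden illustration: the entries are made-up placeholders, the functions descendants and ancestors are hypothetical, and it assumes that the extension follows the hierarchy transitively, which the text above does not state explicitly.

```python
# Toy hierarchy with placeholder names (not real Reactome entries).
CHILDREN = {
    "Pathway A": ["Sub-pathway B", "Reaction R1"],
    "Sub-pathway B": ["Reaction R2"],
    "Reaction R1": ["Protein P"],
    "Reaction R2": ["Protein P", "Complex C"],
}


def descendants(item):
    """Collect all entries below an item (assumed transitive)."""
    found = []
    queue = list(CHILDREN.get(item, []))
    while queue:
        node = queue.pop(0)
        if node not in found:
            found.append(node)
            queue.extend(CHILDREN.get(node, []))
    return found


def ancestors(item):
    """Collect all entries above an item (assumed transitive)."""
    parents_of = {}
    for parent, kids in CHILDREN.items():
        for kid in kids:
            parents_of.setdefault(kid, []).append(parent)
    found = []
    queue = list(parents_of.get(item, []))
    while queue:
        node = queue.pop(0)
        if node not in found:
            found.append(node)
            queue.extend(parents_of.get(node, []))
    return found


print(descendants("Protein P"))  # a protein has no children
print(ancestors("Pathway A"))    # a top-level pathway has no parents
print(ancestors("Protein P"))
```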

If the input is given as UniProtKB accession numbers (--in uniprot), the tool will report the respective Reactome protein entries, if they are known. By adding the --parents option, the tool will also include all entries to which the protein is annotated, e.g. pathways, complexes, or reactions. As proteins have no children, adding --children to an --in uniprot query will only report the protein itself.

Both input options (--in) support the use of panel files containing identifiers to restrict the results. As with normal input files, a panel file is specified via a parameter, here --panel-file; --panel-file-H can be added to indicate that the panel file has a header line that should be skipped. The column containing the identifiers in the panel file is indicated using the --panel-file-c parameter (as with -c, numbering starts at 1). If a panel file is used, the tool will compare the identifiers defined in it with the identifiers reported in the columns that are requested via the --id and --xref parameters. This way, the ids in the panel file can either be Reactome ids (--id) or external identifiers (--xref) like UniProtKB accession numbers or ChEBI identifiers.
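
The effect of a panel file can be sketched as a simple filter. This is a guess at the behavior rather than the tool's implementation: the function filter_by_panel is hypothetical, the xref values are placeholders instead of real accession numbers, and exact string comparison (without special handling of id versions) is an assumption.

```python
def filter_by_panel(results, panel_ids):
    """Keep results whose Reactome id or xref occurs in the panel.

    Assumption: identifiers are compared as exact strings.
    """
    panel = set(panel_ids)
    return [r for r in results if r["id"] in panel or r["xref"] in panel]


# Reactome ids taken from the text; xref values are placeholders.
results = [
    {"id": "REACT_3853", "xref": "ACC_A2M"},
    {"id": "REACT_3710", "xref": "ACC_A2M"},
    {"id": "REACT_5744", "xref": "ACC_ALB"},
]

by_reactome_id = filter_by_panel(results, ["REACT_5744"])
by_xref = filter_by_panel(results, ["ACC_A2M"])
print([r["id"] for r in by_reactome_id])
print([r["id"] for r in by_xref])
```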

By using the --count parameter, a column summarizing the number of distinct results will be added.

The option --header-suffix can be used to append a particular string to the header of each new column. This option can be useful if the tool is run multiple times, for instance with different panel files.

Additional options applicable to more than a single tool (such as --help/-h, --data-version, --version, empty-cell) are summarized in common command line options.

Output

If the --query-term option has been used, the tool will report each matching term with its Reactome id, name, and category. A header line will be reported whenever results have been found. If no matching terms have been found, the output will consist of empty cells.

If the tool is queried by UniProt accession numbers or Reactome identifiers (--in), each result will be returned as a new line, where the number of results depends on the particular options that have been chosen. By default, no new columns will be added to the input file; the user has to request them explicitly via parameters.

Based on the selection, the output will contain additional columns labeled:

  • Reactome ID (option --id)
  • Reactome Name (option --name)
  • Reactome Category (option --category)
  • Xref (option --xref)
  • Count (option --count)

If the option --header-suffix is used, the string that was provided to this option will be appended to the header of each new column.

If the --collapse option has been set, the output will be condensed into the input line, i.e., no additional lines will be added to the input. If this results in multiple values for a cell, they will be joined by a "|". Note that this can potentially result in very lengthy lines.
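
The collapsing step can be pictured with the following Python sketch. It is an illustration under assumptions, not the tool's code: the function collapse and the input labels query_1 and query_2 are hypothetical, and whether duplicate values are removed before joining is not specified above, so the sketch keeps them.

```python
def collapse(expanded_rows, separator="|"):
    """Merge result rows that belong to the same input line.

    expanded_rows is a list of (input_line, value) pairs, one per result.
    """
    merged = {}
    for input_line, value in expanded_rows:
        merged.setdefault(input_line, []).append(value)
    # Dictionaries keep insertion order, so input order is preserved.
    return [(line, separator.join(values)) for line, values in merged.items()]


# Hypothetical expanded output: the first input line has two results.
expanded = [
    ("query_1", "REACT_3853"),
    ("query_1", "REACT_3710"),
    ("query_2", "REACT_5744"),
]
collapsed = collapse(expanded)
for line, values in collapsed:
    print(line, values, sep="\t")
```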

Given the input file /tmp/reactome-in.tsv,

Reactome Id    Comment
REACT_25.2     A reaction
REACT_0        An invalid ID
REACT_5744     A protein
REACT_3449.2   A complex
REACT_121006.2 A pathway
REACT_14751.1  A control
REACT_20239    An entitiy set

the command line

python ReactomeAnnotator.py --in reactome -c 1 -H --from-file /tmp/reactome-in.tsv --category --id --name --header-suffix " (New)"

will output the following lines:

Reactome Id    Comment        Reactome ID (New) Reactome Name (New)                                                                                     Reactome Category (New)
REACT_25.2     A reaction     REACT_25.2        kallikrein + alpha2-macroglobulin -> kallikrein:alpha2-macrogloulin                                     BiochemicalReaction
REACT_0        An invalid ID  --                --                                                                                                      --
REACT_5744     A protein      REACT_5744.2      ALB,Serum albumin precursor                                                                             Protein
REACT_3449.2   A complex      REACT_3449.2      Alpha2-macroglobulin                                                                                    Complex
REACT_121006.2 A pathway      REACT_121006.2    Acyl chain remodeling of CL                                                                             Pathway
REACT_14751.1  A control      REACT_14751.1     'alpha2-macroglobulin [extracellular region]' negatively regulates 'Conversion of pro-apoA-I to apoA-I' Control
REACT_20239    An entitiy set REACT_20239.1     Vamp                                                                                                    Protein

References

[1] Croft D. et al. (2014) The Reactome pathway knowledgebase. Nucleic Acids Res. 42(Database issue):D472-7. doi: 10.1093/nar/gkt1102. [PMID 24243840]