Data Integrator
The ReactomeAnnotator is a command-line tool to access Reactome data. The following text is copied from the Reactome website, as it nicely describes the resource: "Reactome is a curated database of pathways and reactions (i.e. pathway steps) in human biology. The Reactome definition of a reaction includes many events in biology that are changes in state, such as binding, activation, translocation and degradation, in addition to classical biochemical reactions. Information in the database is authored by expert biologist researchers, maintained by Reactome editorial staff." A detailed description of Reactome, though largely focussed on its website, can be found here.
Pathways in Reactome are hierarchical, that is, one pathway will usually contain several other pathways. For example, the highest-level pathways are currently some twenty entries like "Disease", "Apoptosis", or "Metabolism". Apoptosis is split into a number of sub-pathways like intrinsic apoptosis, which again are split into multiple, more specific pathways. The finest steps in the pathway hierarchy are the biochemical reactions, for example "FasL:Fas binds FADD". These reactions are defined in a way that they can participate in multiple pathways. Every item in Reactome is identified via a stable id of the form REACT_2191, followed by an optional version (e.g. REACT_2191.3). The id itself does not reveal whether it describes a protein, small molecule, pathway, etc.; this is defined via the respective category. Reactome categories are, for example, Pathway, BiochemicalReaction, Protein, Complex, EntitySet, or Catalysis.
In addition to proteins (identified via their UniProtKB accession numbers), Reactome also contains information about molecules (category SmallMolecule, identified by their ChEBI identifiers), DNA/RNA (categories DNA and RNA), and other biological entities.
UniProt proteins may be represented by multiple entries in Reactome, as the same protein present in different cellular components receives different Reactome ids (e.g. REACT_3853 for A2M in the extracellular region and REACT_3710 for A2M in platelet alpha granule lumen). In addition, a Reactome "protein" may not only represent an individual protein (as defined by UniProt), but may group multiple similar proteins (e.g. VAMP as a protein that describes VAMP2, VAMP7, and VAMP8). Normally, however, these sets would be defined via EntitySets, and not via Reactome protein entries.
The ReactomeAnnotator is a command-line tool. Most options operate on tabular plain-text files that are specified via the --from-file parameter (which can refer to a file or to standard input) and the -c parameter, which defines the column in the input file or stream (numbering starts at 1). If the input has a header line, this can be indicated via the -H option.
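The column convention described above can be illustrated with a minimal sketch. It is not the tool's own code: the function name read_column is hypothetical, and a tab as field separator is an assumption made for this example.

```python
import csv
import io


def read_column(stream, column, has_header=False, delimiter="\t"):
    """Return the values of a 1-based column from a delimited text stream.

    Hypothetical helper; the tab delimiter is an assumption.
    """
    if column < 1:
        raise ValueError("column numbering starts at 1")
    reader = csv.reader(stream, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header line, as -H would indicate
    # Rows that are too short to contain the requested column are ignored.
    return [row[column - 1] for row in reader if len(row) >= column]


# A small in-memory stand-in for an input file or standard input.
sample = io.StringIO(
    "Reactome Id\tComment\n"
    "REACT_25.2\tA reaction\n"
    "REACT_0\tAn invalid ID\n"
)
ids = read_column(sample, column=1, has_header=True)
```

Because the function accepts any text stream, the same call would work with an opened file or with sys.stdin.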
The option --query-term can be used to retrieve Reactome entries that contain certain query string(s). If multiple search strings are provided, separated by spaces (e.g. --query-term cancer signalling), they are by default used to further restrict the search (logical "AND", i.e., retrieve all terms which contain both "cancer" and "signalling"). This default behavior can be changed by adding the --combine-terms option with the value "OR", which would return all terms that contain either "cancer" or "signalling". The term search space can be restricted further with the --limit-category option, which limits the results to the categories listed, for example, proteins ("Protein"), pathways ("Pathway"), or reactions ("BiochemicalReaction"). All possible categories are listed in the help text (--help).
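The "AND"/"OR" semantics and the category restriction can be sketched as follows. This is only an illustration under stated assumptions: the function search and the entry names are invented for the example, and case-insensitive substring matching is an assumption, not documented behavior of the tool.

```python
def search(entries, terms, combine="AND", categories=None):
    """Return the names of entries that match the query terms.

    entries: iterable of (name, category) tuples (toy data, not Reactome).
    combine: "AND" requires every term, "OR" requires at least one.
    categories: optional set of categories to keep.
    """
    if combine not in ("AND", "OR"):
        raise ValueError("combine must be 'AND' or 'OR'")
    pick = all if combine == "AND" else any
    results = []
    for name, category in entries:
        if categories is not None and category not in categories:
            continue
        lowered = name.lower()
        # Assumption: a term matches if it occurs anywhere in the name.
        if pick(term.lower() in lowered for term in terms):
            results.append(name)
    return results


# Invented entries, used only to show the matching logic.
entries = [
    ("Signalling in cancer", "Pathway"),
    ("Cancer-related protein", "Protein"),
    ("Signalling by receptor", "Pathway"),
    ("Glucose metabolism", "Pathway"),
]
terms = ["cancer", "signalling"]
and_hits = search(entries, terms)
or_hits = search(entries, terms, combine="OR")
or_pathways = search(entries, terms, combine="OR", categories={"Pathway"})
```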
The --limit-category option can also be used in the other use cases of the tool, where the input is defined via UniProt accession numbers (--in uniprot) or Reactome identifiers (--in reactome). Depending on the additional parameters, the tool will try to find the requested identifiers in the Reactome data and optionally extend the results with additional items higher or lower in the pathway hierarchy. For each of the resulting Reactome items, the Reactome identifier (--id), name (--name), category (--category), or external database identifier (--xref) can be requested.
If the input consists of Reactome identifiers (--in reactome), the tool will always report the respective Reactome entry in the output, provided that it is a valid entry present in the database. The Reactome identifier can be given either with or without a version, but if the version is provided, it has to match the version in the database, i.e., the query REACT_1.1 will not return REACT_1.2, while the query REACT_1 would return it.
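The version rule can be written down as a small sketch. The helper names split_id and lookup are hypothetical, and the database is reduced to a plain list of versioned ids for the purpose of the example.

```python
def split_id(identifier):
    """Split 'REACT_1.2' into ('REACT_1', '2'); the version may be missing."""
    base, _, version = identifier.partition(".")
    return base, version or None


def lookup(query, database_ids):
    """Return the stored id matching the query, or None if there is none.

    A query without a version matches any stored version; a query with a
    version only matches if the stored version is identical.
    """
    query_base, query_version = split_id(query)
    for stored in database_ids:
        base, version = split_id(stored)
        if base == query_base and query_version in (None, version):
            return stored
    return None


database_ids = ["REACT_1.2"]  # toy database holding a single entry
without_version = lookup("REACT_1", database_ids)
wrong_version = lookup("REACT_1.1", database_ids)
exact_version = lookup("REACT_1.2", database_ids)
```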
The results can be extended by adding the --parents or --children options. The additional entries reported by those options depend on the category of the Reactome entry that is used for querying. A protein or small molecule only has parents (e.g., reactions, complexes, or pathways) but no children, as it is the most specific level in the Reactome hierarchy. A pathway, on the other hand, will always have children (e.g. sub-pathways, reactions, proteins, or complexes) but might have no parents, if it is the most generic one (e.g. "Disease" or "Metabolism").
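How parents and children relate to the position of an entry in the hierarchy can be sketched on a toy example. The entry names are placeholders, the functions descendants and ancestors are hypothetical, and following the hierarchy across all levels (rather than one level only) is an assumption about the tool's behavior.

```python
# Toy hierarchy with placeholder names: each key lists its direct children.
children_of = {
    "Pathway A": ["Sub-pathway B"],
    "Sub-pathway B": ["Reaction C"],
    "Reaction C": ["Protein D"],
}


def descendants(item, links):
    """Collect everything reachable from item by following links."""
    found = []
    stack = list(links.get(item, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(links.get(current, []))
    return found


def ancestors(item, links):
    """Collect everything above item by inverting the child links."""
    parents_of = {}
    for parent, kids in links.items():
        for kid in kids:
            parents_of.setdefault(kid, []).append(parent)
    return descendants(item, parents_of)


pathway_children = descendants("Pathway A", children_of)
pathway_parents = ancestors("Pathway A", children_of)
protein_children = descendants("Protein D", children_of)
protein_parents = ancestors("Protein D", children_of)
```

In this sketch the most generic pathway has children but no parents, and the protein has parents but no children, mirroring the two extremes described above.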
If the input is given as UniProtKB accession numbers (--in uniprot), the tool will report the respective Reactome protein entries, if they are known. By adding the --parents option, the tool will also include all entries to which the protein is annotated, e.g. pathways, complexes, or reactions. As proteins have no further children, adding --children to an --in uniprot query will only report the protein itself.
Both input options (--in) support the use of panel files containing identifiers to restrict the results. Comparable to normal input files, a panel file is defined via the --panel-file parameter; --panel-file-H can be added to indicate that the panel file has a header line that should be skipped. The column containing the identifiers in the panel file is indicated using the --panel-file-c parameter (as with -c, numbering starts at 1). If a panel file is used, the tool will compare the identifiers defined in it with the identifiers reported in the columns that are requested via the --id and --xref parameters. This way, the ids in the panel file can either be Reactome ids (--id) or external identifiers (--xref) like UniProtKB accession numbers or ChEBI identifiers.
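The panel restriction can be pictured with the following sketch. It is not the tool's implementation: load_panel and restrict_to_panel are hypothetical names, the accession values ACC_A and ACC_B are placeholders, and exact string comparison of identifiers is an assumption.

```python
import csv
import io


def load_panel(stream, column=1, has_header=False):
    """Read the identifiers from a 1-based column of a tab-separated panel."""
    reader = csv.reader(stream, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the panel header line
    return {row[column - 1] for row in reader if len(row) >= column}


def restrict_to_panel(results, panel):
    """Keep results whose Reactome id or external id occurs in the panel."""
    return [r for r in results if r["id"] in panel or r["xref"] in panel]


# Reactome ids taken from the text; the xref values are placeholders.
results = [
    {"id": "REACT_3853", "xref": "ACC_A"},
    {"id": "REACT_3710", "xref": "ACC_A"},
    {"id": "REACT_5744", "xref": "ACC_B"},
]

xref_panel = load_panel(io.StringIO("Accession\nACC_A\n"), has_header=True)
id_panel = load_panel(io.StringIO("note\tREACT_5744\n"), column=2)

kept_by_xref = [r["id"] for r in restrict_to_panel(results, xref_panel)]
kept_by_id = [r["id"] for r in restrict_to_panel(results, id_panel)]
```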
By using the --count parameter, a column summarizing the number of distinct results will be added.
The option --header-suffix can be used to append a particular string to the header of each new column. This option can be useful if the tool is run multiple times on the same file, for instance, using different panel files.
Additional options applicable to more than a single tool (such as --help/-h, --data-version, --version, --empty-cell) are summarized in common command line options.
If the --query-terms option has been used, the tool will report each matching term with its Reactome id, name, and category. A header line will be reported whenever results have been found. If no matching terms have been found, the output will consist of empty cells.
If the tool is queried by UniProt accession numbers or Reactome identifiers (--in), each result will be returned on a new line, where the number of results depends on the particular options that have been chosen. By default, no new columns will be added to the input file; the user has to request them explicitly via parameters.
Based on the selection, the output will contain additional columns labeled:
Reactome ID (option --id)
Reactome name (option --name)
Reactome category (option --category)
Xref (option --xref)
Count (option --count)
If the option --header-suffix is used, the string that was provided to this option will be added to the end of each new column header.
If the --collapse option has been set, the output will be condensed into the input line, i.e., no additional lines will be added to the input. If this leads to multiple values in a cell, they will be joined by a "|". Note that this can potentially result in very lengthy lines.
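The difference between the default output and the collapsed output can be sketched as follows. The functions expand and collapse are hypothetical, the accession ACC_A is a placeholder, and using "--" for an empty cell is an assumption based on the example output.

```python
def expand(input_row, results, empty_cell="--"):
    """Default layout: one output line per result."""
    if not results:
        return [input_row + [empty_cell]]
    return [input_row + [result] for result in results]


def collapse(input_row, results, empty_cell="--", separator="|"):
    """Collapsed layout: all results joined into one cell of the input line."""
    cell = separator.join(results) if results else empty_cell
    return input_row + [cell]


# Placeholder accession with the two A2M ids mentioned in the text.
row = ["ACC_A"]
found = ["REACT_3853", "REACT_3710"]

expanded_lines = expand(row, found)
collapsed_line = collapse(row, found)
no_result_line = collapse(row, [])
```

The collapsed form keeps the number of lines equal to the input, at the cost of cells that grow with the number of results.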
Given the input file /tmp/reactome-in.tsv,
Reactome Id	Comment
REACT_25.2	A reaction
REACT_0	An invalid ID
REACT_5744	A protein
REACT_3449.2	A complex
REACT_121006.2	A pathway
REACT_14751.1	A control
REACT_20239	An entitiy set
the command line
python ReactomeAnnotator.py --in reactome -c 1 -H --from-file /tmp/reactome-in.tsv --category --id --name --header-suffix " (New)"
will output the following lines:
Reactome Id	Comment	Reactome ID (New)	Reactome Name (New)	Reactome Category (New)
REACT_25.2	A reaction	REACT_25.2	kallikrein + alpha2-macroglobulin -> kallikrein:alpha2-macrogloulin	BiochemicalReaction
REACT_0	An invalid ID	--	--	--
REACT_5744	A protein	REACT_5744.2	ALB,Serum albumin precursor	Protein
REACT_3449.2	A complex	REACT_3449.2	Alpha2-macroglobulin	Complex
REACT_121006.2	A pathway	REACT_121006.2	Acyl chain remodeling of CL	Pathway
REACT_14751.1	A control	REACT_14751.1	'alpha2-macroglobulin [extracellular region]' negatively regulates 'Conversion of pro-apoA-I to apoA-I'	Control
REACT_20239	An entitiy set	REACT_20239.1	Vamp	Protein
[1] Croft D. et al. (2014) The Reactome pathway knowledgebase. Nucleic Acids Res. 42(Database issue):D472-7. doi: 10.1093/nar/gkt1102. [PMID 24243840]