Data Integrator
Galaxy integration

Most Dintegrator tools are meant to be integrated into Galaxy. The "galaxy" directory contains both the installation script (GalaxyInstall.py) and the Galaxy tool definitions.

"GalaxyInstall.py" supports multiple versions of the Dintegrator platform as well as multiple versions of each tool. The source directory structure enforces versioning for each tool, while the Dintegrator installation directory structure is designed to support multiple concurrent versions of the entire platform.

We refer to the directory structure in "galaxy" as the source directory structure. The directory that will contain multiple Dintegrator releases (and that is used by Galaxy itself) is the installation structure.

The source directory structure can simply be extracted from SVN. The installation structure, however, requires manual setup.

Source directory layout

The structure of the "galaxy" directory works as follows:

  • "tools/": should contain only subdirectories.
  • "tools/ToolName/": Each subdirectory defines the name/id of the tool that is going to be installed. This folder should contain only the XML master definition (described later), and more subdirectories, each defining the name of a tool release.
  • "tools/ToolName/ToolName.xml": This is the tool XML master definition. The file name must match exactly the name of the parent directory.
  • "tools/ToolName/RELEASE/": Each subdirectory defines a new release of the tool.
  • "tools/ToolName/RELEASE/ToolName.xml": This is the specialized-release XML definition for the tool.
  • "utils/": This directory can contain arbitrary files that are copied verbatim into the "galaxy/tools/TOOL_PREFIX" folder along with the tool definitions. It contains at least the "Dintegrator.sh" wrapper.
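As an example, a "TableJoiner" tool with two releases might be laid out as follows (the release names are illustrative only):

```text
galaxy/
├── tools/
│   └── TableJoiner/
│       ├── TableJoiner.xml        (master XML definition)
│       ├── 2013-11/
│       │   └── TableJoiner.xml    (release-specific definition)
│       └── 2014-03/
│           └── TableJoiner.xml
└── utils/
    └── Dintegrator.sh
```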

The master XML definition can contain any tag normally allowed in the Galaxy tool definition file: http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax

The main "<tool>" tag only needs the "version" attribute. Both name and ID are automatically inferred from the file/directory name.

If you define a "<command>" tag, this command is executed as a default, either when a release defines no specialized XML command or when no specialized XML exists at all.

"<command>" always has interpreter="sh" set. Any command is invoked by calling "utils/Dintegrator.sh", which runs the appropriate release of the tool and sets the required environment variables through "env.sh". Tools don't need to worry about this, as the invocation is transparent: simply give the path of the command, relative to the D-integrator's source root, as the first argument.
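As a sketch, a minimal master XML definition might look like the following (the tool path and parameter names are illustrative assumptions, not taken from an actual tool):

```xml
<tool version="1.0">
  <!-- name/id are inferred from the directory name; interpreter="sh"
       and the Dintegrator.sh wrapping are added automatically -->
  <command>
    src/perl/TableJoiner.pl --input '$input' --output '$output'
  </command>
  <inputs>
    <param name="input" type="data" format="tabular" label="Input table"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
</tool>
```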

Specialized XML definitions on the other hand have special treatment. The root node is always "<tool>", but no attributes are needed. Only the "<command>", "<inputs>" and "<help>" subtags are allowed.

All the "<inputs>" will be wrapped under a new conditional: the release. The release name/value itself is defined by the directory structure as outlined before. To access the value of the parameters of a specialized XML in the "<command>" tag, you need to prefix each variable with "$rel" (e.g. if $rel.value == 'test': ...).

The "<command>" tag is optional. When defined, it overrides the main XML command whenever the corresponding tool release is selected.
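A specialized release definition could then look like the following sketch (the parameter and tool path are hypothetical); note the "$rel" prefix needed to access release-specific parameters from "<command>":

```xml
<tool>
  <inputs>
    <param name="mode" type="select" label="Join mode">
      <option value="inner">Inner join</option>
      <option value="outer">Outer join</option>
    </param>
  </inputs>
  <!-- overrides the master command when this release is selected -->
  <command>
    src/perl/TableJoiner.pl --mode '$rel.mode' --input '$input' --output '$output'
  </command>
</tool>
```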

The "<help>" tag is optional. Any content in the specialized XML will be appended to the main "<help>" contents. Multiple "<help>" tags can coexist. Each tag can have the optional main_order="before" attribute to have the content prepended to the main help. This is mostly useful to prepend tips or warnings that should appear at the top of the page.

As a design rule, put only the main input and output datasets in the master XML definition, for maximum flexibility in the long term.

Help post-processing

"<help>" can be post-processed by defining additional keyword substitutions on the command line using the "--subst-help" flag. Normally, at least the "DOC_ROOT" keyword should be set to the publicly accessible documentation root for the D-integrator.

The following keywords are then available for use in the help:

$DOC_ROOT

Root of the D-integrator help

$DOC_PREFIX

Prefix for the help of each tool (defaults to "/doc")

$DOC_SUFFIX

Suffix for the help of each tool (defaults to ".html")

$DOC_TOOL

Current tool ID

$DOC_URL

ROOT + PREFIX + '/' + TOOL + SUFFIX

By executing:

GalaxyInstall.py --subst-help "DOC_ROOT=https://galaxy.gm.eurac.edu/doc" ...

a typical $DOC_URL would look like "https://galaxy.gm.eurac.edu/doc/doc/tool.html".

Help post-processing is currently used to reduce the number of places where documentation has to be maintained. The "<help>" tag should contain only help that is relevant to the Galaxy tool.

"GalaxyInstall.py" will refuse to continue if an undefined substitution is found anywhere in the help texts. At least "DOC_ROOT" must be defined.
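For example, a master "<help>" section could reference the generated documentation like this (the help text itself is illustrative):

```xml
<help>
Joins two unsorted tables based on a common column.

Full documentation: $DOC_URL
</help>
```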

Extended XML attributes in command definitions

The following extra commands/attributes are supported in addition to the Galaxy tool definition format:

<param type="text" sanitize="no"/>:

The sanitize="no" attribute can be supplied to <param type="text"> params. This attribute disables the escaping performed by default by Galaxy, so that the output variable contains the original string unadulterated. As a result, you should obviously be careful when using this text on the command line, as it could contain shell escapes.

To supply such values on the command line, a common recipe is the following:

#set var = str($rel.var).replace("'", "'\\''")
command '$var'
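The same recipe can be sketched outside of Cheetah; this hypothetical Python helper (not part of the D-integrator code) mirrors what the two lines above do, wrapping the value in single quotes and escaping any embedded single quote as '\'':

```python
import subprocess

def shell_single_quote(value):
    # Escape embedded single quotes as '\'' and wrap the result in single
    # quotes, so that the shell passes the string through unmodified.
    return "'" + value.replace("'", "'\\''") + "'"

# Shell metacharacters survive the round-trip verbatim:
raw = "it's a $test with `backticks`"
result = subprocess.run(["sh", "-c", "printf %s " + shell_single_quote(raw)],
                        capture_output=True, text=True)
assert result.stdout == raw
```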

<options from_exec=""/>:

In addition to the normal attributes, from_exec="" allows the "<options>" tag to be expanded into a list of "<option>" tags by calling an executable. The output is expected to be tab-separated, with two columns containing IDs and values respectively.

from_exec="" is meant to expand any shell command.

If you're calling a D-integrator script to provide an expansion for "<options>" though, you should use from_tool="" instead.

You can use the variable $DI_RELEASE when expanding release-specific parameters. This variable is undefined in the master XML expansion.

The variable $DI_PREFIX is also guaranteed to be set to the D-integrator's installation root, in case it is needed.

For backward compatibility, the working directory of the command is "galaxy/" within the specified release of the D-integrator's source. When expanding in the master XML, the CWD is still "galaxy/", but within the current release.

The command is expected to produce at least one line of output.

You can use the default="" attribute to pre-select a value (it must match the first column of the output).

The expanded options are appended to pre-existing ones (if any).
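A sketch of a from_exec="" expansion (the command, file name and default value are purely illustrative); the executable is expected to print one tab-separated "ID<TAB>Value" line per option:

```xml
<param name="genome" type="select" label="Genome build">
  <options from_exec="cut -f1,2 ../data/genomes.tsv" default="hg19"/>
</param>
```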

Remarks

In old code (up to release 2012-08), the "./utils/Dintegrator.sh" script was used directly instead of from_tool="". Typically, calls to tools would be wrapped as "./utils/Dintegrator.sh $DI_RELEASE tool". This usage is now strictly deprecated.

<options from_tool=""/>:

Just like from_exec="", from_tool="" allows the "<options>" tag to be expanded into a list of "<option>" tags by calling a D-integrator tool. The output of the tool is expected to be tab-separated, with two columns containing IDs and values respectively.

Invocation works like the main "<command>" tag: the command line must begin with the path of the tool relative to the D-integrator's source root, and is automatically wrapped by calling "utils/Dintegrator.sh" with the appropriate release set.

You can use the variable $DI_RELEASE when expanding release-specific parameters, though thanks to the invocation trampoline this is largely unneeded. As for from_exec="", the variable $DI_PREFIX is also available and set to the installation's root.

The tool is expected to produce at least one line of output.

You can use the default="" attribute to pre-select a value (it must match the first column of the output).

The expanded options are appended to pre-existing ones (if any).
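Analogously, a from_tool="" expansion calls a D-integrator script by its source-relative path, with the wrapping performed automatically (the script name below is hypothetical):

```xml
<param name="dataset" type="select" label="Reference dataset">
  <options from_tool="src/python/cmd/ListDatasets.py" default="default"/>
</param>
```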

<param main_order=""/>:

Children of the "<inputs>" tag ("<param>") in the master XML definition can contain the extra main_order="" attribute. The value can be either "before" or "after" (the default).

By default, master XML parameters are appended after release-specific ones. By setting main_order="before", parameters can be moved before release-specific parameters.

This is sometimes required for data dependencies (especially for "data" parameters referenced by subsequent "data_column" params).
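For instance, a master "data" parameter that release-specific "data_column" parameters refer to can be forced before them (names are hypothetical):

```xml
<inputs>
  <param name="table" type="data" format="tabular" label="Input table"
         main_order="before"/>
</inputs>
```

A release-specific <param type="data_column" data_ref="table" .../> can then reference the dataset, since it now precedes it in the final definition.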

Dintegrator installation layout

While the source tree holds copies of the Galaxy tool definitions for all past main releases, the programs themselves must be extracted manually from SVN.

The "utils/Dintegrator.sh" script is responsible for calling the executables within the project, and expects the following directory structure:

  • "/": D-integrator installation root, containing a list of subdirectories named after each release.
  • "/RELEASE/": each subdirectory should be named after the release, and contain the full copy from SVN extracted at the specific release.
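For example (the release names are illustrative):

```text
/dintegrator/
├── 2013-11/
├── 2014-03/
└── LATEST/
```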

Dintegrator installation

To install Dintegrator from scratch, the following steps are necessary:

  1. Extract the galaxy source into a directory called "/galaxy".
  2. Customize/test the galaxy installation as needed.
  3. Create a directory to hold the D-integrator's installation root: "/dintegrator/".
  4. Extract the most current sources into "/dintegrator/LATEST/".
  5. Extract past releases into "/dintegrator/RELEASE/".
  6. Copy the relevant (release related) data files into "/dintegrator/RELEASE/data".
  7. Run "src/python/cmd/GalaxyInstall.py --subst-help DOC_ROOT=url /dintegrator/LATEST /dintegrator/ /path/to/galaxy/dir".

The URL should be served by your web server and point to the generated documentation inside "/dintegrator/LATEST/html/".

When starting the galaxy daemon, the environment variable "DI_PREFIX" must point to the absolute path to the D-integrator's installation root ("/dintegrator" in the previous example).
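For example (the Galaxy start command may differ in your setup):

```shell
# Make the D-integrator installation root visible to the wrapper scripts
export DI_PREFIX=/dintegrator
cd /galaxy && sh run.sh
```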

Remember also to install the various EnsEMBL API versions correctly, as described in EnsEMBL API installation.

Galaxy unit testing

Since Galaxy is driven by a web interface, automated testing is a bit more complicated than for command line tools. We offer a test suite which facilitates testing with Galaxy. In short, it is composed of one Galaxy workflow for Python modules and another for Perl modules.

Running a unit test on Galaxy

An up-to-date version of Galaxy allows multiple files to be uploaded at once. The files needed to run the unit tests on Galaxy are copied to any desired directory with the galaxy/utest/prepare.sh script, which distinguishes between Python and Perl unit tests by an argument. The files are then uploaded into Galaxy all at once in the "Analyze data" tab with the "Load Data" tool button. It is best to specify the "Tabular" file type. For later reuse, it is recommended to save the history.

The directory galaxy/utest contains two unit test workflows, Unit_test_workflow_Perl.ga and Unit_test_workflow_Python.ga. Most likely they will contain a release version which is not the one to be tested, so we recommend changing the release version to the desired YYYY-MM version with a shell command like

sed 's/2014-03/YYYY-MM/' Unit_test_workflow_Python.ga >up2date_workflow.ga

This (modified) workflow file is then loaded into Galaxy in the "Workflow" tab with "Upload or import workflow". Afterwards, we start it by clicking on the imported workflow and selecting "Run" from the context-sensitive drop-down menu.

When running the workflow, many input datasets need to be specified. This is facilitated by a) the uploaded set of input files and b) the name of the input file, which is shown next to the input dataset in the dataset form. There is a "type to filter" text box below the drop-down menu for the input dataset; it can be used to restrict the suggestions in the drop-down menu by name, so typing the dataset name shown above automatically selects the correct input dataset. Note that a few test cases need two input datasets, for example the "TableJoiner - Join two unsorted tables based on a common column" tool.

After specifying all input datasets, the test is started with the "Run workflow" button at the bottom of the page. It is a good idea to check "Copy results into new history" so that the input datasets are not polluted with the results from the unit test (see also the description on loading files as input datasets).

The output and log files already carry the names of the corresponding command-line unit test files, which is very convenient for automated comparison. Get these output files from the "History" pane with "Export to File". Depending on your browser, this may be a gzipped file containing a tar archive. Use the latter as input for the automated comparison script,

compare.sh [python|perl] galaxy_export.tar

which lists the unit test files and reports whether the diff for each one succeeded or failed.

Implementing a new test case

This section is addressed mainly to developers of the Data Integrator suite, so it is just a list of bullet points to consider when creating a new test case in the Galaxy framework.

  • Choose a simple unit test case from the command line unit test suite. This test case has an input file and an output file, they will be used later.
  • Edit galaxy/utest/prepare.sh file and add the input file name.
  • First, check that the test runs correctly in Galaxy with the input file and the parameters set.
  • Then add it to either Python or Perl unit test workflow.
    • Right click the workflow name, "Edit".
    • Click on the "Tools" pane to add the tool to be tested to the workflow.
    • In the "Details" pane, set the parameters for the tool and rename the output (and, if it exists, the log) file to the unit test case output file name. Also annotate ("Annotation/Notes") the tool with the test case number, i.e. test_NNN.
    • An input file is added through the "Tools" pane, "Workflow control" -> "Inputs" -> "Input dataset". Connect the input and the unit test case. Rename the input dataset to the unit test input file name.
Authors
Yuri D'Elia
Chris X. Weichenberger
Date
2014-03-31