Data Integrator
The purpose of most Dintegrator tools is to be integrated into Galaxy. The "galaxy" directory contains both the installation script (GalaxyInstall.py) and the Galaxy tool definitions.
"GalaxyInstall.py" supports multiple versions of the Dintegrator platform as well as multiple versions of each tool. The directory structure enforces versioning for each tool, while the installation directory structure of Dintegrator is laid out to support multiple concurrent versions of the entire platform.
We refer to the directory structure in "galaxy" as the source directory structure. The directory that contains multiple Dintegrator releases (and is used by Galaxy itself) is the installation structure.
The source directory structure can simply be extracted from SVN. The installation structure, however, requires manual setup.
The structure of the "galaxy" directory works as follows:
"tools/": should contain only subdirectories.
"tools/ToolName/": each subdirectory defines the name/ID of the tool that is going to be installed. This folder should contain only the XML master definition (described later) and more subdirectories, each defining the name of a tool release.
"tools/ToolName/ToolName.xml": this is the tool XML master definition. The file name must match exactly the name of the parent directory.
"tools/ToolName/RELEASE/": each subdirectory defines a new release of the tool.
"tools/ToolName/RELEASE/ToolName.xml": this is the specialized-release XML definition for the tool.
"utils/": this directory can contain any arbitrary file, which is copied verbatim into the "galaxy/tools/TOOL_PREFIX" folder along with the tool definitions. It contains at least the "Dintegrator.sh" wrapper.
The master XML definition can contain any tag normally allowed in the Galaxy tool definition file: http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax
The main "<tool>" tag only needs the "version" attribute. Both name and ID are automatically inferred from the file/directory name.
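The layout rules and the name inference described above can be illustrated with a minimal sketch. It is not the code of "GalaxyInstall.py"; the function name scan_tools and the choice to skip hidden entries (such as SVN metadata) are assumptions made for this example.

```python
import os


def scan_tools(tools_dir):
    """Map each tool ID to the sorted list of its release names.

    The tool ID is taken from the subdirectory name; a tool is accepted
    only if it holds a master XML named after the directory itself.
    """
    tools = {}
    for tool_id in sorted(os.listdir(tools_dir)):
        if tool_id.startswith("."):
            continue  # assumption: ignore hidden entries such as ".svn"
        tool_path = os.path.join(tools_dir, tool_id)
        if not os.path.isdir(tool_path):
            raise ValueError("unexpected file in tools/: %s" % tool_id)
        master = os.path.join(tool_path, tool_id + ".xml")
        if not os.path.isfile(master):
            raise ValueError("missing master definition: %s" % master)
        # Every subdirectory of the tool directory names one release.
        tools[tool_id] = sorted(
            entry for entry in os.listdir(tool_path)
            if not entry.startswith(".")
            and os.path.isdir(os.path.join(tool_path, entry))
        )
    return tools
```

Calling scan_tools("galaxy/tools") on a valid source tree would return one entry per tool, each listing the releases found on disk.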
If you define a "<command>" tag, this command is executed by default, both when no specialized XML command is defined for a release and when no specialized XML command exists at all.
"<command>" always has interpreter="sh" set. Any command will be invoked by calling "utils/Dintegrator.sh", which runs the appropriate release of the tool and sets the required environment variables through "env.sh". Tools don't need to worry about this though, as the invocation is transparent. Simply give the full path of the command relative to the D-integrator's source root as the first argument.
Specialized XML definitions, on the other hand, receive special treatment. The root node is always "<tool>", but no attribute is needed. Only the "<command>", "<inputs>" and "<help>" subtags are allowed.
All the "<inputs>" will be wrapped under a new conditional: the release. The release name/value itself is defined by the directory structure as outlined before. To access the value of a parameter of a specialized XML in the "<command>" tag, you need to prefix each variable with "$rel" (such as if $rel.value == 'test': ...).
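To make the wrapping more concrete, here is a rough sketch of how release-specific inputs could be grouped under a Galaxy conditional named "rel". This is only an illustration, not the installer's actual code: the function name, the selector parameter name "release" and its label are assumptions; only the conditional name "rel" is taken from the text above.

```python
import xml.etree.ElementTree as ET


def wrap_release_inputs(release_inputs):
    """Build a <conditional name="rel"> from {release name: [<param> elements]}."""
    cond = ET.Element("conditional", name="rel")
    # Hypothetical selector: one <option> per release found on disk.
    selector = ET.SubElement(
        cond, "param", name="release", type="select", label="Release"
    )
    for release in sorted(release_inputs):
        ET.SubElement(selector, "option", value=release).text = release
    # One <when> block per release, holding that release's own parameters,
    # which makes them reachable as $rel.<param name> in <command>.
    for release in sorted(release_inputs):
        when = ET.SubElement(cond, "when", value=release)
        when.extend(release_inputs[release])
    return cond
```

With a specialized parameter named "value" inside a release, the Cheetah expression $rel.value then resolves to that parameter, which matches the prefix rule described above.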
The "<command>" tag is optional. When defined, it overrides the main XML command when the specified tool release is selected.
The "<help>" tag is optional. Any content in the specialized XML will be appended to the main "<help>" contents. Multiple "<help>" tags can coexist. Each tag can have the optional main_order="before" attribute to have the content prepended to the main help. This is mostly useful to prepend tips or warnings that should appear at the top of the page.
As a design rule, only put the main "<input>" and "<output>" datasets in the master XML definition for maximum flexibility in the long term.
"<help>" can be post-processed by defining additional keyword substitutions on the command line using the "--subst-help" flag. Normally, at least the "DOC_ROOT" keyword should be set to the publicly accessible documentation for the D-integrator.
The following keywords are then available for use in the help:
Root of the D-integrator help
Prefix for the help of each tool (defaults to "/doc")
Suffix for the help of each tool (defaults to ".html")
Current tool ID
By executing:
a typical $DOC_URL would look like "https://galaxy.dm.eurac.edu/doc/doc/tool.html".
Help post-processing is currently used to reduce the number of places where documentation has to be maintained. The "<help>" tag should contain only help which is relevant to the Galaxy tool.
"GalaxyInstall.py" will refuse to continue if an undefined substitution is found anywhere in the help texts. At least "DOC_ROOT" must be defined.
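The substitution behaviour can be approximated with Python's string.Template, which also fails on undefined keywords. This is a sketch under assumptions, not the installer's implementation: the function names and the prefix/suffix parameter names are hypothetical, and the way the URL is composed is inferred only from the sample URL and the documented defaults ("/doc" and ".html").

```python
from string import Template


def build_doc_url(doc_root, tool_id, prefix="/doc", suffix=".html"):
    """Compose a per-tool documentation URL (composition is an assumption)."""
    return "%s%s/%s%s" % (doc_root.rstrip("/"), prefix, tool_id, suffix)


def expand_help(help_text, substitutions):
    """Expand $KEYWORD references, refusing any undefined keyword."""
    try:
        return Template(help_text).substitute(substitutions)
    except KeyError as err:
        raise ValueError("undefined help substitution: %s" % err.args[0])
```

For example, with DOC_ROOT set to "https://galaxy.dm.eurac.edu/doc" and a tool ID of "tool", build_doc_url yields the sample URL quoted above, and expand_help raises an error as soon as a help text references a keyword that was never defined.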
The following extra commands/attributes are supported in addition to the Galaxy tool definition format:
The sanitize="no" attribute can be supplied to <param type="text"> tags. This attribute disables any escaping done by default by Galaxy, so that the output variable contains the original string unaltered. As a result, you should be careful when using this text on the command line, as it could contain shell escapes.
To supply such values on the command line, a common recipe is the following:
In addition to the normal attributes, from_exec="" expands the "<options>" tag into a list of "<option>" tags by calling an executable. The output is expected to be tab-separated, with two columns holding IDs and values respectively.
from_exec="" is meant to expand any shell command. If you're calling a D-integrator script to provide an expansion for "<options>" though, you should use from_tool="" instead.
You can use the variable $DI_RELEASE when expanding in release-specific parameters. This variable is undefined in the master XML expansion.
The variable $DI_PREFIX is also guaranteed to be set to the D-integrator's installation root, in case it is needed.
For backward compatibility, the working directory of the command is "galaxy/" from within the specified release of the D-integrator's source. When expanding in the master XML, the CWD is still "galaxy/" but within the current release.
The command is expected to produce at least one line of output.
You can use the default="" attribute to pre-select a value (it must match the first column of the output).
The expanded options are appended to pre-existing ones (if any).
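The expected output format can be illustrated with a small parser. This is a sketch only: the function names are hypothetical, and falling back to the ID when the second column is missing is an assumption that the text above does not specify.

```python
import subprocess


def parse_options(output, default=None):
    """Turn tab-separated "ID<TAB>Value" lines into (id, label, selected) tuples."""
    options = []
    for line in output.splitlines():
        if not line.strip():
            continue
        ident, _, label = line.partition("\t")
        # Assumption: a line without a second column reuses the ID as label.
        options.append((ident, label or ident, ident == default))
    if not options:
        raise ValueError("the command must produce at least one line of output")
    return options


def expand_from_exec(command, cwd, env, default=None):
    """Run a shell command and parse its standard output into options."""
    result = subprocess.run(
        command, shell=True, cwd=cwd, env=env, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return parse_options(result.stdout, default)
```

Here cwd would be the "galaxy/" directory of the relevant release and env a copy of the environment with DI_PREFIX (and, for release-specific parameters, DI_RELEASE) set, as described above.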
from_tool="": Typically, calls to tools would be wrapped as "./utils/Dintegrator.sh $DI_RELEASE tool". This usage is strictly deprecated though. Just like from_exec="", from_tool="" expands the "<options>" tag into a list of "<option>" tags, but by calling a D-integrator tool. The output of the tool is expected to be tab-separated, with two columns holding IDs and values respectively.
Invocation works like the main "<command>" tag, in that the command line must begin with the path of the tool relative to the D-integrator's source root and is automatically wrapped by calling "utils/Dintegrator.sh" with the appropriate release set.
You can use the variable $DI_RELEASE when expanding in release-specific parameters, though thanks to the invocation trampoline this is largely unneeded. As for from_exec="", the variable $DI_PREFIX is also available and set to the installation's root.
The tool is expected to produce at least one line of output.
You can use the default="" attribute to pre-select a value (it must match the first column of the output).
The expanded options are appended to pre-existing ones (if any).
Children of the "<inputs>" tag ("<param>") in the master XML definition can contain the extra main_order="" attribute. The value can be either "before" or "after" (the default).
By default, master XML parameters are appended after release-specific ones. By setting main_order="before", parameters can be moved before release-specific parameters.
This is sometimes required for data dependencies (especially for "data" parameters that are referenced by subsequent "data_column" parameters).
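The ordering rule can be summarised by the following sketch. Parameters are modelled as plain dictionaries purely for illustration, and the function name is hypothetical; the real installer operates on XML elements.

```python
def order_params(master_params, release_params):
    """Place master parameters around the release-specific ones.

    Master parameters flagged main_order="before" come first, then the
    release-specific parameters, then the remaining master parameters.
    """
    before = [p for p in master_params if p.get("main_order") == "before"]
    after = [p for p in master_params if p.get("main_order", "after") != "before"]
    return before + list(release_params) + after
```

A master "data" parameter marked main_order="before" therefore ends up ahead of a release-specific "data_column" parameter that refers to it.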
While the source tree holds all the copies of the past main releases of the Galaxy tools, the programs themselves must be extracted manually from SVN.
The "utils/Dintegrator.sh" script is responsible for calling the executables within the project, and expects the following directory structure:
"/": D-integrator installation root, containing a list of subdirectories named after each revision.
"/RELEASE/": each subdirectory should be named after the release, and contain the full copy from SVN extracted at the specific release.
To install Dintegrator from scratch, the following steps are necessary:
"/galaxy".
"/dintegrator/".
"/dintegrator/LATEST/" "/dintegrator/RELEASE/".
"/dintegrator/RELEASE/data".
"src/python/cmd/GalaxyInstall.py --subst-help DOC_ROOT=url /dintegrator/LATEST /dintegrator/ /path/to/galaxy/dir".
URL should be configured by your web server, and point to the generated documentation inside "/dintegrator/LATEST/html/".
When starting the Galaxy daemon, the environment variable "DI_PREFIX" must point to the absolute path of the D-integrator's installation root ("/dintegrator" in the previous example).
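Before starting the daemon, a quick sanity check of the installation structure may help. The following is a hedged sketch, not part of the D-integrator: the function name is hypothetical, and it only verifies that "DI_PREFIX" is an absolute path and lists the release subdirectories found under it.

```python
import os


def list_releases(environ=os.environ):
    """Return the release directories found under the DI_PREFIX root."""
    prefix = environ.get("DI_PREFIX")
    if not prefix or not os.path.isabs(prefix):
        raise ValueError("DI_PREFIX must be set to an absolute path")
    # Each subdirectory of the installation root is named after a release.
    return sorted(
        name for name in os.listdir(prefix)
        if os.path.isdir(os.path.join(prefix, name))
    )
```

With the layout from the previous example, list_releases({"DI_PREFIX": "/dintegrator"}) would list "LATEST" together with every extracted release.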
Remember also to install the various EnsEMBL API versions correctly, as described in EnsEMBL API installation.
Since Galaxy is driven by a web interface, automated testing is a bit more complicated than for command line tools. We offer a test suite which facilitates testing with Galaxy. In short, it is composed of one Galaxy workflow for Python modules and another for Perl modules.
An up-to-date version of Galaxy allows uploading multiple files at once. The files needed to run the unit tests on Galaxy are copied to any desired directory with the galaxy/utest/prepare.sh script, which distinguishes between Python and Perl unit tests by an argument. The files are then uploaded at once in Galaxy in the "Analyze data" tab with the "Load Data" tool button. It is best to specify the "Tabular" file type. For later reuse, it is recommended to save the history.
The directory galaxy/utest contains two unit test workflows, Unit_test_workflow_Perl.ga and Unit_test_workflow_Python.ga. Most likely they will contain a release version which is not the one to be tested, so we recommend changing the release version to the desired YYYY-MM version with a shell command like
sed 's/2014-03/YYYY-MM/g' Unit_test_workflow_Python.ga >up2date_workflow.ga
This (modified) workflow file is then loaded into Galaxy in the "Workflow" tab with "Upload or import workflow". Afterwards, we start it by clicking on the imported workflow and selecting "Run" from the context sensitive drop-down menu.
When running the workflow, many input datasets need to be specified. This is facilitated by a) the uploaded set of input files and b) the name of the input file, which is written next to the input dataset in the dataset form. There is a "type to filter" text box below the drop-down menu for the input dataset. It can be used to restrict by name the suggestions in the drop-down menu, so choosing the dataset name printed above will automatically select the correct input dataset. It should be mentioned that a few test cases need two input data sets, for example the TableJoiner - Join two unsorted tables based on a common column tool.
After specifying all input datasets, the test is started with the "Run workflow" button at the bottom of the page. It's a good idea to check "Copy results into new history" such that the input dataset is not polluted with the results from the unit test (see also the description on loading files as input datasets).
The output and log files already have the names of the corresponding command line unit test files, which is very convenient for automated comparison. Get these output files from the "History" pane with "Export to File". Depending on your browser, this may be a gzipped file which contains a compressed tar file. Use the latter as input for an automated comparison script,
compare.sh [python|perl] galaxy_export.tar
which names the unit test files and tells whether a diff command succeeded or failed.
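The idea behind such a comparison can be sketched in Python as follows. This is not the compare.sh script itself, whose internals are not described here: the function name is hypothetical, and the directory of expected command line results passed as expected_dir is an assumption of this example.

```python
import os
import tarfile


def compare_export(tar_path, expected_dir):
    """Compare exported files against expected ones with the same name.

    Returns {file name: True if identical, False otherwise}; archive
    members without an expected counterpart are skipped.
    """
    results = {}
    # "r:*" lets tarfile detect a compressed or plain archive on its own.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            name = os.path.basename(member.name)
            expected = os.path.join(expected_dir, name)
            if not os.path.isfile(expected):
                continue
            with open(expected, "rb") as handle:
                wanted = handle.read()
            results[name] = archive.extractfile(member).read() == wanted
    return results
```

Because the exported files carry the names of the command line unit test files, matching by file name is enough to pair each Galaxy result with its reference.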
This section is aimed at developers of the Data Integrator suite, so it is just a list of bullet points to consider when creating a new test case in the Galaxy framework.
galaxy/utest/prepare.sh
file and add the input file name. test_NNN
.