Data Integrator
Common command line options

Most tools share the command line options summarized in this list:

  • -h, --help Show an extended help text for the respective tool and exit with exit code 0 (no error).

  • --version Print the tool's version number and exit (exit code 0, no error).

  • -H This option indicates that each input file contains a header row. All tools then add header columns to the output, too. When multiple input files are given, files with and without headers cannot be mixed.

  • -c Specify the input column(s). There is a slight difference between tools implemented in Perl and those implemented in Python. For technical reasons, Perl tools accept multiple column numbers only as a comma-separated list without white space characters, e.g. -c 4,10,15. The form -c 4, 10, 15 is incorrect for Perl command line tools. For Python tools, however, multiple columns are given as separate values, e.g. -c 4 10 15.

  • --data-version Since releases are tied to data, this option selects which data set is used when there are multiple data sets for a given release. Data sets are numbered v-00, v-01, ..., where v-00 is the default data set, which is always available. See the file data-sources-release-versions.ods for a listing of all data sets that are available in the respective releases.

  • --empty-cell A cell can be empty, but experience with importing tab-separated files into Excel and OpenOffice shows that completely blank cells can be inconvenient. (E.g. when exporting a table from Excel into a csv file, trailing empty cells are not output, so the table is no longer an N x M matrix but has a varying number of columns.) All tools recognize two types of empty cells in input files: cells with only white space or no characters at all, and cells containing the empty-cell identifier, which defaults to --. This option sets a non-default empty-cell identifier. The Perl tools parsing tabular data accept multiple empty-cell identifiers (by repeating the --empty-cell flag). The first specified identifier is also used for output, whereas all the others apply to input only.

  • --permissiveness, -p Choose the action to perform when the input file contains incorrect rows. The user can choose between three options:

      • echo: print the line in question without any further processing and issue a warning message.

      • skip: do not print the line in question, so it is removed from the output; a warning message is written.

      • stop: if an erroneous line is encountered, processing is stopped with an error message and an exit code of 1.

  • --ensembl-hosts Use another server for Ensembl-related online queries. This only applies to the set of Perl tools that run queries in real time; each of them mentions in some way that it uses the Ensembl Perl API.

  • --in Specify input format of the column referred to by -c.

  • --out Specify output format of the column to be added. If there are multiple columns to be added, they are usually specified by options.

  • --list-in List possible input formats if a tool offers multiple ways of input. This option is mainly used by the Galaxy installer to automatically generate a list of possible options in a drop-down box.

  • --list-out List possible output formats if a tool offers multiple ways of output. See the note on --list-in above.

  • --np Perl tools parsing tabular data automatically write progress information and an estimated completion time to standard error; this flag disables that output. The progress information is only output when the tool is run from a terminal (it is automatically disabled for batch jobs). Progress information is only accurate when --preload is also used (otherwise a rough estimate based on file size is used).

  • --preload Perl tools parsing tabular data read and process the input line by line when possible. This behavior can be changed by supplying --preload, which loads the input file into memory and checks it for consistency before doing anything else. As a result of pre-loading, the progress status and completion estimate are accurate (they are only approximated otherwise). The file is automatically preloaded when it is sensible to do so (when the input is a file and the file is smaller than 16 MB).

  • -v Perl tools accept a -v or --verbose flag, which increases verbosity. By default, only warnings are shown, preceded by "WARN". A single -v flag also shows informational messages (preceded by "INFO:"), which include connection status, input dataset sizes, command execution times, etc. A second -v flag also shows debugging traces (preceded by "DEBUG:"). These include EnsEMBL re/connections, auto_retry activity, and more. GalaxyInstall shows progress over each tool/release with a single flag, and also tool expansion with a second one.
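
The two -c syntaxes and the three --permissiveness actions listed above can be illustrated with a short Python sketch. It is not taken from the Data Integrator sources: the function names parse_columns and filter_rows, the is_valid callback and the wording of the messages are hypothetical and only mirror the behavior described in this list.

```python
import sys


def parse_columns(tokens, style):
    """Turn the values following -c into a list of column numbers.

    style="perl":   exactly one token such as "4,10,15" (no white space).
    style="python": one token per column, such as ["4", "10", "15"].
    """
    if style == "perl":
        if len(tokens) != 1:
            raise ValueError("Perl-style -c takes one comma-separated value")
        parts = tokens[0].split(",")
    elif style == "python":
        parts = list(tokens)
    else:
        raise ValueError(f"unknown style: {style!r}")
    # int() raises ValueError for empty strings and stray commas ("", "4,").
    return [int(part) for part in parts]


def filter_rows(rows, is_valid, mode, log=sys.stderr):
    """Yield rows according to the echo / skip / stop actions.

    is_valid is a hypothetical callback that decides whether a row is
    correct. The message wording is illustrative, not the tools' output.
    """
    for number, row in enumerate(rows, start=1):
        if is_valid(row):
            yield row
        elif mode == "echo":
            print(f"WARN line {number}: passed through unprocessed", file=log)
            yield row
        elif mode == "skip":
            print(f"WARN line {number}: removed from output", file=log)
        elif mode == "stop":
            print(f"ERROR line {number}: incorrect row, stopping", file=log)
            raise SystemExit(1)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
```

With this sketch, parse_columns(["4,10,15"], "perl") and parse_columns(["4", "10", "15"], "python") yield the same three column numbers, whereas the tokens produced by -c 4, 10, 15 are rejected in both styles.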

It is important to mention that DIntegrator does not change the content of tables; it only adds data. Empty cells are therefore used to indicate that no data are available for a given row in a given column. Each tool then handles this information according to its specific goal. For example, if an input column contains an empty cell as defined above, in most cases empty cells are output for that row, simply because there is no information available.
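
The empty-cell rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the tools' actual code: the names is_empty, pad_row and add_column are hypothetical, the lookup dictionary stands in for whatever data source a tool consults, and column numbers are assumed to be 1-based.

```python
DEFAULT_EMPTY = "--"  # default empty-cell identifier named in the text


def is_empty(cell, identifiers=(DEFAULT_EMPTY,)):
    """True for blank or white-space-only cells and for any identifier."""
    stripped = cell.strip()
    return stripped == "" or stripped in identifiers


def pad_row(row, width, identifier=DEFAULT_EMPTY):
    """Restore trailing cells that a spreadsheet export may have dropped."""
    return row + [identifier] * (width - len(row))


def add_column(row, column, lookup, identifiers=(DEFAULT_EMPTY,)):
    """Return the row plus one new cell; existing cells stay untouched.

    column is assumed to be 1-based. The first identifier is the one
    written to the output, as described for --empty-cell.
    """
    key = row[column - 1]
    if is_empty(key, identifiers):
        new_cell = identifiers[0]  # nothing to look up for this row
    else:
        # A missing entry also yields an empty cell: no data available.
        new_cell = lookup.get(key.strip(), identifiers[0])
    return row + [new_cell]
```

For example, add_column(["id1", "--"], 2, {"id1": "x"}) returns ["id1", "--", "--"]: the input cells are unchanged and the added cell is empty because the selected column holds no value.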