Data Integrator
|
One of the major drawbacks of the Unix join
command is that both files need to be sorted by their common join column. In addition, the join column is always put in the first row and it is omitted after then. All these limitations and the desire to have a tool at hands that is able to perform an intersection operation led to the tool described in this document.
The TableJoiner
tool joins to files, the basefile
and the joinfile
, by finding matching lines for a basefile
column in joinfile
. This operation is asymmetric, since the joinfile
is read into memory and then it is used to look up data from the basefile
join column line by line. The tool is able to print unpaired lines from the basefile
, only print a subset of the columns available in joinfile
, and implements a summary mode where only the number of matching lines from joinfile
are output.
The tool is limited by the size of joinfile
, as it is read into memory. This is the cost to be paid for unsorted columns.
Minimal input for the TableJoiner tool are the two files that need to be joined, basefile
and joinfile
, and for each of these files, the common column. Since this tool represents an improvement of the Unix join
command, options have been named like in the Unix sibling.
Optionally, it is possible to print lines from basefile
where no corresponding entry is found in file joinfile
. In this case, all joined columns contain the empty cell string. A separate column with a fixed string can be added to the output for easily identifying the source of the join. (This is especially helpful when consecutively joining multiple files.) Alternatively, it is possible to modify the header of the joined column such that a fixed string is a added to the column header(s). Normally, all columns of joinfile
lines are appended as new columns to the lines from basefile
. In this case, the join column itself will be output twice, once in basefile
, the other time in joinfile
. However, only a subset of the columns in joinfile
can be printed in the joined table, where then the join column can be omitted. Finally, instead of joining rows from the joinfile
, only the number of matching rows can be output. Here, columns are simply called Join lines count
.
Options applicable to more than a single tool are summarized in common command line options.
The following examples illustrate usage and output of this tool. Input files are given below:
/tmp/base.tsv
:
Item Value a 1 b 2 c 3
/tmp/join.tsv
:
Item Comment a 1' a 1'' c 3'
Representative examples are provided below:
chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 Item Value Item Comment a 1 a 1' a 1 a 1'' c 3 c 3' chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 Item Value Comment a 1 1' a 1 1'' c 3 3' chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)' Item Value Comment(join) a 1 1' a 1 1'' c 3 3' chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)' -a Item Value Comment(join) a 1 1' a 1 1'' b 2 -- c 3 3' chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)' -a --summary Item Value Join lines count(join) a 1 2 b 2 0 c 3 1