Data Integrator
TableJoiner - Join two unsorted tables based on a common column

One of the major drawbacks of the Unix join command is that both files need to be sorted by their common join column. In addition, the join column is always put in the first row and it is omitted after then. All these limitations and the desire to have a tool at hands that is able to perform an intersection operation led to the tool described in this document.

The TableJoiner tool joins to files, the basefile and the joinfile, by finding matching lines for a basefile column in joinfile. This operation is asymmetric, since the joinfile is read into memory and then it is used to look up data from the basefile join column line by line. The tool is able to print unpaired lines from the basefile, only print a subset of the columns available in joinfile, and implements a summary mode where only the number of matching lines from joinfile are output.

The tool is limited by the size of joinfile, as it is read into memory. This is the cost to be paid for unsorted columns.

Input

Minimal input for the TableJoiner tool are the two files that need to be joined, basefile and joinfile, and for each of these files, the common column. Since this tool represents an improvement of the Unix join command, options have been named like in the Unix sibling.

Optionally, it is possible to print lines from basefile where no corresponding entry is found in file joinfile. In this case, all joined columns contain the empty cell string. A separate column with a fixed string can be added to the output for easily identifying the source of the join. (This is especially helpful when consecutively joining multiple files.) Alternatively, it is possible to modify the header of the joined column such that a fixed string is a added to the column header(s). Normally, all columns of joinfile lines are appended as new columns to the lines from basefile. In this case, the join column itself will be output twice, once in basefile, the other time in joinfile. However, only a subset of the columns in joinfile can be printed in the joined table, where then the join column can be omitted. Finally, instead of joining rows from the joinfile, only the number of matching rows can be output. Here, columns are simply called Join lines count.

Options applicable to more than a single tool are summarized in common command line options.

Output examples

The following examples illustrate usage and output of this tool. Input files are given below:

/tmp/base.tsv:

Item    Value
a       1
b       2
c       3

/tmp/join.tsv :

Item    Comment
a       1'
a       1''
c       3'

Representative examples are provided below:

chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1
Item    Value   Item    Comment
a       1       a       1'
a       1       a       1''
c       3       c       3'
chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2
Item    Value   Comment
a       1       1'
a       1       1''
c       3       3'
chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)'
Item    Value   Comment(join)
a       1       1'
a       1       1''
c       3       3'
chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)' -a
Item    Value   Comment(join)
a       1       1'
a       1       1''
b       2       --
c       3       3'
chris-cmd$ ./TableJoiner.py -H --basefile /tmp/base.tsv --joinfile /tmp/join.tsv -1 1 -2 1 2 --header-suffix '(join)' -a --summary
Item    Value   Join lines count(join)
a       1       2
b       2       0
c       3       1