Data Integrator
|
Parse any tabular text file (csv, tab delimited, excel exported text files) and output the result back as a simple non-quoted, non-escaped, tab-separated and cleaned file that can be easily parsed by most unix text tools, especially by our own set of tools which are based on tab-separated files.
With this tool a table-like text file can be transformed into a table with well defined and easier parsable rows, ie. a clean version of the table. It aims to read heavily Excel-accentuated files, which inlucde a plethora of special cases that feature entries hampering automated processing, for example new lines in cells. The tool is able to read a text file with any separation character, and each cell can be delimited by a pre-defined string character, such that everything between the start and the end of this character will be treated as the cell's content, including a set of white-space characters, that optionally can be removed from a cell (eg, tab and newline characters). It is also possible to trim cell content from whitespace characters and to remove empty lines (and even empty columns) before the line is printed to the output, please refer to the examples below.
If needed, in a second step cells with multiple entries can be split into rows by using tblsubsplit - Split cells by a non-column separator character.
Options applicable to more than a single tool are summarized in common command line options.
The main objective of this tool is to generate tab-separated files without special characters and escape characters. Nevertheless, column seperation and cell delimiter characters can be specified for the output file.
An Excel CSV file is very likely to look like this:
transformant id,construct id,cg number,chromosome nr,status,fb numbers,synonyms 2615,856,CG17262,2,available,"FBgn0064188, FBgn0031499, FBgn0243513","Cnir, Cni-related, cnir, cornichon related" 2620,866,CG3327,2,available,"FBgn0020445, FBgn0031524, FBgn0031525","CG15410, E23, Early gene at 23"
This however does not look like a table, at least not viewed as above. But if we show the white space characters and still indicate line breaks by separate lines we are able to grasp the table-like structure of the file:
transformant id,construct id,cg number,chromosome nr,status,fb numbers,synonyms
2615,856,CG17262,2,available,"FBgn0064188,\nFBgn0031499,\nFBgn0243513","Cnir\nCni-related,\ncnir,\ncornichon related"
2620,866,CG3327,2,available,"FBgn0020445,\nFBgn0031524,\nFBgn0031525","CG15410,\nE23,\nEarly gene at 23"
HINT: In order to avoid dealing with whitespace characters in the text file, remove them already in your spreadsheet software (Excel or OpenOffice, for example) with a regular expression search and replace.
Let's assume this is stored in file excel-hell.csv
, then the command
chris-cmd$ ./tbl2tbl.pl -d , -q \" -c '\n' -w /tmp/excel-hell.csv
will generate a much more behaved table. Parameters were chosen such that
-d ,
-q \"
-c '\n'
(this is the Unix version, it may be that on Windows another character must be entered here.) transformant id | construct id | cg number | chromosome nr | status | fb numbers | synonyms |
---|---|---|---|---|---|---|
2615 | 856 | CG17262 | 2 | available | FBgn0064188,FBgn0031499,FBgn0243513 | Cnir,Cni-related,cnir,cornichon related |
2620 | 866 | CG3327 | 2 | available | FBgn0020445,FBgn0031524,FBgn0031525 | CG15410,E23,Early gene at 23 |
This tool can also be used to get rid of empty lines, it even is able to detect empty columns and remove them upon request. If for example file /tmp/empty-lines.tsv
is given by
hdr1 hdr2 hdr3 hdr4 hdr5 1 2 3 5 a b c d A B C E ? : ; - this is the end
the command
chris-cmd$ ./tbl2tbl.pl -H -r /tmp/empty-lines.tsv
will remove lines without any information, basically empty lines, which frequently appear at the end of a file. The output looks as shown below.
hdr1 hdr2 hdr3 hdr4 hdr5 1 2 3 5 a b c d A B C E ? : ; - this is the end
Now, we also want to get rid of column hdr4
, which does not contain any valuable information. This is done by issuing the command
chris-cmd$ ./tbl2tbl.pl -H -r -R /tmp/empty-lines.tsv
and presents us with this output:
hdr1 hdr2 hdr3 hdr5 1 2 3 5 a b c d A B C E ? : ; - this is the end
Please note that even though there was a header entry (hdr4
), the column was deleted since the data cells were all empty.