Parse any tabular text file (csv, tab delimited, excel exported text files) and output the result back as a simple non-quoted, non-escaped, tab-separated and cleaned file that can be easily parsed by most unix text tools, especially by our own set of tools which are based on tab-separated files.

Input

With this tool a table-like text file can be transformed into a table with well defined and easier parsable rows, ie. a clean version of the table. It aims to read heavily Excel-accentuated files, which inlucde a plethora of special cases that feature entries hampering automated processing, for example new lines in cells. The tool is able to read a text file with any separation character, and each cell can be delimited by a pre-defined string character, such that everything between the start and the end of this character will be treated as the cell's content, including a set of white-space characters, that optionally can be removed from a cell (eg, tab and newline characters). It is also possible to trim cell content from whitespace characters and to remove empty lines (and even empty columns) before the line is printed to the output, please refer to the examples below.

If needed, in a second step cells with multiple entries can be split into rows by using tblsubsplit - Split cells by a non-column separator character.

Options applicable to more than a single tool are summarized in common command line options.

Output

The main objective of this tool is to generate tab-separated files without special characters and escape characters. Nevertheless, column seperation and cell delimiter characters can be specified for the output file.

An Excel CSV file is very likely to look like this:

transformant id,construct id,cg number,chromosome nr,status,fb numbers,synonyms
2615,856,CG17262,2,available,"FBgn0064188,
FBgn0031499,
FBgn0243513","Cnir,
Cni-related,
cnir,
cornichon related"
2620,866,CG3327,2,available,"FBgn0020445,
FBgn0031524,
FBgn0031525","CG15410,
E23,
Early gene at 23"

This however does not look like a table, at least not viewed as above. But if we show the white space characters and still indicate line breaks by separate lines we are able to grasp the table-like structure of the file:

transformant id,construct id,cg number,chromosome nr,status,fb numbers,synonyms
2615,856,CG17262,2,available,"FBgn0064188,\nFBgn0031499,\nFBgn0243513","Cnir\nCni-related,\ncnir,\ncornichon related"
2620,866,CG3327,2,available,"FBgn0020445,\nFBgn0031524,\nFBgn0031525","CG15410,\nE23,\nEarly gene at 23"

HINT: In order to avoid dealing with whitespace characters in the text file, remove them already in your spreadsheet software (Excel or OpenOffice, for example) with a regular expression search and replace.

Examples

Example: Dealing with Excel-like CSV files

Let's assume this is stored in file excel-hell.csv, then the command

chris-cmd$ ./tbl2tbl.pl -d , -q \" -c '\n' -w /tmp/excel-hell.csv

will generate a much more behaved table. Parameters were chosen such that

the column input separator is a command -d ,
the cell quoting character is a quote -q \"
newlines should be removed -c '\n' (this is the Unix version, it may be that on Windows another character must be entered here.)
and whitespace within the cell should be removed.

transformant id

construct id

cg number

chromosome nr

status

fb numbers

synonyms

2615

856

CG17262

2

available

FBgn0064188,FBgn0031499,FBgn0243513

Cnir,Cni-related,cnir,cornichon related

2620

866

CG3327

2

available

FBgn0020445,FBgn0031524,FBgn0031525

CG15410,E23,Early gene at 23

Example: Removing empty lines and columns

This tool can also be used to get rid of empty lines, it even is able to detect empty columns and remove them upon request. If for example file /tmp/empty-lines.tsv is given by

hdr1    hdr2    hdr3    hdr4    hdr5
1       2       3               5
a       b       c               d

A       B       C               E
?       :       ;               -



this    is      the             end

the command

chris-cmd$ ./tbl2tbl.pl -H -r /tmp/empty-lines.tsv

will remove lines without any information, basically empty lines, which frequently appear at the end of a file. The output looks as shown below.

hdr1    hdr2    hdr3    hdr4    hdr5
1       2       3               5
a       b       c               d
A       B       C               E
?       :       ;               -
this    is      the             end

Now, we also want to get rid of column hdr4, which does not contain any valuable information. This is done by issuing the command

chris-cmd$ ./tbl2tbl.pl -H -r -R /tmp/empty-lines.tsv

and presents us with this output:

hdr1    hdr2    hdr3    hdr5
1       2       3       5
a       b       c       d
A       B       C       E
?       :       ;       -
this    is      the     end

Please note that even though there was a header entry (hdr4), the column was deleted since the data cells were all empty.