Data Integrator
gcoordsconservation - Sequence conservation status/GERP scores retriever

The tool gcoordsconservation gets sequence conservation status, GERP scores or GERP "Constrained Elements" for the specified genomic coordinates and supplied method/species set.

Several conservation methods are supported, and are divided into three different categories, which behave slightly differently:

  • EPO/EPO_LOW_COVERAGE/PECAN: these methods return a boolean (0 or 1) indicating if the region (or part of it) is conserved.
  • GERP Scores: these methods return a score. If a region has been supplied, the highest GERP score for the sequence is returned.
  • GERP Constrained Elements: these methods return a boolean (0 or 1) indicating if the region (or part of it) is overlapping a constrained element block.
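The difference between the boolean categories and the score category can be illustrated with a small Python sketch. This is not the tool's implementation (which relies on the Ensembl Perl API). The function names, the interval representation, the sample data, and the choice to return None when no position is scored are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# (start, end), 1-based and inclusive; a hypothetical representation.
Interval = Tuple[int, int]


def overlaps_any(region: Interval, blocks: List[Interval]) -> int:
    """Return 1 if the region (or any part of it) overlaps a block, else 0."""
    start, end = region
    return int(any(b_start <= end and start <= b_end for b_start, b_end in blocks))


def highest_score(region: Interval, scores: Dict[int, float]) -> Optional[float]:
    """Return the highest per-position score inside the region, or None if none exists."""
    start, end = region
    in_region = [score for pos, score in scores.items() if start <= pos <= end]
    return max(in_region) if in_region else None


# Made-up data, not taken from Ensembl.
conserved_blocks = [(100, 200), (500, 650)]
gerp_scores = {150: 0.69, 151: 1.88, 152: -0.4}

print(overlaps_any((190, 300), conserved_blocks))  # region partly inside a block
print(overlaps_any((250, 300), conserved_blocks))  # region outside every block
print(highest_score((150, 152), gerp_scores))      # maximum over the region
```

A single position can be handled by the same functions by passing a region whose start and end are equal.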

For details on the conservation methods themselves, please consult the official Ensembl Compara documentation.

This tool utilizes the Ensembl Perl API.

Input

The input consists of a single column containing the genomic coordinates. The method/species-set to be queried must be supplied on the command line through the --mss option.

A list of all available method/species-sets can be obtained by passing --list-mss on the command line.

Options applicable to more than a single tool are summarized in common command line options.

Output

The sequence conservation status or score is output as a new column. If headers are being used, the column name has the same name as the method/species-set supplied on the command line.
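The output layout described above can be sketched in Python as follows. The helper append_column is hypothetical and only mimics the described behavior. It assumes tab-separated rows, which is an assumption rather than something stated here. The sample values come from the example in this document.

```python
from typing import List


def append_column(lines: List[str], values: List[str], mss_name: str, has_header: bool) -> List[str]:
    """Append one value per data row; with headers, name the new column after the method/species-set."""
    out = []
    data_lines = lines
    if has_header:
        # The header of the new column is the method/species-set name itself.
        out.append(lines[0] + "\t" + mss_name)
        data_lines = lines[1:]
    if len(data_lines) != len(values):
        raise ValueError("need exactly one value per data row")
    out.extend(line + "\t" + value for line, value in zip(data_lines, values))
    return out


table = ["GC", "GRCh38:1:18916968", "GRCh38:1:18917308"]
result = append_column(table, ["1", "0"], "81 amniota vertebrates Mercator-Pecan", has_header=True)
print("\n".join(result))
```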

Example

A list of all available methods/species-sets can be obtained by the following command:

$ ./gcoordsconservation.pl --data-version v-01 --list-mss

which results in the following list in Ensembl r100 (GRCh38):

103 eutherian mammals EPO-Low-Coverage
103 eutherian mammals GERP Conservation Scores
103 eutherian mammals GERP Constrained Elements
13 primates EPO
27 primates EPO-Low-Coverage
49 mammals EPO
81 amniota vertebrates GERP Conservation Scores
81 amniota vertebrates GERP Constrained Elements
81 amniota vertebrates Mercator-Pecan

Given the input file /tmp/gc.tsv

GC
GRCh38:1:18916968
GRCh38:1:18917308
GRCh38:14:72904497-73054722

the command

$ ./gcoordsconservation.pl -H --data-version v-01 -c 1 --mss '81 amniota vertebrates Mercator-Pecan' </tmp/gc.tsv

will produce the following output (using the Ensembl r100 database):

GC                            81 amniota vertebrates Mercator-Pecan
GRCh38:1:18916968             1
GRCh38:1:18917308             0
GRCh38:14:72904497-73054722   1

The same input using the 103 eutherian mammals GERP Conservation Scores method/species-set instead returns the following GERP scores:

GC                             103 eutherian mammals GERP Conservation Scores
GRCh38:1:18916968              0.69
GRCh38:1:18917308              1.88
GRCh38:14:72904497-73054722    4.32
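The coordinate strings used in the example follow an assembly:chromosome:position pattern, with an optional end position for regions. The Python sketch below splits such strings into their parts, which can be handy when preparing or checking input files. The pattern is inferred from the three example inputs only, so the tool itself may accept other forms. The names GenomicCoordinate and parse_coordinate are hypothetical.

```python
import re
from typing import NamedTuple


class GenomicCoordinate(NamedTuple):
    assembly: str
    chromosome: str
    start: int
    end: int  # equal to start for a single position


# Inferred from the example inputs; not an official grammar.
_COORD_RE = re.compile(r"^([^:]+):([^:]+):(\d+)(?:-(\d+))?$")


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Parse 'ASSEMBLY:CHR:POS' or 'ASSEMBLY:CHR:START-END' into its components."""
    match = _COORD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    assembly, chromosome, start, end = match.groups()
    start_pos = int(start)
    end_pos = int(end) if end is not None else start_pos
    return GenomicCoordinate(assembly, chromosome, start_pos, end_pos)


point = parse_coordinate("GRCh38:1:18916968")
region = parse_coordinate("GRCh38:14:72904497-73054722")
print(point)
print(region.end - region.start + 1)  # region length in bases
```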

Warning

Please note that conservation data are not available for GRCh37. Therefore, a command targeting GRCh37 data such as

$ ./gcoordsconservation.pl --data-version v-00 --list-mss

will result in a warning message as follows:

gcoordsconservation.pl WARN: Problem during initialization. Compara database missing?

and output NA on a single line. If NA is then passed as the --mss option, the tool emits the same warning and produces no output at all, which is the behavior in the Galaxy installation.
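A script that consumes the --list-mss listing may want to treat a lone NA line as "no conservation data available". The Python sketch below shows one way to do that. The function name parse_mss_list is hypothetical, and the function only works on listing text that has already been captured; how the listing and the warning are captured is left to the caller.

```python
from typing import List


def parse_mss_list(output: str) -> List[str]:
    """Return the method/species-set names, or an empty list if the listing is the single line NA."""
    names = [line.strip() for line in output.splitlines() if line.strip()]
    if names == ["NA"]:
        # No Compara data for this data version (for example GRCh37).
        return []
    return names


available = parse_mss_list("49 mammals EPO\n81 amniota vertebrates Mercator-Pecan\n")
missing = parse_mss_list("NA\n")
print(available)
print(missing)
```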