Data Integrator
gcoords2reg - Retrieval of Ensembl regulatory information

Ensembl regulatory features carry a wealth of information, and this tool tries to capture both a broad and a more detailed view of the data provided by Ensembl. In general, regulatory data is provided as intervals associated with regulatory properties. At the top level is the MultiCell summary, in which Ensembl condenses information from the individual underlying cell lines and regulatory elements. Each underlying cell line is in turn split into many segments with regulatory properties such as transcriptional activity, promoter regions, or methylation patterns; some of these are predicted, others are based on experimental data.

Input

The basic input type is a genomic coordinate. If one column is specified (-c 1 in the examples below), it is used as the location to query for regulatory information. If two columns are given (-c 1,2), they identify an interval, which is queried for regulatory information. A single-column query may be extended by setting a search radius (-r) to retrieve interval-like data. In addition, a granularity level (--granularity) decides how much information is displayed. At the coarse level (0 in the examples below), only summaries at the MultiCell level are shown; the fine level (1) goes one level deeper and displays summaries at the cell-line level.
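The effect of a search radius on a single-column query can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the assembly:chromosome:position layout is inferred from the example coordinates in this document, the names parse_coordinate, expand_by_radius and overlaps are hypothetical, and inclusive 1-based coordinates are an assumption.

```python
from typing import NamedTuple, Tuple


class GenomicCoordinate(NamedTuple):
    assembly: str
    chromosome: str
    position: int


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Split a coordinate such as 'GRCh37:17:46618591' into its parts.

    The three-field layout is inferred from the examples in this document.
    """
    assembly, chromosome, position = text.strip().split(":")
    return GenomicCoordinate(assembly, chromosome, int(position))


def expand_by_radius(coord: GenomicCoordinate, radius: int) -> Tuple[str, int, int]:
    """Turn a single position into an interval reaching `radius` bases
    upstream and downstream. Clamping at position 1 is an assumption."""
    start = max(1, coord.position - radius)
    end = coord.position + radius
    return coord.chromosome, start, end


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two inclusive intervals share at least one base (assumed semantics)."""
    return a_start <= b_end and b_start <= a_end
```

With a radius of 10, the coordinate GRCh37:17:46618591 becomes the interval 46618581-46618601 on chromosome 17, which overlaps a regulatory element spanning 46618590-46620119.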

Options applicable to more than a single tool are summarized in common command line options.

Output

Independent of how the data is retrieved, the columns added depend on the chosen granularity level. If MultiCell-level data has been chosen, the columns Human Ensembl Regulation ID, Human ENSR Start, Human ENSR End, and Human ENSR Categories are added. They contain the Ensembl regulatory element identifier (ENSR code), its start and end genomic coordinates, and the categories of the individual cell lines (currently Promoter associated, Gene associated, PolIII associated, Non-gene associated, and Unclassified). The region categories are condensed into their name and the number of cell lines they are assigned to, ordered by this number. Please be aware that the MultiCell summary itself lacks a category, as it might contradict the individual cell lines' assignments.
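The condensing step described above amounts to counting how often each category occurs across cell lines and ordering by that count. A minimal Python sketch follows; the function name condense_categories is hypothetical, and breaking ties alphabetically is an assumption, since this document does not say how equal counts are ordered.

```python
from collections import Counter
from typing import Iterable


def condense_categories(categories: Iterable[str]) -> str:
    """Condense per-cell-line categories into 'Name (count)' entries,
    ordered by descending count. Alphabetical tie-breaking is an assumption."""
    counts = Counter(categories)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ",".join(f"{name} ({count})" for name, count in ordered)


# Per-cell-line categories taken from the detailed example output
# for ENSR00001348194 shown later in this document.
example = (
    ["Promoter Associated"] * 6
    + ["Unclassified"] * 3
    + ["Gene Associated"] * 2
)
summary = condense_categories(example)
# summary == "Promoter Associated (6),Unclassified (3),Gene Associated (2)"
```

The result matches the Human ENSR Categories value that the MultiCell-level example in this document reports for the same regulatory element.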

For high-detail information, the output consists of the first three columns of the MultiCell output, followed by the columns Human ENSR Cell Line, Human ENSR Category, and Human ENSR Segmentation Classes. These hold the cell line name, its category (one of the categories summarized in the MultiCell Human ENSR Categories column described above), and the list of evidence types, consisting of transcription factors and histone modifications, ordered by number of occurrences.
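For downstream processing, the condensed lists in Human ENSR Categories and Human ENSR Segmentation Classes can be split back into name and count pairs. The sketch below assumes the "Name (count)" entries are separated by commas, as in the example output in this document, and that names contain no commas; parse_counted_list is a hypothetical helper, not part of the tool.

```python
import re
from typing import List, Tuple

# One entry looks like "H3K27me3 (4)": a name, then a count in parentheses.
_ENTRY = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<count>\d+)\)\s*$")


def parse_counted_list(text: str) -> List[Tuple[str, int]]:
    """Parse 'A (2),B (1)' into [('A', 2), ('B', 1)], preserving order."""
    pairs = []
    for entry in text.split(","):
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unexpected entry: {entry!r}")
        pairs.append((match.group("name"), int(match.group("count"))))
    return pairs
```

Applied to the string "H3K27me3 (4),DNase1 (3),H3K4me2 (1),Yy1 (1)", this yields four pairs in their original order, starting with ("H3K27me3", 4).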

For example, Ensembl regulatory element ENSR00001348194 is located on chromosome 17, 46618590-46620119. Thus the input file /tmp/reg.tsv

GC_1                GC_2
GRCh37:17:46618591  GRCh37:17:46618600

describes two genomic coordinates that are located in this regulatory element. The command line call

$ perl gcoords2reg.pl -H -c 1 -r 10 --granularity 0 </tmp/reg.tsv

will output MultiCell regulatory data for the genomic coordinate found in column 1 looking 10 base pairs upstream and downstream from that position:

GC_1               GC_2               Human Ensembl Regulation ID Human ENSR Start   Human ENSR End     Human ENSR Categories
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 Promoter Associated (6),Unclassified (3),Gene Associated (2)

The same output is retrieved when querying an interval,

$ perl gcoords2reg.pl -H -c 1,2 -r 10 --granularity 0 </tmp/reg.tsv

whereas detailed information is retrieved by

$ perl gcoords2reg.pl -H -c 1 -r 10 --granularity 1 </tmp/reg.tsv

and leads to the following, larger output:

GC_1               GC_2               Human Ensembl Regulation ID Human ENSR Start   Human ENSR End     Human ENSR Cell Line Human ENSR Category Human ENSR Segmentation Classes
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 CD4                  Promoter Associated DNase1 (2),H2BK120ac (2),H2BK5ac (2),H3K27ac (2),H3K4me3 (2),H3K9ac (2),H4K5ac (2),H4K91ac (2),H2AZ (1),H3K18ac (1),H3K36ac (1),H3K4ac (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 GM12878              Promoter Associated H3K27ac (2),H3K9ac (2),H3K27me3 (1),H3K4me3 (1),Yy1 (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 H1ESC                Promoter Associated H3K27me3 (4),DNase1 (3),H3K4me2 (1),Yy1 (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 HepG2                Unclassified        H3K27me3 (2)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 HMEC                 Promoter Associated H3K27ac (3),DNase1 (2),H3K4me2 (2),H3K4me3 (2),H3K9ac (2)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 HSMM                 Promoter Associated H3K4me2 (2),H3K4me3 (2),H3K27ac (1),H3K36me3 (1),H3K9ac (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 HUVEC                Gene Associated     H3K4me3 (4),DNase1 (3),H3K27ac (2),H3K4me2 (2),H3K36me3 (1),H3K9ac (1),PolII (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 IMR90                Unclassified        DNase1 (3),H3K4me2 (3),H3K18ac (2),H3K4ac (2),H3K56ac (2),H3K79me2 (2),H4K5ac (2),H4K91ac (2),H2AK5ac (1),H2BK120ac (1),H3K14ac (1),H3K27ac (1),H3K36me3 (1),H3K4me3 (1),H3K9ac (1),H4K8ac (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 K562                 Unclassified        DNase1 (4),PolII (4),H3K4me3 (3),H3K27ac (2),H3K4me2 (2),H3K79me2 (2),Yy1 (2),H3K36me3 (1),H3K9ac (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 NH-A                 Gene Associated     DNase1 (2),H3K27ac (2),H3K4me3 (2),H3K27me3 (1),H3K36me3 (1)
GRCh37:17:46618591 GRCh37:17:46618600 ENSR00001348194             GRCh37:17:46618590 GRCh37:17:46620119 NHEK                 Promoter Associated H3K4me2 (2),H3K9ac (2),H3K27me3 (1),H3K4me3 (1)