Data Integrator
Ensembl regulatory features include a wealth of information. This tool tries to capture both a broad and a more detailed view of the data provided by Ensembl. In general, regulatory data is provided as intervals associated with regulatory properties. At the top level there is the MultiCell summary, where Ensembl tries to condense information from the individual underlying cell lines and regulatory elements. Each of the underlying cell lines is in turn split into many segments with regulatory properties such as transcriptional activity, promoter regions, or methylation patterns, some of them predicted, others based on experimental data.
The basic input type is a genomic coordinate. If one column is specified, it is used as a location to query for regulatory information. If two columns are given, they identify an interval, which is then queried for regulatory information. A single column query may be extended by setting a search radius to retrieve interval-like data. In addition, a granularity level decides how much information is displayed. At the coarse level, only summaries on the MultiCell level are shown; the fine level goes one level deeper and displays summaries on the cell-line level.
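The mapping from one- or two-column input to a query interval can be sketched in Python as follows. This is only an illustration, not the tool's implementation: the names `parse_coord` and `query_interval` are hypothetical, the `assembly:chromosome:position` format is inferred from the example coordinates used later in this section, and both the symmetric extension by the radius and ignoring the radius for two-column queries are assumptions.

```python
from typing import NamedTuple, Optional


class Coord(NamedTuple):
    assembly: str
    chrom: str
    pos: int


def parse_coord(text: str) -> Coord:
    """Parse a coordinate such as 'GRCh37:17:46618591' (format assumed)."""
    assembly, chrom, pos = text.split(":")
    return Coord(assembly, chrom, int(pos))


def query_interval(first: str, second: Optional[str] = None, radius: int = 0):
    """Return (chrom, start, end) for a one- or two-column query.

    One column: the position, widened by `radius` on both sides.
    Two columns: the interval spanned by both positions; the radius is
    ignored here, which is an assumption of this sketch.
    """
    a = parse_coord(first)
    if second is None:
        return a.chrom, max(1, a.pos - radius), a.pos + radius
    b = parse_coord(second)
    if (a.assembly, a.chrom) != (b.assembly, b.chrom):
        raise ValueError("both coordinates must lie on the same chromosome")
    return a.chrom, min(a.pos, b.pos), max(a.pos, b.pos)


print(query_interval("GRCh37:17:46618591", radius=10))
print(query_interval("GRCh37:17:46618591", "GRCh37:17:46618600"))
```

Any regulatory element overlapping the returned interval would then be a candidate hit for the query.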
Options applicable to more than a single tool are summarized in common command line options.
Regardless of how the data is retrieved, the columns added depend on the chosen granularity level. If MultiCell-level data has been chosen, the columns Human Ensembl Regulation ID, Human ENSR Start, Human ENSR End, and Human ENSR Categories are added. They hold the Ensembl regulatory element identifier (ENSR code), its start and end genomic coordinates, and the categories of the individual cell lines (currently Promoter associated, Gene associated, PolIII associated, Non-gene associated, and Unclassified). The region categories are condensed into their name and the number of cell lines they are assigned to, ordered by this number. Please be aware that the MultiCell entry itself lacks a category, as it might contradict the individual cell lines' assignments.
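How per-cell-line categories could be condensed into such a summary is sketched below in Python. The function name `summarize_categories` is hypothetical, and breaking ties alphabetically is an assumption; the text only states that entries are ordered by count.

```python
from collections import Counter


def summarize_categories(categories):
    """Condense per-cell-line categories into 'Name (count)' entries."""
    counts = Counter(categories)
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ",".join(f"{name} ({n})" for name, n in ordered)


# One category per cell line, as in the example further down this section.
per_cell_line = (
    ["Promoter Associated"] * 6 + ["Unclassified"] * 3 + ["Gene Associated"] * 2
)
print(summarize_categories(per_cell_line))
# prints: Promoter Associated (6),Unclassified (3),Gene Associated (2)
```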
In case of high detail level information, the output consists of the first three columns of the MultiCell output, followed by the columns Human ENSR Cell Line, Human ENSR Category, and Human ENSR Segmentation Classes. These hold the cell line name, its category (one of the categories summarized in the MultiCell Human ENSR Categories column mentioned above), and the list of evidence types, consisting of transcription factors and histone modifications, ordered by number of occurrences.
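For downstream processing, a cell of the Human ENSR Segmentation Classes column (or of Human ENSR Categories, which uses the same `Name (count)` layout) can be split back into name and count pairs. The following Python sketch assumes that names never contain commas; `parse_counted_list` is a hypothetical helper, not part of the tool.

```python
import re

# One entry looks like "H3K27me3 (4)": a name followed by a count in parentheses.
ENTRY = re.compile(r"\s*(.+?)\s*\((\d+)\)\s*")


def parse_counted_list(cell: str):
    """Split 'H3K27me3 (4),DNase1 (3)' into [('H3K27me3', 4), ('DNase1', 3)]."""
    if not cell.strip():
        return []
    pairs = []
    for entry in cell.split(","):
        match = ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError(f"unexpected entry: {entry!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


print(parse_counted_list("H3K27me3 (4),DNase1 (3),H3K4me2 (1),Yy1 (1)"))
```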
For example, Ensembl regulatory element ENSR00001348194 is located on chromosome 17, 46618590-46620119. Thus the input file /tmp/reg.tsv
GC_1	GC_2
GRCh37:17:46618591	GRCh37:17:46618600
describes two genomic coordinates that are located in this regulatory element. The command line call
$ perl gcoords2reg.pl -H -c 1 -r 10 --granularity 0 </tmp/reg.tsv
will output MultiCell regulatory data for the genomic coordinate found in column 1 looking 10 base pairs upstream and downstream from that position:
GC_1	GC_2	Human Ensembl Regulation ID	Human ENSR Start	Human ENSR End	Human ENSR Categories
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	Promoter Associated (6),Unclassified (3),Gene Associated (2)
The same output is retrieved when querying an interval,
$ perl gcoords2reg.pl -H -c 1,2 -r 10 --granularity 0 </tmp/reg.tsv
whereas detailed information is retrieved by
$ perl gcoords2reg.pl -H -c 1 -r 10 --granularity 1 </tmp/reg.tsv
and leads to the following, larger output:
GC_1	GC_2	Human Ensembl Regulation ID	Human ENSR Start	Human ENSR End	Human ENSR Cell Line	Human ENSR Category	Human ENSR Segmentation Classes
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	CD4	Promoter Associated	DNase1 (2),H2BK120ac (2),H2BK5ac (2),H3K27ac (2),H3K4me3 (2),H3K9ac (2),H4K5ac (2),H4K91ac (2),H2AZ (1),H3K18ac (1),H3K36ac (1),H3K4ac (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	GM12878	Promoter Associated	H3K27ac (2),H3K9ac (2),H3K27me3 (1),H3K4me3 (1),Yy1 (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	H1ESC	Promoter Associated	H3K27me3 (4),DNase1 (3),H3K4me2 (1),Yy1 (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	HepG2	Unclassified	H3K27me3 (2)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	HMEC	Promoter Associated	H3K27ac (3),DNase1 (2),H3K4me2 (2),H3K4me3 (2),H3K9ac (2)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	HSMM	Promoter Associated	H3K4me2 (2),H3K4me3 (2),H3K27ac (1),H3K36me3 (1),H3K9ac (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	HUVEC	Gene Associated	H3K4me3 (4),DNase1 (3),H3K27ac (2),H3K4me2 (2),H3K36me3 (1),H3K9ac (1),PolII (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	IMR90	Unclassified	DNase1 (3),H3K4me2 (3),H3K18ac (2),H3K4ac (2),H3K56ac (2),H3K79me2 (2),H4K5ac (2),H4K91ac (2),H2AK5ac (1),H2BK120ac (1),H3K14ac (1),H3K27ac (1),H3K36me3 (1),H3K4me3 (1),H3K9ac (1),H4K8ac (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	K562	Unclassified	DNase1 (4),PolII (4),H3K4me3 (3),H3K27ac (2),H3K4me2 (2),H3K79me2 (2),Yy1 (2),H3K36me3 (1),H3K9ac (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	NH-A	Gene Associated	DNase1 (2),H3K27ac (2),H3K4me3 (2),H3K27me3 (1),H3K36me3 (1)
GRCh37:17:46618591	GRCh37:17:46618600	ENSR00001348194	GRCh37:17:46618590	GRCh37:17:46620119	NHEK	Promoter Associated	H3K4me2 (2),H3K9ac (2),H3K27me3 (1),H3K4me3 (1)
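Since the fine-granularity output repeats the regulatory element on every row, it is often convenient to group the rows by element again. The Python sketch below does this for three rows copied from the example output above. It assumes that the output is tab-separated with a header line; `cell_lines_by_element` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

HEADER = ["GC_1", "GC_2", "Human Ensembl Regulation ID", "Human ENSR Start",
          "Human ENSR End", "Human ENSR Cell Line", "Human ENSR Category",
          "Human ENSR Segmentation Classes"]
# Columns shared by every row of the example output.
PREFIX = ["GRCh37:17:46618591", "GRCh37:17:46618600", "ENSR00001348194",
          "GRCh37:17:46618590", "GRCh37:17:46620119"]
ROWS = [
    PREFIX + ["H1ESC", "Promoter Associated",
              "H3K27me3 (4),DNase1 (3),H3K4me2 (1),Yy1 (1)"],
    PREFIX + ["HepG2", "Unclassified", "H3K27me3 (2)"],
    PREFIX + ["NHEK", "Promoter Associated",
              "H3K4me2 (2),H3K9ac (2),H3K27me3 (1),H3K4me3 (1)"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def cell_lines_by_element(handle):
    """Map each ENSR identifier to a {cell line: category} dictionary."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        element = row["Human Ensembl Regulation ID"]
        grouped[element][row["Human ENSR Cell Line"]] = row["Human ENSR Category"]
    return dict(grouped)


print(cell_lines_by_element(io.StringIO(SAMPLE)))
```

In practice the handle would be the tool's output stream or a saved output file rather than an in-memory sample.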