Data Integrator
HSGeneAtlas - GNF Gene Atlas based gene expression assignment.

Tissue-specific, genome-wide gene expression was investigated in 2004 by Su et al. using microarrays [1]. The resulting datasets were made publicly available for download and through the BioGPS web service, and have become widely used. Compared to other genome-wide microarray experiments deposited with GEO or ArrayExpress, the data are available as consistently normalized, tissue-specific measurements from a single experiment.

The Gene Atlas dataset is available in two flavors: raw data includes the expression data for each microarray measurement, whereas averaged data presents the mean expression value per probe set and tissue. (Gene expression levels have been measured twice for most tissues, with a few exceptions that were sampled more than twice.) We use the raw data and perform tissue-based averaging as a preprocessing step.
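The tissue-based averaging step can be pictured with a minimal sketch. It is not the tool's actual code: the function name average_by_tissue and the sample values are hypothetical, and it assumes that replicate measurements carry the same tissue name.

```python
from collections import defaultdict
from statistics import mean

def average_by_tissue(sample_tissues, values):
    """Average the raw measurements of one probe set per tissue.

    sample_tissues: tissue name of each microarray sample (replicates share a name)
    values: raw expression values, in the same order as sample_tissues
    """
    grouped = defaultdict(list)
    for tissue, value in zip(sample_tissues, values):
        grouped[tissue].append(value)
    return {tissue: mean(vals) for tissue, vals in grouped.items()}

# Hypothetical raw values: Heart sampled twice, Liver three times.
tissues = ["Heart", "Heart", "Liver", "Liver", "Liver"]
raw = [100.0, 120.0, 30.0, 40.0, 50.0]
averaged = average_by_tissue(tissues, raw)
print(averaged)  # {'Heart': 110.0, 'Liver': 40.0}
```

Grouping by tissue name handles tissues with two replicates and those with more than two in the same way.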

To bring the distribution of gene expression values closer to a normal distribution, we took the natural logarithm of all measurements and computed the sample mean and sample variance for each of the 79 tissues available in the data set. Each measurement was then transformed to a z-score by subtracting the sample mean and dividing by the sample standard deviation (the square root of the sample variance). The resulting distribution still has a rather long upper tail, with z-scores up to +8.25 but only down to -2.69.
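As an illustration of this normalization, here is a minimal sketch for a single tissue. It assumes that all expression values are strictly positive (so that the logarithm is defined) and that the values passed in are the averaged measurements of all probe sets in that tissue. The function name log_zscores and the input values are hypothetical.

```python
import math
from statistics import mean, variance

def log_zscores(values):
    """Natural-log transform, then z-score using the sample mean and variance."""
    logged = [math.log(v) for v in values]
    mu = mean(logged)
    sd = math.sqrt(variance(logged))  # variance() is the sample variance (n - 1)
    return [(x - mu) / sd for x in logged]

# Hypothetical averaged expression values of four probe sets in one tissue.
tissue_values = [math.e ** 1, math.e ** 2, math.e ** 3, math.e ** 4]
z = log_zscores(tissue_values)
print([round(v, 2) for v in z])
```

After the transformation, the values of one tissue have a sample mean of 0 and a sample variance of 1, which makes z-scores comparable across tissues.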

Gene expression data provided by BioGPS is divided into two data sets. One is based on the standard Affymetrix HG-U133A chip and the other on a GNF custom chip produced by Affymetrix. While there is very good support for the HG-U133A chip, there is only limited support for the custom chip array. Affymetrix provides updated probe set to transcript mappings for its own chip, but the custom chip from GNF is delivered with mapping data from the year 2007. Therefore, there may be some information loss for the GNF probe sets, whose identifiers start with gnf1h.

Input

The input for this tool is rather simple. It consists of a single column with NCBI Entrez Gene identifiers.

Optionally, several filters are available for coarsely sifting the output:

  • By default, all available tissues are listed in the output. A complete list of tissues is shown below. Any subset of these tissues can be chosen for restricted output.

  • It is possible to select overexpressed, underexpressed, or both over- and underexpressed genes. Since the data has been normalized to z-scores, a value below 0.0 signifies underexpression, and a value above 0.0 signifies overexpression. Be aware that the distribution has a rather long tail for overexpressed genes, but a rather short range for underexpressed genes. No restrictions are set by default, so all expression values are output.

  • The Affymetrix HG-U133A chip array has three main types of probe sets, designated at, s_at, and x_at (ordered by decreasing specificity). All probe sets act as antisense (at) targets and ideally target a single transcript. If multiple transcripts of a single gene are targeted, the suffix s_at is used, and if ambiguity arises because multiple genes are targeted, the probe set is designated x_at. See the Affymetrix documentation for more information.

    In this framework, we use these suffixes for restricting to specific subsets of probe sets, listed below starting with the smallest subset:

    • at: Use only the probe sets that target a single transcript (i.e. only at probe set types).
    • s_at: Use probe sets that target a single transcript or multiple transcripts of the same gene (i.e. both at and s_at probe set types).
    • x_at: Also include ambiguously mapped probe sets (i.e. all three available probe set types at, s_at, and x_at).

If this parameter is omitted, all probe sets are output for a gene.
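The nesting of the three subsets can be sketched as follows. This is an illustration rather than the tool's implementation: the function names are hypothetical, and the sketch assumes that probe set IDs end in _at, _s_at, or _x_at. How the tool treats other suffixes is not covered here.

```python
# Filter levels ordered from the smallest to the largest subset.
LEVELS = ["at", "s_at", "x_at"]

def probe_set_type(probe_set_id):
    """Classify a probe set ID by its suffix (longer suffixes are checked first)."""
    if probe_set_id.endswith("_x_at"):
        return "x_at"
    if probe_set_id.endswith("_s_at"):
        return "s_at"
    if probe_set_id.endswith("_at"):
        return "at"
    return None  # other suffix types are outside the scope of this sketch

def passes_probe_set_filter(probe_set_id, level=None):
    """Return True if the probe set belongs to the subset selected by level."""
    if level is None:  # parameter omitted: all probe sets are output
        return True
    kind = probe_set_type(probe_set_id)
    if kind is None:
        return False  # assumption: unknown suffixes are dropped when filtering
    return LEVELS.index(kind) <= LEVELS.index(level)

ids = ["201034_at", "201752_s_at", "201753_s_at", "205882_x_at"]
kept = [p for p in ids if passes_probe_set_filter(p, "s_at")]
print(kept)  # ['201034_at', '201752_s_at', '201753_s_at']
```

Because every probe set ID ends in _at, the more specific suffixes _x_at and _s_at have to be tested before the plain _at suffix.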

This is the list of available tissues. Please be aware that the names are case sensitive when used as input parameters:

721_B_lymphoblasts Adipocyte AdrenalCortex Adrenalgland Amygdala
Appendix AtrioventricularNode BDCA4+_DentriticCells Bonemarrow BronchialEpithelialCells
CD105+_Endothelial CD14+_Monocytes CD19+_BCells(neg._sel.) CD33+_Myeloid CD34+CD4+_Tcells
CD56+_NKCells CD71+_EarlyErythroid CD8+_Tcells CardiacMyocytes Caudatenucleus
Cerebellum CerebellumPeduncles CiliaryGanglion CingulateCortex Colorectaladenocarcinoma
DorsalRootGanglion FetalThyroid Fetalbrain Fetalliver Fetallung
GlobusPallidus Heart Hypothalamus Kidney Leukemia_chronicMyelogenousK-562
Leukemia_promyelocytic-HL-60 Leukemialymphoblastic(MOLT-4) Liver Lung Lymphnode
Lymphoma_burkitts(Daudi) Lymphoma_burkitts(Raji) MedullaOblongata OccipitalLobe OlfactoryBulb
Ovary Pancreas PancreaticIslet ParietalLobe Pituitary
Placenta Pons PrefrontalCortex Prostate Salivarygland
SkeletalMuscle Skin SmoothMuscle Spinalcord SubthalamicNucleus
SuperiorCervicalGanglion TemporalLobe Testis TestisGermCell TestisIntersitial
TestisLeydigCell TestisSeminiferousTubule Thalamus Thymus Thyroid
Tongue Tonsil Trachea TrigeminalGanglion Uterus
UterusCorpus WholeBlood Wholebrain colon pineal_day
pineal_night retina small_intestine

Options applicable to more than a single tool are summarized in common command line options.

Output

Technically, a probe set is a small set of distinct 25 bp oligonucleotides that target a short region overlapping with a gene, ideally restricted to a single transcript of that gene. Affymetrix provides translation tables from probe set IDs to the targeted transcripts and genes. Because the GNF data files a) are not updated and b) list at most a single transcript in their probe set hit list (even for s_at probe sets), we provide the link only at the gene level.

The output consists of the probe set ID (column Gene Atlas Probe Set ID) and a variable number of columns, one for each selected tissue. These columns show the normalized gene expression values as z-scores, or the empty cell identifier if the value did not pass the threshold filter for this particular probe set. If filters were selected and header output was switched on, the respective headers carry additional information on the filter type and value. For a probe set filter, the text filter=type is added, where type is the type of filter that has been applied. If a threshold has been selected, the text th=value is added to the header, where value is the threshold as a float, prefixed with a sign that identifies the selected gene expression filter: + for overexpressed genes, - for underexpressed genes, and +- for both over- and underexpressed genes.
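The header annotation can be sketched as follows. The function tissue_header and the mode names "over" and "under" are hypothetical (only "both" appears in the documented command line). The layout with both filters set follows the headers in the output example, and the layout with only one filter set is an assumption.

```python
# Sign prefix per expression filter; "over" and "under" are assumed names.
SIGNS = {"over": "+", "under": "-", "both": "+-"}

def tissue_header(tissue, threshold=None, expression_filter=None,
                  probe_set_filter=None):
    """Build a tissue column header, annotated with the active filters."""
    parts = []
    if threshold is not None and expression_filter is not None:
        parts.append("th={}{:.2f}".format(SIGNS[expression_filter], threshold))
    if probe_set_filter is not None:
        parts.append("filter={}".format(probe_set_filter))
    if not parts:  # no filter selected: plain tissue name
        return tissue
    return "{} ({})".format(tissue, ",".join(parts))

print(tissue_header("Tongue", 0.5, "both", "s_at"))
# Tongue (th=+-0.50,filter=s_at)
```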

A single gene may be represented by multiple probe sets. For example, the adducin 3 (gamma) gene, Entrez ID 120, is targeted by probe sets 201034_at, 201752_s_at, 201753_s_at, and 205882_x_at. In contrast, calponin 2, Entrez ID 1265, is targeted only by probe set 201605_x_at. Depending on the filters described in the input section, some probe set IDs may not be output at all. Also, if the threshold filter has been set to a value that no tissue can satisfy, the probe set is skipped. If filtering removes all probe sets for a gene, empty cell identifiers are output for all columns, including the probe set ID column. In general, filtering by thresholds removes individual cells, which are then indicated by the empty cell identifier.

Output example

For example, if the following data is stored in file /tmp/geneatlas.tsv:

EntrezID  GeneNm
120       ADD3
1265      CNN2

Filtering for over- and underexpressed genes with absolute expression values higher than 0.5, filtering for probe sets that target at most multiple transcripts of the same gene, and restricting the output to the tissues Tongue, Amygdala and Heart is achieved by the command

$ python HSGeneAtlas.py -H --tissues Tongue Amygdala Heart --probeset-filter s_at --threshold 0.5 --expression-filter both -c 1 /tmp/geneatlas.tsv

and outputs

EntrezID  GeneNm    Gene Atlas Probe Set ID  Tongue (th=+-0.50,filter=s_at)  Amygdala (th=+-0.50,filter=s_at)  Heart (th=+-0.50,filter=s_at)
120       ADD3      201034_at                1.37                            3.29                              --
120       ADD3      201752_s_at              1.67                            2.37                              -0.61
120       ADD3      201753_s_at              1.61                            3.09                              1.21
1265      CNN2      --                       --                              --                                --

Notice that, due to the probe set filter, x_at probe sets are not displayed. In particular, for gene CNN2 (calponin 2), which is targeted only by probe set 201605_x_at, no probe set passed the filters. Therefore, the probe set ID column and all tissue columns are filled with empty cell identifiers. Also, because the expression filter was applied for both over- and underexpressed genes, all values with an absolute value higher than 0.5 are listed in the output. This includes the underexpressed probe set 201752_s_at in heart.
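The effect of the threshold on individual cells can be sketched as follows. This is a hedged illustration, not the tool's code: the function names and the mode names "over" and "under" are hypothetical, and the sketch assumes that the underexpression filter keeps values below the negative threshold. The z-scores are those of probe set 201752_s_at in the example above.

```python
EMPTY = "--"  # empty cell identifier as it appears in the example output

def passes_threshold(z, threshold, mode):
    """Check a z-score against the threshold for the chosen expression filter."""
    if mode == "over":
        return z > threshold
    if mode == "under":
        return z < -threshold  # assumption: threshold is mirrored for underexpression
    if mode == "both":
        return abs(z) > threshold
    raise ValueError("unknown expression filter: {}".format(mode))

def format_cell(z, threshold=None, mode="both"):
    """Return the formatted z-score, or the empty cell identifier if filtered out."""
    if threshold is None or passes_threshold(z, threshold, mode):
        return "{:.2f}".format(z)
    return EMPTY

row = [1.67, 2.37, -0.61]  # Tongue, Amygdala, Heart for 201752_s_at
print([format_cell(z, 0.5, "both") for z in row])  # ['1.67', '2.37', '-0.61']
print([format_cell(z, 0.5, "over") for z in row])  # ['1.67', '2.37', '--']
```

With the filter for both directions, the heart value of -0.61 is kept because its absolute value exceeds 0.5. In this sketch, a filter for overexpression only would replace it with the empty cell identifier.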

References

[1] Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 101(16):6062-7. [PMID: 15075390]