Data Integrator
The MetaRanker [1] approach can be used to obtain a single score for a set of genes that has been scored by multiple, different methods, independent of the type of analysis applied in each of the methods. Scores and genes are inseparably connected: a gene either has a score value or has none at all. It follows that each scoring method yields its own set of scored genes. Let G = {g_1, g_2, ..., g_M} be the set of M genes that are scored by N different methods. Let c denote the column number, which corresponds to the c-th method used to score the set of genes G. The input therefore consists of N columns and M rows, and since each scoring method results in its own subset of scored genes, we assume that the c-th method has N_c <= M genes scored. Now let r_{g_i,c} be the rank of gene g_i in method c. If the gene has not been scored by method c, it automatically receives the lowest rank, N_c. In addition, each method can be weighted by a number w_c > 0. The MetaRanker score S_i for gene g_i is then given by
$$S_i = \prod_{c=1}^{N} \left( \frac{r_{g_i,c}}{N_c} \right)^{w_c}$$
or, in log space
$$\ln S_i = \sum_{c=1}^{N} w_c \, \ln \frac{r_{g_i,c}}{N_c}$$
Peculiarities of this formula are discussed in the input section.
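As an illustration, the score computation can be sketched in Python as follows. This is a minimal sketch, not the code of MetaRanker.py: the function name meta_rank_score and the example numbers are hypothetical, and it assumes only that the score is the product, over all methods, of the normalized rank (rank divided by the number of scored genes) raised to the method's weight.

```python
import math

def meta_rank_score(ranks, n_scored, weights):
    """Return (score, log_score) for a single gene.

    ranks    -- rank of the gene in each method
    n_scored -- number of genes scored by each method (N_c)
    weights  -- weight of each method (w_c > 0)
    """
    # Summing weighted logs avoids underflow when many small fractions
    # would otherwise be multiplied together.
    log_score = sum(
        w * math.log(r / n) for r, n, w in zip(ranks, n_scored, weights)
    )
    return math.exp(log_score), log_score

# Hypothetical gene, ranked 1st, 2nd, and 4th of 4 genes by three methods.
score, log_score = meta_rank_score([1, 2, 4], [4, 4, 4], [1.0, 1.0, 2.0])
print(score, log_score)  # score is (1/4) * (2/4) * (4/4)**2 = 0.125, up to rounding
```

Working in log space is what makes the reported values negative: every normalized rank lies between 0 and 1, so each term of the sum is at most zero.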
Mandatory input is a reference column of (gene) identifiers and at least one other column containing the scores (i.e. numbers) or empty cells, which indicate that the respective gene has not been considered by the scoring scheme. Take care that the gene identifiers, i.e. the reference column entries, are unique. If an identifier appears more than once, it is counted only once: the score from the first row with this identifier is used for its score column, and a warning message is issued. However, each row carrying such a repeated identifier is output with the calculated scores. If a score cannot be converted to a number, a warning message is issued and the gene is treated as if an empty cell had been supplied (that is, it receives the lowest rank).
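The input rules above can be sketched as follows. This is an illustrative sketch only: the function name collect_scores and the (identifier, raw score) input shape are hypothetical, and, in line with the example further below where a duplicate with identical scores triggers no warning, it is assumed that a repeated identifier only warns when its score differs from the first occurrence.

```python
import warnings

def collect_scores(rows):
    """Map each identifier to a float score, or None for 'not scored'.

    rows -- iterable of (identifier, raw_score_text) pairs for one score column
    """
    first_raw = {}  # raw text of the first occurrence of each identifier
    scores = {}
    for identifier, raw in rows:
        raw = raw.strip()
        if identifier in first_raw:
            # Repeated identifier: the first score wins.
            if raw != first_raw[identifier]:
                warnings.warn(f"duplicate identifier {identifier!r}: keeping first score")
            continue
        first_raw[identifier] = raw
        if raw == "":
            scores[identifier] = None  # empty cell: gene not scored
            continue
        try:
            scores[identifier] = float(raw)
        except ValueError:
            # Not a number: warn and treat like an empty cell.
            warnings.warn(f"score {raw!r} of {identifier!r} is not a number")
            scores[identifier] = None
    return scores

print(collect_scores([("A", "10"), ("A", "10"), ("B", "n/a"), ("C", "")]))
```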
Each score-containing column may have its own sort order, which must be specified. Sorting, either ascending or descending, determines the rank of the scored gene. Sticking to the above nomenclature, there are N_c genes with scores in column c. Identical scores (e.g. for ordinal data, or numbers that coincide after internal rounding to the fourth decimal place) are given the same rank, and the next non-identical score is ranked with a gap corresponding to the number of previous entries with the same score. This is comparable to ex-aequo placement in competitions. For example, if the scores sorted in ascending order yield the series 11, 12, 13, 13, 13, 20, 22, the ranks assigned are 1, 2, 3, 3, 3, 6, 7, respectively. If the scores are very small numbers (e.g. p-values), we recommend transforming them to a logarithmic scale, since internal rounding may otherwise discard relevant information.
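The ranking scheme described above is often called competition ranking. A minimal sketch, assuming rounding to four decimal places before comparison; the function name competition_ranks and its parameters are hypothetical and not part of MetaRanker.py:

```python
def competition_ranks(scores, descending=False, decimals=4):
    """Rank scores so that ties share a rank and leave a gap after them."""
    rounded = [round(s, decimals) for s in scores]
    ordered = sorted(rounded, reverse=descending)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Keep only the first (best) position at which each value occurs.
        first_position.setdefault(value, position)
    return [first_position[value] for value in rounded]

print(competition_ranks([11, 12, 13, 13, 13, 20, 22]))  # [1, 2, 3, 3, 3, 6, 7]

# Very small values collapse into one tie after rounding,
# which is why a log transform is recommended for p-values.
print(competition_ranks([1e-6, 3e-9, 0.5]))  # [1, 1, 3]
```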
In addition, each score column can be given a weight, which is set to one by default. Weights w_c should be chosen such that the fraction in the above-mentioned product formula remains between 0 and 1.
Internally, each input column is sorted, ranked, and normalized. If desired, these values can be output; see Output. Please note that, contrary to the original publication, the last-ranked entry does not receive the normalized score 1.0, but slightly less. This yields additional ranking power for columns containing empty cells, because the last-ranked entry and an empty cell receive different scores.
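The exact normalization is not spelled out here. The example output further below is consistent with dividing each rank by the number of scored genes plus one and assigning 1.0 to empty cells; the following sketch works under that assumption. The function name normalized_rank_scores is hypothetical.

```python
def normalized_rank_scores(ranks):
    """Normalize the ranks of one column; None marks an empty cell.

    Assumption (inferred from the example output, not from the tool's
    source): scored genes get rank / (N_c + 1), empty cells get 1.0.
    """
    n_scored = sum(rank is not None for rank in ranks)
    return [
        1.0 if rank is None else rank / (n_scored + 1)
        for rank in ranks
    ]

# Four scored genes and one empty cell: the last-ranked gene gets 0.8,
# which stays distinguishable from the empty cell's 1.0.
print(normalized_rank_scores([1, 2, 3, 4, None]))  # [0.2, 0.4, 0.6, 0.8, 1.0]
```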
Options applicable to more than a single tool are summarized in common command line options.
The output always consists of at least one column, Meta Rank Score, with the name of the reference column in parentheses. Due to the nature of the score computation, very small numbers between 0 and 1 are common. Therefore, we report the natural logarithm of the score, so a more negative (i.e. lower) MetaRanker score indicates a higher rank.
If chosen, the unweighted, normalized scores are also reported for each column. In terms of the above-mentioned MetaRanker score computation, each column then contains the value of r_{g_i,c} / N_c. The column header, if output, is Rank Score, followed in parentheses by the name of the column this rank score refers to and its weight.
Let us assume the input file is stored in /tmp/meta.tsv:
RowNr  Gene  Score1  Score2   Score3  Score4  Score5
0      A     10      500.123  0.500   0.125   0.125
1      A     10      500.123  0.500   0.125   0.125
2      B     20      -1.1     .000    0.75    0.125
3      C     30      --       3.3     2       3.000
4      D     40      --       --      2       3.000
5      E     --      66.6     2.2     2       3.000
6      F     --      0        -2.2    2       3.000
7      G     --      100      0       1.000   3.000
8      G     --      100      0       1.000   3.000
9      H     --      --       -1.1    1.000   3.000
10     I     --      --       --      --      --
11     J     --      --       --      --      --
12     K     --      --       --      --      --
13     --    --      --       --      --      --
Please note that this example includes all special cases discussed here:
Gene A occurs twice, but the scores are identical, therefore no warning is issued.
Genes are not always assigned a score, and the genes with scores assigned vary between columns.
Different sorting schemes have to be applied in order to compute the ranks.
Genes I, J, and K are not assigned any scores at all.
Identical scores are attributed to different genes in, for example, column Score3.
A bogus column with the same score for all genes except genes I, J, and K is given by column Score4.
The command line
$ python MetaRanker.py -H --score-columns 3a 4a 5d 6d 7a --column-weights 3.0 5.5 7.7 9 11.11 --single-score-columns -c 2 /tmp/meta.tsv
tells the MetaRanker module that columns 3, 4, 5, 6, and 7 contain scores and their weights are 3, 5.5, 7.7, 9, and 11.11, respectively. Columns 5 and 6 are sorted descending, whereas columns 3, 4, and 7 are sorted ascending to obtain the ranking. The reference column is given by column number 2. This command will result in the following output:
RowNr  Gene  Score1  Score2   Score3  Score4  Score5  Rank Score (Score1, w=3.00)  Rank Score (Score2, w=5.50)  Rank Score (Score3, w=7.70)  Rank Score (Score4, w=9.00)  Rank Score (Score5, w=11.11)  Meta Rank Score (Gene)
0      A     10      500.123  0.500   0.125   0.125   0.200000000                  0.833333333                  0.375000000                  0.888888889                  0.111111111                   -38.854679923
1      A     10      500.123  0.500   0.125   0.125   0.200000000                  0.833333333                  0.375000000                  0.888888889                  0.111111111                   -38.854679923
2      B     20      -1.1     .000    0.75    0.125   0.400000000                  0.166666667                  0.500000000                  0.777777778                  0.111111111                   -44.613777475
3      C     30      --       3.3     2       3.000   0.600000000                  1.000000000                  0.125000000                  0.111111111                  0.333333333                   -49.524780465
4      D     40      --       --      2       3.000   0.800000000                  1.000000000                  1.000000000                  0.111111111                  0.333333333                   -32.650034377
5      E     --      66.6     2.2     2       3.000   1.000000000                  0.500000000                  0.250000000                  0.111111111                  0.333333333                   -46.467379797
6      F     --      0        -2.2    2       3.000   1.000000000                  0.333333333                  0.875000000                  0.111111111                  0.333333333                   -39.051163034
7      G     --      100      0       1.000   3.000   1.000000000                  0.666666667                  0.500000000                  0.555555556                  0.333333333                   -25.062953896
8      G     --      100      0       1.000   3.000   1.000000000                  0.666666667                  0.500000000                  0.555555556                  0.333333333                   -25.062953896
9      H     --      --       -1.1    1.000   3.000   1.000000000                  1.000000000                  0.750000000                  0.555555556                  0.333333333                   -19.710814469
10     I     --      --       --      --      --      1.000000000                  1.000000000                  1.000000000                  1.000000000                  1.000000000                   0.000000000
11     J     --      --       --      --      --      1.000000000                  1.000000000                  1.000000000                  1.000000000                  1.000000000                   0.000000000
12     K     --      --       --      --      --      1.000000000                  1.000000000                  1.000000000                  1.000000000                  1.000000000                   0.000000000
13     --    --      --       --      --      --      --                           --                           --                           --                           --                            --
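The reported Meta Rank Score values can be re-derived from the Rank Score columns and the weights given on the command line. The following check is a sketch under the assumption that the final score is the weighted sum of the natural logarithms of the rank scores; the helper name log_meta_score is hypothetical.

```python
import math

# Weights as passed via --column-weights in the example command line.
WEIGHTS = [3.0, 5.5, 7.7, 9.0, 11.11]

def log_meta_score(rank_scores, weights=WEIGHTS):
    """Weighted sum of the natural logs of the normalized rank scores."""
    return sum(w * math.log(x) for w, x in zip(weights, rank_scores))

# Rank Score values copied from the output rows of genes A and H.
gene_a = [0.200000000, 0.833333333, 0.375000000, 0.888888889, 0.111111111]
gene_h = [1.000000000, 1.000000000, 0.750000000, 0.555555556, 0.333333333]

print(log_meta_score(gene_a))  # output reports -38.854679923
print(log_meta_score(gene_h))  # output reports -19.710814469
```

Because the rank scores in the table are rounded to nine decimal places, the recomputed values agree with the reported ones only up to a small rounding difference. Genes I, J, and K have a rank score of 1.0 in every column, so each logarithm is zero and the Meta Rank Score is 0.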
[1] Pers, TH. et al. (2011) Meta-analysis of heterogeneous data sources for genome-scale identification of risk genes in complex phenotypes. Genet Epidemiol. 35, 318-332. [PMID 21484861]