Options for renormalization during collect_metacells() #87
Description
First of all thank you for developing your tool and your well written vignettes.
I used your iterative vignette to obtain metacells for my dataset, with the intent of using these metacells in downstream analysis tools. However, those tools produced nonsensical outputs, because the gene expression values of the metacells differed from what the tools expect (the standard normalization and log-transformation).
Based on the answer to the issue quoted below, I figured out that during the collect metacells step you renormalize to a 'reads per 1' scale (as opposed to RPM or something like that), which is likely why my downstream analysis was nonsensical. Your collect_metacells() function offers a choice between geomean and linear fractions, but no choice of whether or how renormalization is performed. I therefore looked through your source code in metacells/pipeline/collect.py to see whether I could find where to disable or alter the renormalization to sum to 1 myself, but honestly I got a bit lost.
I was wondering whether you are planning to offer this option in a future update, or alternatively whether you could point me to the line numbers in your source code where I could tweak it myself?
Thanks a million in advance!
Instead, we provide the total UMIs for each metacell and for each gene in each metacell, which you can use to compute the linear fraction of each gene out of the total. Such an estimate is wildly unreliable when applied to single cells, but the whole point of metacells is that they are just large enough to make this robust (and not too large so you can still see details of cell behavior within the "same" cell type).
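As a rough illustration of the linear fraction described above (and of one way to bring it onto the scale that downstream tools often expect), here is a minimal sketch using plain NumPy on a toy matrix. It does not use the metacells API; the `umis` array, the `linear_fractions` helper, and the scale factor of 10,000 are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: total UMIs per gene (columns) in each metacell (rows).
umis = np.array([[120.0, 30.0, 50.0],
                 [10.0, 80.0, 110.0]])


def linear_fractions(umis):
    """Divide each gene's UMIs by the total UMIs of its metacell."""
    totals = umis.sum(axis=1, keepdims=True)  # total UMIs per metacell
    return umis / totals


fractions = linear_fractions(umis)  # each row sums to 1

# Assumed downstream convention: rescale to "per 10,000" and apply log(1 + x).
scale_factor = 1e4
log_normalized = np.log1p(fractions * scale_factor)
```

Whether a per-10,000 scale and `log1p` are the right choice depends on what the downstream tool expects, so treat both as placeholders.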
We compute a fraction per metacell (by default using geomean, but there's a flag to force this back to linear fractions - see the documentation of the collect_metacells function). Geomean is less sensitive to one or two cells with very high expression dominating the results. This looks better on gene-gene plots, but because we renormalize these fractions to sum to 1, it is essentially impossible to compare these normalized geomean fractions with UMI counts in single cells - for that you'd be better off using the linear fractions. TBH we are re-evaluating the geomean approach...
Final note - because we (like everyone else doing scRNA-seq) are forced to use relative fractions of UMIs out of the total, the choice of "total" is crucial. In particular, you can't meaningfully compare fractions between two data sets unless you ensure you use the same set of genes as the denominator in both (typically, choosing the common set of genes). This of course requires recomputing the fractions... For example, we do such a renormalization as one of the first steps in the projection algorithm (to make the atlas and query fractions comparable).
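To make the difference between the two estimates concrete, here is a simplified sketch in plain NumPy. It is not the computation used inside collect_metacells (the actual weighting and regularization may well differ); the toy `cells` matrix, both helper functions, and the `eps` regularizer are assumptions for illustration only.

```python
import numpy as np

# Hypothetical UMIs of two genes (columns) in the four cells (rows) of one metacell.
# The last cell is an outlier with very high expression of gene 0.
cells = np.array([[10.0, 90.0],
                  [10.0, 90.0],
                  [10.0, 90.0],
                  [80.0, 20.0]])


def linear_fraction(cells):
    """Pool the UMIs of all cells, then take each gene's share of the total."""
    gene_totals = cells.sum(axis=0)
    return gene_totals / gene_totals.sum()


def geomean_fraction(cells, eps=1e-5):
    """Geometric mean of the per-cell fractions, renormalized to sum to 1."""
    per_cell = cells / cells.sum(axis=1, keepdims=True)
    geo = np.exp(np.log(per_cell + eps).mean(axis=0)) - eps
    geo = np.clip(geo, 0.0, None)  # guard against tiny negative values
    return geo / geo.sum()


linear = linear_fraction(cells)
geomean = geomean_fraction(cells)
```

In this toy case the outlier cell pulls the linear fraction of gene 0 up to 0.275, while the renormalized geomean estimate stays lower. The final division by `geo.sum()` is the renormalization step that breaks the direct link to UMI counts.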
Originally posted by @orenbenkiki in #83
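The point in the quoted answer about using the same genes as the denominator can be sketched as follows. This is a plain NumPy illustration under assumed inputs, not the projection code from the package; the function name, the gene names, and the toy fractions are all hypothetical.

```python
import numpy as np


def renormalize_on_common_genes(fractions_a, genes_a, fractions_b, genes_b):
    """Restrict both data sets to their shared genes and renormalize each row to sum to 1."""
    common = sorted(set(genes_a) & set(genes_b))
    columns_a = [genes_a.index(gene) for gene in common]
    columns_b = [genes_b.index(gene) for gene in common]
    sub_a = fractions_a[:, columns_a]
    sub_b = fractions_b[:, columns_b]
    sub_a = sub_a / sub_a.sum(axis=1, keepdims=True)
    sub_b = sub_b / sub_b.sum(axis=1, keepdims=True)
    return common, sub_a, sub_b


# Hypothetical fractions for one metacell in each data set (rows sum to 1).
atlas_genes = ["A", "B", "C"]
atlas_fractions = np.array([[0.5, 0.3, 0.2]])
query_genes = ["B", "C", "D"]
query_fractions = np.array([[0.3, 0.2, 0.5]])

common, atlas_common, query_common = renormalize_on_common_genes(
    atlas_fractions, atlas_genes, query_fractions, query_genes
)
```

Before renormalization the two data sets report the same values for genes B and C only by coincidence of their different denominators; after restricting to the common genes, both rows are fractions out of the same total and can be compared directly.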