Description
In Scanpy preprocessing workflows, it is unclear in which order sc.pp.filter_genes (basic gene filtering based on min_cells/min_counts) and sc.pp.normalize_total (library size normalization) should be applied.
If filter_genes is applied before normalize_total:
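Removing lowly expressed genes slightly lowers the total counts of every cell that expressed them.
This alters the per-cell normalization factors computed in normalize_total. Ratios between the remaining genes within a single cell are unchanged, but each cell's normalized values are rescaled by a cell-specific factor, so comparisons of a gene across cells shift.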
While the effect is often minor (especially if filtered genes have very low counts), it can propagate to downstream steps like HVG selection, scaling, PCA, clustering, and differential expression.
If filter_genes is applied after normalization, the normalization is performed on the full (unfiltered) gene set, so each normalized value reflects the gene's share of the cell's complete library.
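To make the size of the effect concrete, here is a minimal NumPy sketch on a toy matrix. It does not call Scanpy: the helpers normalize_total and gene_mask below are hypothetical stand-ins that only mimic scaling each cell to a target sum and filtering genes by min_cells, and the counts are invented for illustration.

```python
import numpy as np

def normalize_total(counts, target_sum=1e4):
    """Scale every cell (row) so that its counts sum to target_sum."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * target_sum

def gene_mask(counts, min_cells=2):
    """Keep genes (columns) detected in at least min_cells cells."""
    return (counts > 0).sum(axis=0) >= min_cells

# Toy data: 3 cells x 4 genes; the last gene is detected in one cell only.
counts = np.array([
    [10.0, 30.0, 60.0, 0.0],
    [20.0, 20.0, 50.0, 10.0],
    [5.0, 15.0, 80.0, 0.0],
])

keep = gene_mask(counts, min_cells=2)  # the last gene is dropped

# Order A: filter genes, then normalize on the reduced totals.
filter_then_norm = normalize_total(counts[:, keep])
# Order B: normalize on the full totals, then filter genes.
norm_then_filter = normalize_total(counts)[:, keep]

# Element-wise ratio of the two results. Each row holds one constant:
# only the cell that expressed the dropped gene (row 1) is rescaled,
# here by 100 / 90 because its total falls from 100 to 90.
ratio = filter_then_norm / norm_then_filter
print(ratio)
```

In this toy case the two orders differ only by a per-cell factor equal to the cell's full total divided by its filtered total, so the discrepancy grows with the fraction of a cell's counts carried by the removed genes.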
Question
Is there a clearly recommended or standard order for these steps in the Scanpy preprocessing pipeline?
From reviewing the official tutorials and documentation:
Basic filtering (filter_cells and filter_genes) is typically done early, right after QC and before normalization.
Normalization follows filtering.
Highly variable gene (HVG) selection and further filtering (e.g., subsetting to HVGs) happens after normalization and log-transform.
However, given the subtle impact on per-cell normalization factors described above, it would be helpful to have explicit guidance or a best-practice recommendation on whether basic filter_genes should strictly precede normalize_total or may follow it.
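For reference, the tutorial order listed above can be sketched as a single function. The function name preprocess_tutorial_order and the threshold values (min_genes=200, min_cells=3, target_sum=1e4, n_top_genes=2000) are illustrative assumptions, not recommendations from the documentation; scanpy is imported inside the function so the definition can be loaded even where scanpy is not installed.

```python
def preprocess_tutorial_order(adata, min_genes=200, min_cells=3, n_top_genes=2000):
    """Filter, normalize, log-transform, then subset to HVGs (modifies adata in place)."""
    import scanpy as sc  # imported lazily; requires scanpy to actually run

    # 1. Basic filtering on raw counts.
    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    # 2. Library size normalization on the already filtered gene set.
    sc.pp.normalize_total(adata, target_sum=1e4)

    # 3. Log-transform, then select highly variable genes.
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)

    # 4. Subset to the HVGs and return an independent copy.
    return adata[:, adata.var["highly_variable"]].copy()
```

Swapping steps 1 and 2 for filter_genes gives the alternative order discussed in this issue. Note that in that case a min_counts threshold would be evaluated on normalized values rather than raw counts, whereas min_cells still counts nonzero entries.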