Recommended order for sc.pp.filter_genes and sc.pp.normalize_total in preprocessing pipeline #3925

@baicai-ai

Description

In Scanpy preprocessing workflows, it is unclear in which order sc.pp.filter_genes (basic gene filtering based on min_cells/min_counts) and sc.pp.normalize_total (library size normalization) should be applied.
If filter_genes is applied before normalize_total:

Removing lowly expressed genes changes the total counts per cell slightly.
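This alters the normalization factors computed in normalize_total, which in turn rescales the normalized values of all remaining genes in each cell by a cell-specific factor. Ratios between genes within a cell are unchanged, but values compared across cells shift.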
While the effect is often minor (especially if filtered genes have very low counts), it can propagate to downstream steps like HVG selection, scaling, PCA, clustering, and differential expression.

If filter_genes is applied after normalization, the normalization is performed on the full (unfiltered) gene set, preserving the original relative proportions more accurately.
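The difference between the two orders can be made concrete with a small toy calculation. The sketch below does not call Scanpy; it uses plain Python helpers (`genes_to_keep`, `normalize_rows`, `subset_genes`, all hypothetical names) that only mimic a min_cells filter and scaling each cell to a fixed total. The counts, the threshold of 2 cells, and the target of 100 are made-up illustration values. A fixed target is also a simplification, since normalize_total defaults to the median of the per-cell totals when no target_sum is given.

```python
# Toy counts: rows are cells, columns are genes (made-up values).
counts = [
    [90, 9, 1, 0],    # cell A
    [40, 50, 0, 10],  # cell B
    [70, 30, 0, 0],   # cell C
]
TARGET_SUM = 100.0  # assumed fixed target, analogous to target_sum
MIN_CELLS = 2       # assumed threshold, analogous to min_cells


def genes_to_keep(matrix, min_cells):
    """Return indices of genes detected (count > 0) in at least min_cells cells."""
    n_genes = len(matrix[0])
    return [
        g for g in range(n_genes)
        if sum(1 for row in matrix if row[g] > 0) >= min_cells
    ]


def normalize_rows(matrix, target_sum):
    """Scale every cell (row) so that its values sum to target_sum."""
    return [[v * target_sum / sum(row) for v in row] for row in matrix]


def subset_genes(matrix, keep):
    """Keep only the gene columns listed in keep."""
    return [[row[g] for g in keep] for row in matrix]


# The filter is decided on raw counts in both cases, so the same genes survive.
keep = genes_to_keep(counts, MIN_CELLS)

# Order 1: filter first, then normalize on the reduced per-cell totals.
filter_then_norm = normalize_rows(subset_genes(counts, keep), TARGET_SUM)

# Order 2: normalize on the full per-cell totals, then drop the same genes.
norm_then_filter = subset_genes(normalize_rows(counts, TARGET_SUM), keep)

for name, a, b in zip("ABC", filter_then_norm, norm_then_filter):
    # Per-cell factor by which the two orders differ (1.0 means no difference).
    factor = a[0] / b[0]
    print(f"cell {name}: filter->norm={a} norm->filter={b} factor={factor:.4f}")
```

In this toy case the two orders differ only by one factor per cell, which depends on how many counts that cell lost to the filter. A cell that lost nothing (cell C) is identical under both orders, while cells A and B are scaled up slightly when filtering comes first.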
Question
Is there a clearly recommended or standard order for these steps in the Scanpy preprocessing pipeline?
From reviewing the official tutorials and documentation:

Basic filtering (filter_cells and filter_genes) is typically done early, right after QC and before normalization.
Normalization follows filtering.
Highly variable gene (HVG) selection and further filtering (e.g., subsetting to HVGs) happen after normalization and log-transform.

However, given the subtle impact on per-cell proportions described above, it would be helpful to have explicit guidance, or a best-practice recommendation, on whether basic filter_genes should strictly precede normalize_total or may follow it.
