Skip to content

Batch Key not working with sc.pp.scrublet() #3837

@zekepuli

Description

@zekepuli

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

I'm trying to predict my doublets using scrublet. So, I'm using the pp.scrublet function to do that. The thing is, when I run it without adding the parameter batch_key, it works. But is failing when I try to use that parameter.

I tried these options before submitting the Issue:

  • Checked for NaN in original Anndata and Anndata adata_after_qc
  • Checked for NaN in Anndata with simulated doublets adata_sim
  • It's not a problem on my .obs because I already used the batch effect correction in other Scanpy functions and it worked
  • Ran the function with and without the parameter and only works without the parameter
  • Run the simulation within each batch independently

I think it might be a problem managing the batches when you have the combination of pre-simulated doublets using sc.pp.scrublet_simulate_doublets() + original Anndata + batch_key because that the output of sc.pp.scrublet_simulate_doublets() removes the .obs in the Anndata output.

Minimal code sample

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues

import scanpy as sc

# TODO: create `adata_after_qc`
# Simulate an AnnData similar to the one in question
n_obs, n_vars = 4151, 8342
dtype = np.int64

# Generate a sparse random count matrix (Poisson-distributed)
X = sp.csr_matrix(np.random.poisson(1, size=(n_obs, n_vars)), dtype=dtype)

adata_dummy = ad.AnnData(X)

# Example metadata
adata_dummy.obs["batch"] = np.random.choice(["r1", "r2", "r3", "r4"], size=n_obs)
adata_dummy.var["gene_symbol"] = [f"gene_{i}" for i in range(n_vars)]

# Store raw counts layer
adata_dummy.layers["raw_counts"] = adata_dummy.X.copy()


# Select the gene that I will modify to have zero variance
zero_var_gene_idx = 13
zero_var_gene_name = adata_dummy.var_names[zero_var_gene_idx]

# Get indices of cells belonging to batch 'r1'
batch_mask = adata_dummy.obs["batch"] == "r1"
r1_idx = np.where(batch_mask)[0]

# Set that gene's values in r1 to a constant
X_dense = adata_dummy.X.toarray()
X_dense[r1_idx, zero_var_gene_idx] = 2  # Constant value of gene

# Convert back to csr sparse matrix
adata_dummy.X = sp.csr_matrix(X_dense)

adata_sim = sc.pp.scrublet_simulate_doublets(
    adata_dummy, 
    synthetic_doublet_umi_subsampling = 1.0, # 1.0 == doublet is created adding counts from two random samples 
    layer="raw_counts"
)

sc.pp.scrublet(
    adata_dummy, # Original anndata 
    adata_sim=adata_sim, # Anndata with simulated doublets and modified .X 
    knn_dist_metric="euclidean",
    n_prin_comps=20,
    batch_key= "batch"
)

Error output

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Versions

marimo	0.15.2
scanpy	1.11.4
anndata	0.12.2
seaborn	0.13.2
numpy	2.2.6
scipy	1.16.1
pooch	1.8.2 (v1.8.2)
pandas	2.3.2
----	----
h5py	3.14.0
Markdown	3.9
charset-normalizer	3.4.3
pymdown-extensions	10.16.1
cffi	2.0.0
typing_extensions	4.15.0
setuptools	80.9.0
msgpack	1.1.1
jedi	0.19.2
crc32c	2.7.1
websockets	15.0.1
parso	0.8.5
statsmodels	0.14.5
threadpoolctl	3.6.0
colorama	0.4.6
zarr	3.1.2
pillow	11.3.0
numba	0.61.2
llvmlite	0.44.0
session-info2	0.2.1
scikit-learn	1.7.2
psutil	7.0.0
six	1.17.0
narwhals	2.4.0
itsdangerous	2.2.0
anyio	4.10.0
Pygments	2.19.2
joblib	1.5.2
PyYAML	6.0.2
patsy	1.0.1
pyparsing	3.2.3
h11	0.16.0
packaging	25.0
matplotlib	3.10.6
legacy-api-wrap	1.4.1
uvicorn	0.35.0
kiwisolver	1.4.9
natsort	8.4.0
click	8.2.1
numcodecs	0.16.1
donfig	0.8.1.post1
python-dateutil	2.9.0.post0
docutils	0.22
platformdirs	4.5.0
pytz	2025.2
zstandard	0.25.0
tomlkit	0.13.3
tqdm	4.67.1
pycparser	2.22
sniffio	1.3.1
starlette	0.47.3
cycler	0.12.1
----	----
Python	3.13.7 | packaged by conda-forge | (main, Sep  3 2025, 14:30:35) [GCC 14.3.0]
OS	Linux-6.8.0-85-generic-x86_64-with-glibc2.39
CPU	128 logical CPU cores, x86_64
GPU	No GPU found
Updated	2025-10-17 15:04

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions