Open
Labels
Needs info: more information needed
Description
Please make sure these conditions are met
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of scanpy.
- (optional) I have confirmed this bug exists on the main branch of scanpy.
What happened?
I'm trying to predict doublets with Scrublet, using the sc.pp.scrublet function. When I run it without the batch_key parameter, it works, but it fails as soon as I pass that parameter.
I tried these options before submitting the Issue:
- Checked for NaN in the original AnnData and in the AnnData adata_after_qc
- Checked for NaN in the AnnData with simulated doublets, adata_sim
- It's not a problem with my .obs, because I already used the batch effect correction in other Scanpy functions and it worked
- Ran the function with and without the parameter; it only works without the parameter
- Ran the simulation within each batch independently
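The NaN checks listed above can be sketched roughly as follows. This is a minimal illustration, assuming the matrix is either a dense NumPy array or a SciPy sparse matrix; `has_nan` is a hypothetical helper name, not a scanpy function.

```python
import numpy as np
import scipy.sparse as sp


def has_nan(X) -> bool:
    """Return True if a dense or sparse matrix contains any NaN."""
    # In a sparse matrix only the explicitly stored values can be NaN,
    # so it is enough to inspect the .data array.
    values = X.data if sp.issparse(X) else np.asarray(X)
    return bool(np.isnan(values).any())


# Toy inputs: a clean sparse matrix and a dense matrix with one NaN.
clean = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
dirty = np.array([[1.0, np.nan], [0.0, 2.0]])
print(has_nan(clean), has_nan(dirty))
```

For an AnnData object, the same helper would be called on `adata.X` or on a layer such as `adata.layers["raw_counts"]`.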
I think the problem might be in how batches are handled when pre-simulated doublets from sc.pp.scrublet_simulate_doublets() are combined with the original AnnData and batch_key, because the AnnData returned by sc.pp.scrublet_simulate_doublets() no longer has the .obs.
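The minimal code sample below deliberately gives one gene zero variance within batch `r1`. As a NumPy-only sketch of why that might matter (an assumption about the mechanism, not a confirmed description of what `sc.pp.scrublet` does internally): z-scoring a column whose standard deviation is zero divides 0 by 0 and yields NaN. `zscore_columns` and `zero_variance_genes_per_batch` are hypothetical helper names.

```python
import numpy as np


def zscore_columns(X: np.ndarray) -> np.ndarray:
    """Naive per-column z-score with no guard against zero variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Silence the runtime warnings; constant columns become NaN (0 / 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        return (X - mean) / std


def zero_variance_genes_per_batch(X: np.ndarray, batches) -> dict:
    """Map each batch label to the column indices that are constant in it."""
    batches = np.asarray(batches)
    return {
        str(b): np.flatnonzero(X[batches == b].var(axis=0) == 0).tolist()
        for b in np.unique(batches)
    }


# Toy data: gene 1 is constant (value 2) in batch "r1" but varies in "r2".
X = np.array([[1.0, 2.0], [3.0, 2.0], [0.0, 1.0], [4.0, 5.0]])
batches = ["r1", "r1", "r2", "r2"]

z_r1 = zscore_columns(X[:2])  # z-score within batch "r1" only
print(np.isnan(z_r1).any(axis=0))
print(zero_variance_genes_per_batch(X, batches))
```

Across all cells the gene still has non-zero variance, which would be consistent with the call succeeding when batch_key is omitted.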
Minimal code sample
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "scanpy@git+https://github.com/scverse/scanpy.git@main",
# ]
# ///
#
# This script automatically imports the development branch of scanpy to check for issues
import anndata as ad
import numpy as np
import scipy.sparse as sp
import scanpy as sc
# TODO: create `adata_after_qc`
# Simulate an AnnData similar to the one in question
n_obs, n_vars = 4151, 8342
dtype = np.int64
# Generate a sparse random count matrix (Poisson-distributed)
X = sp.csr_matrix(np.random.poisson(1, size=(n_obs, n_vars)), dtype=dtype)
adata_dummy = ad.AnnData(X)
# Example metadata
adata_dummy.obs["batch"] = np.random.choice(["r1", "r2", "r3", "r4"], size=n_obs)
adata_dummy.var["gene_symbol"] = [f"gene_{i}" for i in range(n_vars)]
# Store raw counts layer
adata_dummy.layers["raw_counts"] = adata_dummy.X.copy()
# Select the gene that I will modify to have zero variance
zero_var_gene_idx = 13
zero_var_gene_name = adata_dummy.var_names[zero_var_gene_idx]
# Get indices of cells belonging to batch 'r1'
batch_mask = adata_dummy.obs["batch"] == "r1"
r1_idx = np.where(batch_mask)[0]
# Set that gene's values in r1 to a constant
X_dense = adata_dummy.X.toarray()
X_dense[r1_idx, zero_var_gene_idx] = 2 # Constant value of gene
# Convert back to csr sparse matrix
adata_dummy.X = sp.csr_matrix(X_dense)
adata_sim = sc.pp.scrublet_simulate_doublets(
    adata_dummy,
    synthetic_doublet_umi_subsampling=1.0,  # 1.0 == doublet is created adding counts from two random samples
    layer="raw_counts",
)
sc.pp.scrublet(
    adata_dummy,  # Original anndata
    adata_sim=adata_sim,  # Anndata with simulated doublets and modified .X
    knn_dist_metric="euclidean",
    n_prin_comps=20,
    batch_key="batch",
)
Error output
ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
Versions
marimo 0.15.2
scanpy 1.11.4
anndata 0.12.2
seaborn 0.13.2
numpy 2.2.6
scipy 1.16.1
pooch 1.8.2 (v1.8.2)
pandas 2.3.2
---- ----
h5py 3.14.0
Markdown 3.9
charset-normalizer 3.4.3
pymdown-extensions 10.16.1
cffi 2.0.0
typing_extensions 4.15.0
setuptools 80.9.0
msgpack 1.1.1
jedi 0.19.2
crc32c 2.7.1
websockets 15.0.1
parso 0.8.5
statsmodels 0.14.5
threadpoolctl 3.6.0
colorama 0.4.6
zarr 3.1.2
pillow 11.3.0
numba 0.61.2
llvmlite 0.44.0
session-info2 0.2.1
scikit-learn 1.7.2
psutil 7.0.0
six 1.17.0
narwhals 2.4.0
itsdangerous 2.2.0
anyio 4.10.0
Pygments 2.19.2
joblib 1.5.2
PyYAML 6.0.2
patsy 1.0.1
pyparsing 3.2.3
h11 0.16.0
packaging 25.0
matplotlib 3.10.6
legacy-api-wrap 1.4.1
uvicorn 0.35.0
kiwisolver 1.4.9
natsort 8.4.0
click 8.2.1
numcodecs 0.16.1
donfig 0.8.1.post1
python-dateutil 2.9.0.post0
docutils 0.22
platformdirs 4.5.0
pytz 2025.2
zstandard 0.25.0
tomlkit 0.13.3
tqdm 4.67.1
pycparser 2.22
sniffio 1.3.1
starlette 0.47.3
cycler 0.12.1
---- ----
Python 3.13.7 | packaged by conda-forge | (main, Sep 3 2025, 14:30:35) [GCC 14.3.0]
OS Linux-6.8.0-85-generic-x86_64-with-glibc2.39
CPU 128 logical CPU cores, x86_64
GPU No GPU found
Updated 2025-10-17 15:04