
Implementation of sc.tl.score_genes doesn't match reference #3845

@MLubetzki

Description

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

Hi everyone!
I had a look at the implementation of sc.tl.score_genes and compared it to what the reference (see #2609 ) is doing and what the corresponding code in Seurat is doing here. My impression is that Seurat is consistent with the paper, but we are deviating from that implementation. I'll detail this below.

This is my understanding of how Seurat/the reference suggest it:
If we want to calculate a score for a set of genes, we have to supply a list of those genes and, in addition, a list of 'reference genes'. The purpose of the reference genes is to "decrease the effect that the quality and complexity of each cell’s data might have" (see Tirosh 2016). In practice, I'd say users don't explicitly specify reference genes, but instead use the default behavior of letting Seurat select a random subset.
Now, it might be the case that our gene list is mostly a set of lowly expressed transcription factors. If we were to randomly select a reference gene set out of all genes, it would probably have a larger average expression than our transcription factors, which in turn would make the score negative even for cells with (comparably) high reads in those factors. To combat this, we don't select the reference genes purely at random, but instead bin all genes (into n_bins bins) by their average expression values. Then, for each gene in our genes of interest, we pick reference genes from the same bin, which avoids this effect and makes the score values center around zero. The way one can interpret this score is: "When comparing our gene list with random reference genes (of similar expression levels) - are our genes upregulated (positive score) or downregulated (negative score)?"
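To make concrete what I mean by per-gene control sampling, here is a minimal sketch, assuming a rank-based binning of average expression. The function name, its signature, and the binning details are hypothetical illustrations, not scanpy's or Seurat's actual API:

import numpy as np
import pandas as pd

def score_with_per_gene_controls(adata, gene_list, ctrl_size=50, n_bins=25, seed=0):
    """Hypothetical sketch: draw controls per gene of interest, from its own bin."""
    rng = np.random.default_rng(seed)
    # Average expression per gene, binned into roughly equal-size bins by rank
    gene_means = pd.Series(
        np.asarray(adata.X.mean(axis=0)).ravel(), index=adata.var_names
    )
    bin_of = pd.cut(gene_means.rank(method="min"), bins=n_bins, labels=False)
    # For EACH gene of interest, draw ctrl_size reference genes from its bin
    # (assumes every bin holds at least ctrl_size candidate genes)
    control = []
    for g in gene_list:
        candidates = bin_of.index[bin_of == bin_of[g]].difference(gene_list)
        control.extend(rng.choice(np.asarray(candidates), size=ctrl_size, replace=False))
    # Per-cell score: mean over genes of interest minus mean over controls
    gidx = adata.var_names.get_indexer(list(gene_list))
    cidx = adata.var_names.get_indexer(control)
    mean_list = np.asarray(adata.X[:, gidx].mean(axis=1)).ravel()
    mean_ctrl = np.asarray(adata.X[:, cidx].mean(axis=1)).ravel()
    return mean_list - mean_ctrl

The point of the sketch is only the sampling loop: the control pool's composition across bins mirrors the gene list's composition, so the control mean is comparable to the gene-list mean.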

In principle, scanpy does the same thing. However, here we only add one random selection per bin that occurs in our gene list, instead of one random selection per gene of interest. I would not have expected the np.unique() in the linked line.
Why can this be an issue? Suppose gene_list contains 10 lowly expressed genes (all falling into the same bin) and 2 highly expressed genes that fall into two different buckets. With the default ctrl_size of 50, this adds 50 lowly expressed genes and 100 highly expressed genes to the reference set, when we'd actually need the 10:2 composition of the gene list preserved (i.e. 500 vs. 100). This results in scores being biased towards negative values - and vice versa if we switch the compositions.
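To make the arithmetic explicit, here is the control-pool composition for that example under both sampling schemes (toy numbers only, assuming the default ctrl_size of 50):

ctrl_size = 50  # controls drawn per selection (scanpy's default)

# 10 genes of interest share one low-expression bin;
# 2 genes of interest sit in two distinct high-expression bins.

# Per-gene sampling (my reading of the reference/Seurat):
# the control pool mirrors the gene list's composition.
low_ref, high_ref = 10 * ctrl_size, 2 * ctrl_size   # 500 vs. 100

# Per-unique-bin sampling (scanpy, via the np.unique() call):
# one draw for the single low bin, one per high bin.
low_sc, high_sc = 1 * ctrl_size, 2 * ctrl_size      # 50 vs. 100

# The scanpy control pool is dominated by highly expressed genes,
# inflating the control mean and pushing scores negative.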
I've attached a reproducer for this behavior below.

Minimal code sample

import scanpy as sc
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create genes with different expression levels:
# 25 expression levels, 400 genes each -> 10,000 genes
bins = np.repeat(np.linspace(1, 100, 25), 400)
mat = np.random.choice([1, 0], size=(20000, 10000))

adata = sc.AnnData(mat * bins)


# Six lowly expressed genes (all from the lowest bin)
genes_of_interest = ['0', '1', '2', '3', '4', '5']
# Two genes with high expression from different buckets
genes_of_interest += ['9999', '8001']

sc.tl.score_genes(adata, genes_of_interest)

sns.histplot(adata.obs['score'])
plt.show()

# Score is almost completely negative, even though we'd expect it to be approximately normal around zero

Error output

Versions

scipy	1.16.2
numpy	2.3.4
scanpy	0.1.0.dev3628+gcfab46942.d20251021 (0.1.0.dev3629+g095941443.d20251022)
seaborn	0.13.2
anndata	0.12.3
matplotlib	3.10.7
pandas	2.3.3
----	----
executing	2.2.1
patsy	1.0.2
pillow	12.0.0
prompt_toolkit	3.0.52
PyYAML	6.0.3
scikit-learn	1.7.2
sparse	0.17.0
tblib	3.2.0
llvmlite	0.45.1
toolz	1.1.0
pure_eval	0.2.3
Deprecated	1.2.18
cloudpickle	3.1.1
pluggy	1.6.0
joblib	1.5.2
dask	2025.10.0
python-dateutil	2.9.0.post0
MarkupSafe	3.0.3
stack-data	0.6.3
asttokens	3.0.0
traitlets	5.14.3
zappy	0.2.0
wcwidth	0.2.14
psutil	7.1.1
legacy-api-wrap	1.4.1
cycler	0.12.1
jaraco.functools	4.0.1
packaging	25.0
jaraco.collections	5.1.0
six	1.17.0
Pygments	2.19.2
kiwisolver	1.4.9
wrapt	1.17.3
session-info2	0.2.3
decorator	5.2.1
asciitree	0.3.3
setuptools-scm	9.2.2
coverage	7.11.0
numcodecs	0.15.1
natsort	8.4.0
threadpoolctl	3.6.0
setuptools	80.9.0
h5py	3.15.1
statsmodels	0.14.5
jaraco.context	5.3.0
jedi	0.19.2
msgpack	1.1.2
parso	0.8.5
pathspec	0.12.1
typing_extensions	4.15.0
pytz	2025.2
hatchling	1.27.0
numba	0.62.1
zarr	2.18.7
jaraco.text	3.12.1
pyparsing	3.2.5
Jinja2	3.1.6
ipython	9.6.0
hatch-vcs	0.5.0
pyarrow	21.0.0
fast-array-utils	1.2.5
more-itertools	10.3.0
----	----
Python	3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:22:19) [Clang 14.0.6 ]
OS	macOS-15.7.1-arm64-arm-64bit
CPU	8 logical CPU cores, arm
GPU	No GPU found
Updated	2025-10-22 16:13
