Adding a registry to have the hashes of datasets (restructured for aws s3) #1076

selmanozleyen · 2025-12-06T06:20:28Z

This is again a continuation of #1072: it adds hashes to the registry and uses the uploaded datasets.

Changes made:

Unit tests no longer download datasets: downloads are costly on S3, and since we host the files ourselves we know the links should be active.
Some datasets were using the scanpy defaults and others were using DEFAULT_CACHE_DIR = '~/.cache/squidpy'. Now everything uses DEFAULT_CACHE_DIR because it is a global path.
Updated .scripts/ci/download_data.py to download all datasets.
One unified DatasetDownloader class replaces the multiple ways of downloading datasets in squidpy.
Everything that was hard-coded (links, dataset names, hashes) now lives only in the datasets.yaml registry, and loaders are built from registry entries, e.g. visium_hne_image = _make_image_loader("visium_hne_image").
We no longer need to redownload everything when the script is updated. An update only triggers a run of the download script; the files downloaded by the old script remain in place.
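
To illustrate the last point, here is a minimal sketch of how a hash stored in a registry can make re-runs cheap: a cached file is reused when its digest matches, and fetched again otherwise. The names `REGISTRY`, `sha256_of`, `fetch`, and `fake_download` are hypothetical and are not taken from the PR; the choice of SHA-256 is also an assumption.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical registry: file name -> expected SHA-256 digest.
PAYLOAD = b"example dataset bytes"
REGISTRY = {"toy_dataset.h5ad": hashlib.sha256(PAYLOAD).hexdigest()}


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name: str, cache_dir: Path, download) -> tuple[Path, bool]:
    """Return (path, downloaded); skip the download when the cached hash matches."""
    expected = REGISTRY[name]
    path = cache_dir / name
    if path.is_file() and sha256_of(path) == expected:
        return path, False
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download(name))
    if sha256_of(path) != expected:
        path.unlink()  # do not keep a corrupted file in the cache
        raise ValueError(f"Hash mismatch for {name}")
    return path, True


def fake_download(name: str) -> bytes:
    """Stands in for the real S3 request (assumption for this sketch)."""
    return PAYLOAD


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    _, first = fetch("toy_dataset.h5ad", cache, fake_download)
    _, second = fetch("toy_dataset.h5ad", cache, fake_download)

print(first, second)  # True False
```

The second call finds a file whose digest matches the registry entry, so nothing is downloaded again.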

codecov · 2025-12-07T12:18:29Z

Codecov Report

❌ Patch coverage is 75.40453% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.45%. Comparing base (c653810) to head (34fbf40).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/squidpy/datasets/_downloader.py	62.59%	36 Missing and 13 partials ⚠️
src/squidpy/datasets/_registry.py	86.23%	8 Missing and 7 partials ⚠️
src/squidpy/datasets/_datasets.py	80.64%	7 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   66.60%   66.45%   -0.16%     
==========================================
  Files          45       44       -1     
  Lines        7017     7115      +98     
  Branches     1185     1199      +14     
==========================================
+ Hits         4674     4728      +54     
- Misses       1880     1913      +33     
- Partials      463      474      +11

Files with missing lines	Coverage Δ
src/squidpy/gr/_build.py	`88.47% <100.00%> (ø)`
src/squidpy/read/_read.py	`35.16% <100.00%> (-0.71%)`	⬇️
src/squidpy/read/_utils.py	`76.19% <100.00%> (+0.58%)`	⬆️
src/squidpy/datasets/_datasets.py	`80.64% <80.64%> (ø)`
src/squidpy/datasets/_registry.py	`86.23% <86.23%> (ø)`
src/squidpy/datasets/_downloader.py	`62.59% <62.59%> (ø)`

ilan-gold · 2025-12-15T12:34:08Z

src/squidpy/datasets/_downloader.py

+        self.cache_dir = Path(cache_dir) if cache_dir else Path(settings.datasetdir)
+        self.cache_dir.mkdir(parents=True, exist_ok=True)
+
+        self._registry = get_registry()


If a downloader is attached to the output of get_registry, and only one instance of the registry is shared throughout the codebase, what is the point of calling get_registry in other locations? Why not just make DatasetDownloader._registry public and always use that?

I updated my response @ilan-gold, I realized my first response was a bit unorganized.

Completely valid questions. Technically, as is, there won't be race conditions, since both are singletons.

But I made the registry public, since it is the downloader's registry that we are interested in. DatasetDownloader will also require a registry in its constructor to make this clearer.
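
A minimal sketch of the design described above, assuming a simplified registry: the downloader receives its registry through the constructor and exposes it as a public attribute, so callers consult `downloader.registry` rather than a separate lookup. The `Registry` class and the `check` method are hypothetical stand-ins, not the PR's actual code.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class Registry:
    """Hypothetical stand-in for the dataset registry."""

    names: frozenset = field(default_factory=frozenset)

    def __contains__(self, name: str) -> bool:
        return name in self.names


class DatasetDownloader:
    """Sketch: the registry is injected and exposed as a public attribute."""

    def __init__(self, registry: Registry, cache_dir: Path) -> None:
        self.registry = registry
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def check(self, sample_id: str) -> None:
        """Raise if the sample is not known to this downloader's registry."""
        if sample_id not in self.registry:
            available = sorted(self.registry.names)
            raise ValueError(f"Unknown sample: {sample_id}. Available: {available}")


with tempfile.TemporaryDirectory() as tmp:
    downloader = DatasetDownloader(Registry(frozenset({"visium_hne_image"})), Path(tmp))
    downloader.check("visium_hne_image")  # known name, no error
```

With this shape there is only one registry object per downloader, so the membership check and the download cannot disagree about which datasets exist.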

ilan-gold · 2025-12-15T12:38:12Z

.scripts/ci/download_data.py

+    for name in registry.anndata_datasets:
+        obj = getattr(sq.datasets, name)()
+        assert isinstance(obj, AnnData)

-            assert path.is_dir(), f"Expected a .zarr folder at {path}"
-            continue
+    for name in registry.image_datasets:
+        obj = getattr(sq.datasets, name)()
+        assert isinstance(obj, sq.im.ImageContainer)

-        path = _ROOT / f"{func_name}.{ext}"
-        _print_message(func_name, path)
-        obj = _maybe_download_data(func_name, path)
+    for name in registry.spatialdata_datasets:
+        getattr(sq.datasets, name)()


Why use get_registry if you can just use the constants you define in datasets/__init__.py?

Those are Literal type annotations for the IDE, the docs and the users. I am not sure we should use them as our source of truth in code.
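
One way to keep the two in step, sketched under assumptions: runtime code iterates the registry, while a small check compares the `Literal` annotation against it. `ImageDataset`, `REGISTRY_IMAGE_DATASETS`, and the name `toy_image` are hypothetical; only `typing.get_args` is a real standard-library call.

```python
from typing import Literal, get_args

# Hypothetical annotation, as it might be written for IDEs and docs.
ImageDataset = Literal["visium_hne_image", "toy_image"]

# Hypothetical runtime registry contents (in the PR these come from datasets.yaml).
REGISTRY_IMAGE_DATASETS = ("visium_hne_image", "toy_image")


def literal_matches_registry() -> bool:
    """True when the annotation and the runtime registry list the same names."""
    return set(get_args(ImageDataset)) == set(REGISTRY_IMAGE_DATASETS)


# Runtime code iterates the registry; the Literal is only cross-checked.
names_seen = [name for name in REGISTRY_IMAGE_DATASETS]
in_sync = literal_matches_registry()
print(names_seen, in_sync)
```

A check like this could live in a unit test, so the annotations stay documentation while the registry stays the runtime source.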

ilan-gold · 2025-12-15T12:41:07Z

src/squidpy/datasets/_datasets.py

+    registry = get_registry()
+    if sample_id not in registry:
+        msg = f"Unknown Visium sample: {sample_id}. "
+        msg += f"Available samples: {registry.visium_datasets}"
+        raise ValueError(msg)


To clarify my point about DatasetDownloader._registry: I think it would be simpler to write `if sample_id not in downloader.registry`. What if the registry returned by get_registry were different from the dataset downloader's registry (perhaps because of a bug or a race condition)?

flying-sheep

Looks pretty great!

I found one issue (trying to “illegally” turn importlib.resources.abc.Traversable into a Path), otherwise all looks good!
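
For context, a short sketch of the two standard-library ways to use a packaged resource without assuming it is a `Path`. The standard-library `json` package is used here only so the example runs without extra files; it is not what the PR reads.

```python
from importlib.resources import as_file, files
from pathlib import Path

# files() returns a Traversable, which is not guaranteed to be a filesystem path.
resource = files("json") / "__init__.py"

# Option 1: stay within the Traversable API (no filesystem path required).
text = resource.read_text(encoding="utf-8")

# Option 2: when a real path is needed, as_file() provides one, extracting
# the resource to a temporary file if it does not already live on disk.
with as_file(resource) as path:
    is_path = isinstance(path, Path)
    size_on_disk = path.stat().st_size

print(len(text) > 0, is_path, size_on_disk > 0)
```

Option 1 is usually enough for reading a small text file such as a YAML registry; option 2 is for APIs that insist on a path.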

Please check what happens with the cache when you specify different options for the dataset loader functions. E.g. I checked include_hires_tiff and if I understood the code right, it seems to work perfectly:

when it’s True, the hires image gets added to the cache (even if the rest is already cached)
when it’s False, the hires image stays in the cache if it’s there, but the returned value has no hires images.

But please check that that’s true!
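
The expected behaviour can be written down as a small, self-contained check. This is a sketch of the described semantics only, not the PR's implementation: `load_sample`, the file names, and the stub "download" are all hypothetical.

```python
import tempfile
from pathlib import Path

BASE_FILES = ("counts.h5", "spatial.tar.gz")  # hypothetical file names
HIRES_FILE = "image.tif"  # hypothetical file name


def load_sample(cache_dir: Path, fetched: list, include_hires_tiff: bool = False) -> list:
    """Fetch only files missing from the cache; return the files used by this call."""
    wanted = list(BASE_FILES) + ([HIRES_FILE] if include_hires_tiff else [])
    for name in wanted:
        target = cache_dir / name
        if not target.exists():
            target.write_bytes(b"stub")  # stands in for a real download
            fetched.append(name)
    return wanted


with tempfile.TemporaryDirectory() as tmp:
    cache, fetched = Path(tmp), []
    load_sample(cache, fetched)  # fills the base files
    load_sample(cache, fetched, include_hires_tiff=True)  # adds only the hires image
    result = load_sample(cache, fetched, include_hires_tiff=False)  # fetches nothing
    hires_still_cached = (cache / HIRES_FILE).exists()

print(fetched)  # ['counts.h5', 'spatial.tar.gz', 'image.tif']
print(result, hires_still_cached)  # ['counts.h5', 'spatial.tar.gz'] True
```

Each file is fetched exactly once, the hires image stays in the cache after it has been requested, and a later call without the flag does not return it.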

flying-sheep

Awesome! Some small nitpicks, otherwise this looks great.

I like how the _make_loader deduplication turned out: very elegant with the entry.type.
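
For readers who have not opened the diff, the idea of one loader factory that dispatches on the entry type can be sketched roughly as follows. This is a hypothetical reimplementation for illustration: the `Entry` fields, the type labels, and the string "readers" are assumptions, and the real functions return AnnData or ImageContainer objects rather than strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Entry:
    name: str
    type: str  # hypothetical type labels: "anndata" or "image"


REGISTRY = {
    "toy_adata": Entry("toy_adata", "anndata"),
    "visium_hne_image": Entry("visium_hne_image", "image"),
}

# One reader per entry type; the strings stand in for real objects.
READERS: dict[str, Callable[[Entry], str]] = {
    "anndata": lambda entry: f"AnnData<{entry.name}>",
    "image": lambda entry: f"ImageContainer<{entry.name}>",
}


def _make_loader(name: str) -> Callable[[], str]:
    """Build a zero-argument loader whose behaviour depends on entry.type."""
    entry = REGISTRY[name]
    reader = READERS[entry.type]

    def loader() -> str:
        return reader(entry)

    loader.__name__ = name
    return loader


visium_hne_image = _make_loader("visium_hne_image")
toy_adata = _make_loader("toy_adata")
print(visium_hne_image(), toy_adata())
```

A single factory replaces one factory per dataset kind, because the registry entry already says which reader applies.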

selmanozleyen · 2025-12-16T11:04:00Z

I will add the same cache to the notebooks CI to speed it up. I noticed that plenty of the notebooks use the datasets.

selmanozleyen and others added 4 commits December 6, 2025 07:19

init

1f0ef4e

[pre-commit.ci] auto fixes from pre-commit.com hooks

0a063ce

for more information, see https://pre-commit.ci

Merge branch 'main' into add-dataset-hashes

b301fdc

linter errors

2f7cd58

selmanozleyen mentioned this pull request Dec 6, 2025

Adding a registry to have the hashes of datasets #1072

Closed

selmanozleyen added 2 commits December 6, 2025 07:28

readthedocs fix

63a5d58

extension bug fix

a2cb8bc

selmanozleyen and others added 9 commits December 7, 2025 17:23

Merge branch 'main' into add-dataset-hashes

8e3685e

use cache dir

03288db

all downloads cache to squidpy default. Don't use scanpy default sinc…

b06f7a7

…e its relative. It's fine if set it globally

format

e7f8399

add docs

dc21ced

PathLike refactor

5f75195

redirect notebooks to the correct module

001a733

update script

ec6b02b

[pre-commit.ci] auto fixes from pre-commit.com hooks

f086493

for more information, see https://pre-commit.ci

selmanozleyen self-assigned this Dec 8, 2025

selmanozleyen added 2 commits December 8, 2025 15:38

since we have the hash of downloaded files we don't need to update fo…

7ce86a3

…r each new script

Merge branch 'add-dataset-hashes' of https://github.com/selmanozleyen…

b4a0074

…/squidpy into add-dataset-hashes

selmanozleyen requested review from flying-sheep, ilan-gold and timtreis December 9, 2025 09:39

timtreis requested changes Dec 9, 2025

ilan-gold requested changes Dec 9, 2025

selmanozleyen added 5 commits December 9, 2025 14:16

update script

2adb236

format

09e3cba

remove agent spoofing

d51b08c

remove fallbacks

8e99d3c

if path is not None

383c820

selmanozleyen requested a review from flying-sheep December 15, 2025 12:36

ilan-gold requested changes Dec 15, 2025

selmanozleyen added 7 commits December 15, 2025 13:48

remove fallback test

b45e8e3

added comment about visium_hne_sdata and increased timeout

f430f82

clarify docstring

83f2948

remove fallback urls - I thought I already did :(

68663aa

completely remove fallback from codebase

d241b6f

make registry thing clearer

fca5d9c

update test_downloader

0b92d67

flying-sheep requested changes Dec 15, 2025

selmanozleyen and others added 11 commits December 15, 2025 16:39

fix small mistake with file entry

fe44ee1

remove unused functions like is_single_file property

152b918

is_adata_with_image property has_hires_image property

rename to visium_10x for the format

9d91060

raise an ExceptionGroup

0643423

explicit emptyness check

1b3a3ff

apply @flying-sheep's suggestion

1f66d62

apply previous suggestion to other places

605e47b

apply Traversable suggestion

91373b5

get rid of the first_entry thing

2ed76d0

add cache test

e2288b3

[pre-commit.ci] auto fixes from pre-commit.com hooks

c9e2eb5

for more information, see https://pre-commit.ci

flying-sheep approved these changes Dec 16, 2025

selmanozleyen and others added 2 commits December 16, 2025 10:57

Update src/squidpy/datasets/_registry.py

23f7873

Co-authored-by: Philipp A. <[email protected]>

Update src/squidpy/datasets/_registry.py

0e2a581

Co-authored-by: Philipp A. <[email protected]>

selmanozleyen mentioned this pull request Dec 16, 2025

Consider having a _typing.py or a way to restructure common type aliases #1085

Open

selmanozleyen and others added 3 commits December 16, 2025 11:22

import pathlike

30097d4

Merge branch 'main' into add-dataset-hashes

721ab0a

add cache to notebook ci's as well

34fbf40

selmanozleyen merged commit d4f256f into scverse:main Dec 16, 2025
13 checks passed


5 participants
