Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Dec 6, 2025

Again, a continuation of #1072 to add hashes and use the uploaded datasets.

changes made

  • Unit tests no longer download data, since downloads are costly for S3 and we know the links should be active because we host them.
  • Some datasets were using the scanpy defaults and others were using DEFAULT_CACHE_DIR = '~/.cache/squidpy'. Now everything uses DEFAULT_CACHE_DIR because it is a global path.
  • Updated .scripts/ci/download_data.py to download all datasets.
  • One unified DatasetDownloader class replaces the multiple ways of downloading datasets in squidpy.
  • Everything hard-coded, such as dataset names, links, and hashes, now lives only in the datasets.yaml registry. Loaders are built from the registry, e.g. visium_hne_image = _make_image_loader("visium_hne_image").
  • We no longer need to redownload everything when the script is updated. An update only triggers a rerun of the download script; the files downloaded by the old script remain.
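
To illustrate how registry hashes can let a downloader skip files that are already cached, here is a minimal sketch. It is not the code from this PR: DatasetEntry, sha256_of, is_cached, the field names, and the placeholder URL are all assumptions made for the example.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetEntry:
    """One registry record; the field names are illustrative assumptions."""

    name: str
    type: str  # e.g. "anndata" or "image"
    url: str
    sha256: str


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached(entry: DatasetEntry, cache_dir: Path) -> bool:
    """A file counts as cached only if it exists and its hash matches."""
    path = cache_dir / entry.name
    return path.is_file() and sha256_of(path) == entry.sha256


# Demo with a temporary directory standing in for the cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    payload = b"example bytes"
    entry = DatasetEntry(
        name="toy.h5ad",
        type="anndata",
        url="https://example.invalid/toy.h5ad",  # placeholder, not a real link
        sha256=hashlib.sha256(payload).hexdigest(),
    )
    missing = is_cached(entry, cache)  # nothing on disk yet
    (cache / entry.name).write_bytes(payload)
    intact = is_cached(entry, cache)  # file present and hash matches
    (cache / entry.name).write_bytes(b"corrupted")
    corrupted = is_cached(entry, cache)  # file present but hash differs
```

With a check like this, changing the download script does not invalidate files that still match their registered hash; only missing or mismatching files would need to be fetched again.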

@codecov

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 75.40453% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.45%. Comparing base (c653810) to head (34fbf40).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/squidpy/datasets/_downloader.py 62.59% 36 Missing and 13 partials ⚠️
src/squidpy/datasets/_registry.py 86.23% 8 Missing and 7 partials ⚠️
src/squidpy/datasets/_datasets.py 80.64% 7 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   66.60%   66.45%   -0.16%     
==========================================
  Files          45       44       -1     
  Lines        7017     7115      +98     
  Branches     1185     1199      +14     
==========================================
+ Hits         4674     4728      +54     
- Misses       1880     1913      +33     
- Partials      463      474      +11     
Files with missing lines Coverage Δ
src/squidpy/gr/_build.py 88.47% <100.00%> (ø)
src/squidpy/read/_read.py 35.16% <100.00%> (-0.71%) ⬇️
src/squidpy/read/_utils.py 76.19% <100.00%> (+0.58%) ⬆️
src/squidpy/datasets/_datasets.py 80.64% <80.64%> (ø)
src/squidpy/datasets/_registry.py 86.23% <86.23%> (ø)
src/squidpy/datasets/_downloader.py 62.59% <62.59%> (ø)

@selmanozleyen selmanozleyen self-assigned this Dec 8, 2025
self.cache_dir = Path(cache_dir) if cache_dir else Path(settings.datasetdir)
self.cache_dir.mkdir(parents=True, exist_ok=True)

self._registry = get_registry()
Contributor

If a downloader is attached to the output of get_registry, and only one instance of the registry is shared throughout the codebase, what is the point of calling get_registry in other locations? Why not just make DatasetDownloader._registry public and always use that?

Member Author

@selmanozleyen selmanozleyen Dec 15, 2025

I updated my response, @ilan-gold; I realized my first response was a bit unorganized.

Completely valid questions. Technically, as is, there won't be race conditions, since both are singletons.

But I made the registry public, since it is that downloader's registry we are interested in. DatasetDownloader will also require a registry in its constructor to make this clearer.
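
A rough sketch of the design described here, with the registry passed into the constructor and exposed as a public attribute. The Registry class, the check method, and the placeholder URL are hypothetical stand-ins, not the actual squidpy implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical stand-in for the dataset registry."""

    entries: dict[str, str] = field(default_factory=dict)  # name -> URL

    def __contains__(self, name: object) -> bool:
        return name in self.entries


class DatasetDownloader:
    def __init__(self, registry: Registry) -> None:
        # The registry is a required argument and a public attribute, so
        # callers validate against the same object the downloader uses.
        self.registry = registry

    def check(self, name: str) -> None:
        """Raise ValueError if the name is not in this downloader's registry."""
        if name not in self.registry:
            available = sorted(self.registry.entries)
            raise ValueError(f"Unknown dataset: {name}. Available: {available}")


registry = Registry({"visium_hne_image": "https://example.invalid/hne"})
downloader = DatasetDownloader(registry)
downloader.check("visium_hne_image")  # known name, passes silently
```

Because validation goes through downloader.registry, there is no second lookup path that could disagree with the registry the downloader actually reads from.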

Comment on lines +45 to +54
for name in registry.anndata_datasets:
obj = getattr(sq.datasets, name)()
assert isinstance(obj, AnnData)

assert path.is_dir(), f"Expected a .zarr folder at {path}"
continue
for name in registry.image_datasets:
obj = getattr(sq.datasets, name)()
assert isinstance(obj, sq.im.ImageContainer)

path = _ROOT / f"{func_name}.{ext}"
_print_message(func_name, path)
obj = _maybe_download_data(func_name, path)
for name in registry.spatialdata_datasets:
getattr(sq.datasets, name)()
Contributor

Why use get_registry if you can just use the constants you define in datasets/__init__.py?

Member Author

@selmanozleyen selmanozleyen Dec 15, 2025

Those are Literal type annotations for the IDE, the docs, and users. I am not sure we should use them as our source of information in code.
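
One possible middle ground, sketched under assumptions: keep the registry as the runtime source of truth and only compare the Literal annotation against it in a test. The VisiumDataset alias and the sample names below are made up for illustration.

```python
from typing import Literal, get_args

# Hypothetical annotation of the kind kept for IDEs and docs.
VisiumDataset = Literal["sample_a", "sample_b"]

# Hypothetical runtime registry contents (in the PR these come from datasets.yaml).
registry_names = {"sample_a", "sample_b"}

# typing.get_args returns the literal values as a tuple.
literal_names = set(get_args(VisiumDataset))

# A check like this flags drift without making the annotation the source of truth.
in_sync = literal_names == registry_names
```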

Comment on lines 127 to 131
registry = get_registry()
if sample_id not in registry:
    msg = f"Unknown Visium sample: {sample_id}. "
    msg += f"Available samples: {registry.visium_datasets}"
    raise ValueError(msg)
Contributor

To clarify my point about DatasetDownloader._registry: I think it would be simpler to do if sample_id not in downloader.registry. What if the registry returned by get_registry were different from the dataset downloader's registry (perhaps because of a bug or race condition)?

Member

@flying-sheep flying-sheep left a comment

Looks pretty great!

I found one issue (trying to “illegally” turn importlib.resources.abc.Traversable into a Path), otherwise all looks good!

Please check what happens with the cache when you specify different options for the dataset loader functions. E.g. I checked include_hires_tiff and if I understood the code right, it seems to work perfectly:

  • when it’s True, the hires image gets added to the cache (even if the rest is already cached)
  • when it’s False, the hires image stays in the cache if it’s there, but the returned value has no hires images.

But please check that that’s true!
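
The two cache behaviors listed above could be checked with something along these lines. This is only a sketch: load_sample is a hypothetical loader that writes placeholder files instead of downloading, and the file names are invented.

```python
import tempfile
from pathlib import Path


def load_sample(cache_dir: Path, include_hires_tiff: bool = False) -> list[str]:
    """Hypothetical loader: 'downloads' by writing placeholder files."""
    base = cache_dir / "counts.h5"
    hires = cache_dir / "image.tif"
    if not base.exists():
        base.write_bytes(b"counts")  # stands in for a real download
    if include_hires_tiff and not hires.exists():
        hires.write_bytes(b"tiff")  # fetched only on request
    # Return only what the caller asked for, regardless of what is cached.
    return [base.name] + ([hires.name] if include_hires_tiff else [])


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = load_sample(cache)  # base files only
    hires_absent = not (cache / "image.tif").exists()
    second = load_sample(cache, include_hires_tiff=True)  # adds hires to the cache
    third = load_sample(cache)  # hires stays cached but is not returned
    hires_kept = (cache / "image.tif").exists()
```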

Member

@flying-sheep flying-sheep left a comment

Awesome! Some small nitpicks, otherwise this looks great.

I like how the _make_loader deduplication turned out: very elegant with the entry.type.

@selmanozleyen
Member Author

I will add the same cache to the notebooks CI to speed them up. I noticed plenty of them are using it.

@selmanozleyen selmanozleyen merged commit d4f256f into scverse:main Dec 16, 2025
13 checks passed
Development

Successfully merging this pull request may close these issues.

Put datasets on S3

5 participants