Adding a registry to have the hashes of datasets (restructured for aws s3) #1076
Merged
selmanozleyen merged 59 commits into scverse:main from selmanozleyen:add-dataset-hashes on Dec 16, 2025
Changes from 36 commits
1f0ef4e selmanozleyen: init
0a063ce pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
b301fdc selmanozleyen: Merge branch 'main' into add-dataset-hashes
2f7cd58 selmanozleyen: linter errors
63a5d58 selmanozleyen: readthedocs fix
a2cb8bc selmanozleyen: extension bug fix
8e3685e selmanozleyen: Merge branch 'main' into add-dataset-hashes
03288db selmanozleyen: use cache dir
b06f7a7 selmanozleyen: all downloads cache to squidpy default. Don't use scanpy default sinc…
e7f8399 selmanozleyen: format
dc21ced selmanozleyen: add docs
5f75195 selmanozleyen: PathLike refactor
001a733 selmanozleyen: redirect notebooks to the correct module
ec6b02b selmanozleyen: update script
f086493 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
7ce86a3 selmanozleyen: since we have the hash of downloaded files we don't need to update fo…
b4a0074 selmanozleyen: Merge branch 'add-dataset-hashes' of https://github.com/selmanozleyen…
2adb236 selmanozleyen: update script
09e3cba selmanozleyen: format
d51b08c selmanozleyen: remove agent spoofing
8e99d3c selmanozleyen: remove fallbacks
383c820 selmanozleyen: if path is not None
adc9d53 selmanozleyen: more structured logic
d38caf2 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
21be892 selmanozleyen: resolve comments
d0e3b29 selmanozleyen: fix logging import
6546509 selmanozleyen: replace all "from scanpy import logging as logg" with spatiadata loggers
99e6681 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
dc0f561 selmanozleyen: clarify comments
4fb80da selmanozleyen: Merge branch 'main' into add-dataset-hashes
3cb23da selmanozleyen: use sc.settings
ff967a8 selmanozleyen: fix datapath
c042f17 selmanozleyen: spatialdata logger doesn't accept time. will create an issue about lo…
e2ef094 selmanozleyen: revert logging thing to put it to a separate issue
595e744 selmanozleyen: fix blunder
b45cc6a selmanozleyen: fix path comparison
b45e8e3 selmanozleyen: remove fallback test
f430f82 selmanozleyen: added comment about visium_hne_sdata and increased timeout
83f2948 selmanozleyen: clarify docstring
68663aa selmanozleyen: remove fallback urls - I thought I already did :(
d241b6f selmanozleyen: completely remove fallback from codebase
fca5d9c selmanozleyen: make registry thing clearer
0b92d67 selmanozleyen: update test_downloader
fe44ee1 selmanozleyen: fix small mistake with file entry
152b918 selmanozleyen: remove unused functions like is_single_file property
9d91060 selmanozleyen: rename to visium_10x for the format
0643423 selmanozleyen: raise an ExceptionGroup
1b3a3ff selmanozleyen: explicit emptyness check
1f66d62 selmanozleyen: apply @flying-sheep's suggestion
605e47b selmanozleyen: apply previous suggestion to other places
91373b5 selmanozleyen: apply Traversable suggestion
2ed76d0 selmanozleyen: get rid of the first_entry thing
e2288b3 selmanozleyen: add cache test
c9e2eb5 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
23f7873 selmanozleyen: Update src/squidpy/datasets/_registry.py
0e2a581 selmanozleyen: Update src/squidpy/datasets/_registry.py
30097d4 selmanozleyen: import pathlike
721ab0a selmanozleyen: Merge branch 'main' into add-dataset-hashes
34fbf40 selmanozleyen: add cache to notebook ci's as well
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,77 +1,69 @@ | ||
| #!/usr/bin/env python3 | ||
| """Download datasets to populate CI cache. | ||
| This script downloads all datasets that tests might need. | ||
| The downloader handles caching to scanpy.settings.datasetdir. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import argparse | ||
| from pathlib import Path | ||
| from typing import Any | ||
|
|
||
| from squidpy.datasets import visium_hne_sdata | ||
| from scanpy import settings | ||
| from spatialdata._logging import logger | ||
|
|
||
| _CNT = 0 # increment this when you want to rebuild the CI cache | ||
| _ROOT = Path.home() / ".cache" / "squidpy" | ||
|
|
||
|
|
||
| def _print_message(func_name: str, path: Path, *, dry_run: bool = False) -> None: | ||
| prefix = "[DRY RUN]" if dry_run else "" | ||
| if path.is_file(): | ||
| print(f"{prefix}[Loading] {func_name:>25} <- {str(path):>25}") | ||
| else: | ||
| print(f"{prefix}[Downloading] {func_name:>25} -> {str(path):>25}") | ||
|
|
||
|
|
||
| def _maybe_download_data(func_name: str, path: Path) -> Any: | ||
| import squidpy as sq | ||
|
|
||
| try: | ||
| return getattr(sq.datasets, func_name)(path=path) | ||
| except Exception as e: # noqa: BLE001 | ||
| print(f"File {str(path):>25} seems to be corrupted: {e}. Removing and retrying") | ||
| path.unlink() | ||
|
|
||
| return getattr(sq.datasets, func_name)(path=path) | ||
|
|
||
|
|
||
| def main(args: argparse.Namespace) -> None: | ||
| from anndata import AnnData | ||
|
|
||
| import squidpy as sq | ||
| from squidpy.datasets._registry import get_registry | ||
|
|
||
| all_datasets = sq.datasets._dataset.__all__ + sq.datasets._image.__all__ | ||
| all_extensions = ["h5ad"] * len(sq.datasets._dataset.__all__) + ["tiff"] * len(sq.datasets._image.__all__) | ||
| registry = get_registry() | ||
|
|
||
| # Visium samples tested in CI | ||
| visium_samples_to_cache = [ | ||
| "V1_Mouse_Kidney", | ||
| "Targeted_Visium_Human_SpinalCord_Neuroscience", | ||
| "Visium_FFPE_Human_Breast_Cancer", | ||
| ] | ||
|
|
||
| if args.dry_run: | ||
| for func_name, ext in zip(all_datasets, all_extensions): | ||
| if func_name == "visium_hne_sdata": | ||
| ext = "zarr" | ||
| path = _ROOT / f"{func_name}.{ext}" | ||
| _print_message(func_name, path, dry_run=True) | ||
| logger.info("Cache: %s", settings.datasetdir) | ||
| logger.info( | ||
| "Would download: %d AnnData, %d images, %d SpatialData, %d Visium", | ||
| len(registry.anndata_datasets), | ||
| len(registry.image_datasets), | ||
| len(registry.spatialdata_datasets), | ||
| len(visium_samples_to_cache), | ||
| ) | ||
| return | ||
|
|
||
| # could be parallelized, but on CI it largely does not matter (usually limited to 2 cores + bandwidth limit) | ||
| for func_name, ext in zip(all_datasets, all_extensions): | ||
| if func_name == "visium_hne_sdata": | ||
| ext = "zarr" | ||
| path = _ROOT / f"{func_name}.{ext}" | ||
|
|
||
| _print_message(func_name, path) | ||
| obj = visium_hne_sdata(_ROOT) | ||
| # Download all datasets - the downloader handles caching | ||
| for name in registry.anndata_datasets: | ||
| obj = getattr(sq.datasets, name)() | ||
| assert isinstance(obj, AnnData) | ||
|
|
||
| assert path.is_dir(), f"Expected a .zarr folder at {path}" | ||
| continue | ||
| for name in registry.image_datasets: | ||
| obj = getattr(sq.datasets, name)() | ||
| assert isinstance(obj, sq.im.ImageContainer) | ||
|
|
||
| path = _ROOT / f"{func_name}.{ext}" | ||
| _print_message(func_name, path) | ||
| obj = _maybe_download_data(func_name, path) | ||
| for name in registry.spatialdata_datasets: | ||
| getattr(sq.datasets, name)() | ||
|
|
||
| # we could do without the AnnData check as well (1 less req. in tox.ini), but it's better to be safe | ||
| assert isinstance(obj, AnnData | sq.im.ImageContainer), type(obj) | ||
| assert path.is_file(), path | ||
| for sample in visium_samples_to_cache: | ||
| obj = sq.datasets.visium(sample, include_hires_tiff=True) | ||
| assert isinstance(obj, AnnData) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| parser = argparse.ArgumentParser(description="Download data used for tutorials/examples.") | ||
| parser = argparse.ArgumentParser(description="Download datasets to populate CI cache.") | ||
| parser.add_argument( | ||
| "--dry-run", action="store_true", help="Do not download any data, just print what would be downloaded." | ||
| "--dry-run", | ||
| action="store_true", | ||
| help="Do not download, just print what would be downloaded.", | ||
| ) | ||
|
|
||
| main(parser.parse_args()) | ||
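The script above relies on the downloader to decide whether a cached file can be reused. The PR title and commit messages indicate that the registry stores file hashes for this purpose, but the actual registry code is not shown here. The following is only a minimal sketch of that idea, assuming SHA-256 digests; `FileEntry`, `sha256_of`, and `is_cached` are hypothetical names and not squidpy's API.

```python
from __future__ import annotations

import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical registry entry: a file name plus its expected SHA-256 digest."""

    name: str
    sha256: str


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached(entry: FileEntry, cache_dir: Path) -> bool:
    """A cached file is reusable only if it exists and its digest matches the entry."""
    path = cache_dir / entry.name
    return path.is_file() and sha256_of(path) == entry.sha256


# Toy demonstration with a temporary directory standing in for the cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    payload = b"example bytes"
    (cache / "toy.h5ad").write_bytes(payload)

    good = FileEntry("toy.h5ad", hashlib.sha256(payload).hexdigest())
    bad = FileEntry("toy.h5ad", "0" * 64)  # digest mismatch: treat as stale or corrupted
    missing = FileEntry("absent.h5ad", good.sha256)  # file was never downloaded

    results = (is_cached(good, cache), is_cached(bad, cache), is_cached(missing, cache))
```

With a check of this kind in the downloader, a caller such as the CI script has no need for its own "remove and retry" logic, which would be consistent with `_maybe_download_data` appearing only among the old lines of the diff.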
This file was deleted.
Why use `get_registry` if you can just use the constants you define in `datasets/__init__.py`?
Those are `Literal` type annotations for the IDE, the docs, and users. I am not sure we should use them as our source of information in code.
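To make the distinction in this exchange concrete, here is a small sketch assuming a `Literal` alias and a registry object that both list dataset names. `AnnDataDatasetName`, `Registry`, and the `toy_*` names are hypothetical stand-ins, not the identifiers used in squidpy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical alias of the kind a package might expose for IDEs and docs.
AnnDataDatasetName = Literal["toy_a", "toy_b"]


@dataclass(frozen=True)
class Registry:
    """Hypothetical runtime registry; in the PR this role is played by get_registry()."""

    anndata_datasets: tuple[str, ...] = ("toy_a", "toy_b")


def names_from_literal() -> tuple[str, ...]:
    """Recover the names from the annotation; possible, but it couples logic to typing."""
    return get_args(AnnDataDatasetName)


def literal_matches_registry(registry: Registry) -> bool:
    """A consistency check that treats the registry as the source of truth."""
    return set(names_from_literal()) == set(registry.anndata_datasets)


registry = Registry()
in_sync = literal_matches_registry(registry)
```

Under this arrangement the annotation stays a documentation aid, while runtime code iterates over the registry. A test like `literal_matches_registry` could keep the two from drifting apart.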