-
Notifications
You must be signed in to change notification settings - Fork 101
Adding a registry to have the hashes of datasets (restructured for aws s3) #1076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a registry to have the hashes of datasets (restructured for aws s3) #1076
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1076 +/- ##
==========================================
- Coverage 66.60% 66.45% -0.16%
==========================================
Files 45 44 -1
Lines 7017 7115 +98
Branches 1185 1199 +14
==========================================
+ Hits 4674 4728 +54
- Misses 1880 1913 +33
- Partials 463 474 +11
🚀 New features to boost your workflow:
|
…e its relative. It's fine if set it globally
for more information, see https://pre-commit.ci
…r each new script
…/squidpy into add-dataset-hashes
src/squidpy/datasets/_downloader.py
Outdated
| self.cache_dir = Path(cache_dir) if cache_dir else Path(settings.datasetdir) | ||
| self.cache_dir.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| self._registry = get_registry() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a downloader is attached to the output of get_registry, and only one instance of the registry is shared throughout the codebase, what is the point of calling get_registry in other locations? Why not just make DatasetDownloader._registry public and always use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated my response @ilan-gold, I realized my first response was a bit unorganized.
Completely valid questions. Technically as is there won't be race-conditions as they are both singlethons.
But I made registry public since it will be that downloaders registry we are interested in. And DatasetDownloader will require a registry in constructor to make it clearer.
| for name in registry.anndata_datasets: | ||
| obj = getattr(sq.datasets, name)() | ||
| assert isinstance(obj, AnnData) | ||
|
|
||
| assert path.is_dir(), f"Expected a .zarr folder at {path}" | ||
| continue | ||
| for name in registry.image_datasets: | ||
| obj = getattr(sq.datasets, name)() | ||
| assert isinstance(obj, sq.im.ImageContainer) | ||
|
|
||
| path = _ROOT / f"{func_name}.{ext}" | ||
| _print_message(func_name, path) | ||
| obj = _maybe_download_data(func_name, path) | ||
| for name in registry.spatialdata_datasets: | ||
| getattr(sq.datasets, name)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use get_registry if you can just use the constants you define in datasets/__init__.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are Literal type annotations for the IDE, docs and the users. I am not sure we should use them as our source of information in a code.
src/squidpy/datasets/_datasets.py
Outdated
| registry = get_registry() | ||
| if sample_id not in registry: | ||
| msg = f"Unknown Visium sample: {sample_id}. " | ||
| msg += f"Available samples: {registry.visium_datasets}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify my point about DatasetDownloader._registry, I think that it would be simpler to do if sample_id not in downloader.registry - what if the registry were different in get_registry from the dataset downloader registry (perhaps because of a bug or race condition)?
flying-sheep
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty great!
I found one issue (trying to “illegally” turn importlib.resources.abc.Traversable into a Path), otherwise all looks good!
Please check what happens with the cache when you specify different options for the dataset loader functions. E.g. I checked include_hires_tiff and if I understood the code right, it seems to work perfectly:
- when it’s
True, the hires image gets added to the cache (even if the rest is already cached) - when it’s
False, the hires image stays in the cache if it’s there, but the returned value has no hires images.
But please check that that’s true!
is_adata_with_image property has_hires_image property
for more information, see https://pre-commit.ci
flying-sheep
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Some small nitpicks, otherwise this looks great.
I like how the _make_loader deduplication turned out: very elegant with the entry.type.
Co-authored-by: Philipp A. <[email protected]>
Co-authored-by: Philipp A. <[email protected]>
|
I will add the same cache to the notebooks CI to speed them up. I noticed plenty of them are using it |
again a continuation of #1072 to add hashes and use the uploaded datasets
changes made
visium_hne_image = _make_image_loader("visium_hne_image")