Function to subset the entire spatialdata object #1007

Open
timtreis opened this issue May 26, 2025 · 3 comments · May be fixed by #967
Labels: enhancement ✨ (New feature or request), squidpy2.0 (Everything related to a Squidpy 2.0 release)

Comments

timtreis (Member) commented May 26, 2025

A little while ago I started PR #967, which introduces a function that lets users subset their entire SpatialData object by certain criteria. The larger goal is to emulate this Scanpy notebook and to make Squidpy the biologist-friendly interface to SpatialData. Subsetting your object will be the first step in that journey.

For this, there are several considerations:

  • a given SpatialData object can contain 0-n AnnData objects
  • these AnnData objects can annotate 0-n other elements, e.g. segmentation masks, shapes (like for Visium), ROIs, or even points
  • a given subsetting step on the AnnData object needs to find all instances that are annotated by these soon-to-be-gone observations in all other elements and deal with them accordingly:
    • segmentation masks -> set to 0 (background)
      • potentially: remove transcript locations falling into these segmentation masks
    • shapes -> remove
    • points -> remove
    • etc
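The per-element handling listed above could be sketched roughly as follows. This is a toy illustration using plain NumPy and a dict, not the SpatialData API: `drop_instances`, the `shapes` mapping, and the example IDs are all hypothetical stand-ins, and it assumes instance IDs in the label mask match the keys used for the shapes.

```python
import numpy as np


def drop_instances(labels, shapes, removed_ids):
    """Blank removed instances in a label mask and drop them from a shapes mapping.

    labels: integer array, 0 is background, other values are instance IDs.
    shapes: dict mapping instance ID -> geometry (hypothetical stand-in for a shapes element).
    removed_ids: instance IDs whose observations were filtered out of the table.
    """
    removed = np.asarray(list(removed_ids), dtype=labels.dtype)
    removed_set = set(removed.tolist())
    # Segmentation mask: pixels of removed instances become background (0).
    cleaned = np.where(np.isin(labels, removed), 0, labels)
    # Shapes (and, analogously, points): annotated entries are removed entirely.
    kept_shapes = {k: v for k, v in shapes.items() if k not in removed_set}
    return cleaned, kept_shapes


labels = np.array([[0, 1, 1], [2, 2, 3], [0, 3, 3]])
shapes = {1: "circle_1", 2: "circle_2", 3: "circle_3"}
cleaned, kept = drop_instances(labels, shapes, removed_ids=[2])
```

Here instance 2 is zeroed in the mask and its shape entry is dropped, while instances 1 and 3 are left untouched.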

However, there are additional constraints and open questions that are important for the implementation.

  • We can, e.g., store segmentation masks as DataTrees with different scales. Is it faster to subset the original resolution and then regenerate the tree, or to subset all scales individually?
  • How do we handle inplace=True vs. inplace=False? Returning a copy can easily mean doubling a 500 GB object.

Some other edge cases might only really show up once we get there.

Generally, the goal should be to identify the relevant subfunctions and push them upstream to SpatialData. Some might already exist there and just need to be found (realistically by asking @LucaMarconato, since there are quite a few functions only he really knows about); others might need to be written and pushed upstream. Ideally, Squidpy then chains these functions together into something with good UX.

timtreis added the enhancement ✨ and squidpy2.0 labels on May 26, 2025
timtreis linked a pull request on May 26, 2025 that will close this issue

LucaMarconato (Member) commented

> We can f.e. store segmentation masks as DataTrees with different scales - is it faster to subset the original resolution and to then regenerate the tree or subset all scales individually?

I'd try benchmarking this. It's all lazy anyway until things are written to disk. One important consideration is whether we want to subset and then access the data in memory. In that case, if we compute all the scales from the first one, we need to write the result and re-read it (or call .persist() from dask); otherwise the computation of the lower scales from the first scale is re-performed every time! This would favor slicing every scale instead of computing them from the largest.
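The two strategies could be compared with a toy dask sketch like the one below. It is only an illustration under assumptions: the two-scale pyramid, the `blank` helper, and the strided downsampling are hypothetical and do not reflect how SpatialData builds its multiscale DataTrees.

```python
import dask.array as da
import numpy as np

# Toy two-scale label pyramid; real data would be backed by on-disk chunks.
full = np.arange(64).reshape(8, 8) % 4
scale0 = da.from_array(full, chunks=(4, 4))
scale1 = scale0[::2, ::2]  # strided downsampling keeps label values intact

removed = [2]


def blank(arr, ids):
    # Lazily set the given label values to background (0); nothing is computed here.
    return da.where(da.isin(arr, ids), 0, arr)


# Option A: subset every scale independently.
a0, a1 = blank(scale0, removed), blank(scale1, removed)

# Option B: subset scale0 only, then derive the lower scale from it.
# persist() materialises b0 once, so accessing b1 does not redo the masking of scale0.
b0 = blank(scale0, removed).persist()
b1 = b0[::2, ::2]
```

Both options yield the same lower-scale labels in this toy case; which one is faster on real data is exactly what a benchmark would need to show.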

LucaMarconato (Member) commented

> How do we handle inplace True vs False? Returning a copy can easily mean doubling a 500 GB object.

No copy is made, because the heavy data (images and labels) are lazy. A copy only materializes on disk, when the result is written. So this should not be a problem.

LucaMarconato (Member) commented

> Generally, the goal should be to identify relevant subfunctions and push these upstream to SpatialData, some might already exist there and just need to be found (realistically by asking @LucaMarconato, there's quite a few functions only he really knows about), other might need to be written and pushed upstream. Ideally Squidpy then chains together these functions into something with good UX.

For table-based subsetting this is the most up-to-date code we have: scverse/spatialdata#894
