feat: add tiny datasets for lightweight experiments #422

begumcig · 2025-10-29T18:19:50Z

Description

This PR introduces the Tiny CIFAR dataset.

The core implementation was contributed by kris70lesgo in PR #368 and this branch brings their work into the main PrunaAI repository. I made minor adjustments (styling, tests) to align with our current codebase and standards.

Full credit for the original implementation goes to @kris70lesgo 💜💜💜 Thanks a lot for your amazing contribution!

Related Issue

Fixes #(issue number)

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Additional Notes

- Add setup_tiny_cifar10_dataset() function in datasets/image.py
- Register TinyCIFAR10 in base_datasets with image_classification_collate
- Add test case for TinyCIFAR10 in test_datamodule.py
- Dataset contains <1,000 samples (600 train + ~200 val + 200 test)
- Follows same pattern as existing CIFAR10 implementation

Resolves #358

- Add comprehensive docstrings explaining 'img' to 'image' column rename
- Clarify compatibility requirement with image_classification_collate function
- Document expected output schema with column names and types
- Explain this is NOT a breaking change but a necessary compatibility fix

The column rename ensures CIFAR-10 datasets work seamlessly with Pruna's image_classification_collate function which expects 'image' column name.
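The rename-then-subset step described in these notes can be illustrated with a small, library-free sketch. It is not the PR's code: plain lists of dicts stand in for the dataset objects, and `rename_column` and `collate_images` are hypothetical helpers written only for this example. The one detail taken from the PR is that the collate step expects an 'image' key while the raw data uses 'img'.

```python
from typing import Any

Row = dict[str, Any]


def rename_column(rows: list[Row], old: str, new: str) -> list[Row]:
    """Return copies of the rows with the key `old` renamed to `new`."""
    return [{(new if key == old else key): value for key, value in row.items()} for row in rows]


def collate_images(batch: list[Row]) -> tuple[list[Any], list[int]]:
    """Hypothetical stand-in for a collate function that requires an 'image' key."""
    return [row["image"] for row in batch], [row["label"] for row in batch]


# Fake CIFAR-10-like rows: the raw data uses 'img', not 'image'.
raw_train = [{"img": f"pixels-{i}", "label": i % 10} for i in range(1000)]

train = rename_column(raw_train, "img", "image")
tiny_train = train[:600]  # mirrors select(range(600)) in the diff

images, labels = collate_images(tiny_train[:4])
```

Without the rename, the `row["image"]` lookup in the stand-in collate function would raise a `KeyError`, which is the kind of incompatibility the notes describe.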

davidberenstein1957 · 2025-10-30T12:14:24Z

src/pruna/data/datasets/image.py

+    train_ds = train_ds.rename_column("img", "image")
+    test_ds = test_ds.rename_column("img", "image")
+
+    tiny_train = train_ds.select(range(600))


Why are we taking just this specific smaller subset? Can't we generalise this approach across all datasets and create general logic for getting tiny versions? Perhaps to be tackled in a separate PR?

Yes this makes a lot of sense actually!

davidberenstein1957 · 2025-10-30T12:15:17Z

src/pruna/data/__init__.py

        "image_classification_collate",
        {"img_size": 32},
    ),
+    "TinyCIFAR10": (setup_tiny_cifar10_dataset, "image_classification_collate", {"img_size": 32}),


I can see us re-using something like get_tiny(setup_cifar10_dataset).
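One way to read the `get_tiny(...)` suggestion is as a wrapper that turns any setup function into a "tiny" variant. The sketch below is an assumption, not code from the PR: it assumes a setup function returns a `(train, val, test)` tuple of sliceable splits, and `setup_fake_dataset` and the `max_samples` parameter are hypothetical names used only for illustration.

```python
from functools import wraps
from typing import Callable, Sequence

# Assumption: a setup function returns (train, val, test) splits.
Splits = tuple[Sequence, Sequence, Sequence]


def get_tiny(setup_fn: Callable[..., Splits], max_samples: int = 200) -> Callable[..., Splits]:
    """Wrap a setup function so every split is truncated to at most `max_samples` items."""

    @wraps(setup_fn)
    def tiny_setup(*args, **kwargs) -> Splits:
        train, val, test = setup_fn(*args, **kwargs)
        return train[:max_samples], val[:max_samples], test[:max_samples]

    return tiny_setup


def setup_fake_dataset() -> Splits:
    """Hypothetical stand-in for a real setup function such as setup_cifar10_dataset."""
    return list(range(5000)), list(range(500)), list(range(1000))


setup_tiny_fake = get_tiny(setup_fake_dataset, max_samples=200)
tiny_train, tiny_val, tiny_test = setup_tiny_fake()
```

A wrapper like this keeps the subsetting logic in one place, so a tiny variant of a new dataset would need only a registry entry rather than a dedicated setup function.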

davidberenstein1957 · 2025-10-30T14:32:21Z

src/pruna/data/__init__.py

+    "TinyCIFAR10": (partial(setup_cifar10_dataset, fraction=0.1), "image_classification_collate", {"img_size": 32}),
+    "TinyMNIST": (partial(setup_mnist_dataset, fraction=0.1), "image_classification_collate", {"img_size": 28}),
+    "TinyImageNet": (partial(setup_imagenet_dataset, fraction=0.1), "image_classification_collate", {"img_size": 224}),


Awesome. I'm just not 100% sure whether a fraction is small enough for each of these datasets, and is it clear how many samples we get now? We could also allow a range or a fixed number. Not sure if that would be better; otherwise we can keep it like this.

I completely see your point here. I only used fractions because we already have limit_datasets in PrunaDataModule, which lets us pass a number to limit the dataset. If you still think having a number rather than a fraction makes more sense here, I am happy to change it. What do you think?

I think setting it to fixed numbers is nicer as we have more control and awareness surrounding the number.

You are right, I have also added this feature 🧡🧡
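To make the fraction-versus-fixed-number trade-off concrete, here is a minimal sketch of class-stratified sampling with an exact sample count. It is an illustration under stated assumptions and not the implementation added in this PR: `stratified_sample` is a hypothetical helper, labels are given as a plain list, and per-class quotas use proportional allocation with the remainder handed to the classes with the largest fractional share.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


def stratified_sample(labels: Sequence[Hashable], sample_size: int, seed: int = 42) -> list[int]:
    """Return sorted indices of exactly `sample_size` items, keeping class proportions."""
    total = len(labels)
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and len(labels)")

    # Group indices by class and shuffle each group reproducibly.
    rng = random.Random(seed)
    by_class: dict[Hashable, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for indices in by_class.values():
        rng.shuffle(indices)

    # Proportional quota per class (rounded down), then distribute the remainder
    # to the classes with the largest leftover fractional share.
    quotas = {c: sample_size * len(idx) // total for c, idx in by_class.items()}
    remainder = sample_size - sum(quotas.values())
    by_leftover = sorted(by_class, key=lambda c: (sample_size * len(by_class[c])) % total, reverse=True)
    for c in by_leftover[:remainder]:
        quotas[c] += 1

    return sorted(i for c, idx in by_class.items() for i in idx[: quotas[c]])


# Ten balanced classes, 1000 items: a fixed request of 200 gives 20 per class.
labels = [i % 10 for i in range(1000)]
tiny_indices = stratified_sample(labels, sample_size=200)
```

With a fixed count, the size of the result is known up front regardless of the source dataset's size, which is the "control and awareness" argument made in the review thread.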

davidberenstein1957

Looks good, one minor remark but feel free to merge after.

kris70lesgo and others added 4 commits October 29, 2025 18:07

fixed errors

afdd4c6

refactor: remove note section from the docstrings

213dc34

begumcig changed the title ~~Add TinyCIFAR10 dataset for lightweight experiments~~ feat: add TinyCIFAR10 dataset for lightweight experiments Oct 30, 2025

davidberenstein1957 reviewed Oct 30, 2025

begumcig added 2 commits October 30, 2025 13:38

feat: tiny dataset refactor for all image classification datasets

1ebdd79

test: add tiny dataset sets

3ea62c5

begumcig requested a review from davidberenstein1957 October 30, 2025 13:42

davidberenstein1957 reviewed Oct 30, 2025

feat: add stratifying by sample size for image classification datasets

7100732

begumcig force-pushed the pr2 branch from da2eb1b to 7100732 Compare October 31, 2025 10:59

begumcig changed the title ~~feat: add TinyCIFAR10 dataset for lightweight experiments~~ feat: add tiny datasets for lightweight experiments Oct 31, 2025

4 participants