@begumcig
Member

Description

This PR introduces the Tiny CIFAR dataset.

The core implementation was contributed by kris70lesgo in PR #368 and this branch brings their work into the main PrunaAI repository. I made minor adjustments (styling, tests) to align with our current codebase and standards.

Full credit for the original implementation goes to @kris70lesgo 💜💜💜 Thanks a lot for your amazing contribution!

Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

kris70lesgo and others added 4 commits October 29, 2025 18:07
- Add setup_tiny_cifar10_dataset() function in datasets/image.py
- Register TinyCIFAR10 in base_datasets with image_classification_collate
- Add test case for TinyCIFAR10 in test_datamodule.py
- Dataset contains <1,000 samples (600 train + ~200 val + 200 test)
- Follows same pattern as existing CIFAR10 implementation

Resolves #358
- Add comprehensive docstrings explaining 'img' to 'image' column rename
- Clarify compatibility requirement with image_classification_collate function
- Document expected output schema with column names and types
- Explain this is NOT a breaking change but a necessary compatibility fix

The column rename ensures CIFAR-10 datasets work seamlessly with Pruna's
image_classification_collate function which expects 'image' column name.
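
To illustrate why the rename matters, here is a minimal sketch that uses plain dictionaries in place of a Hugging Face dataset. `rename_column` and `toy_image_collate` below are hypothetical stand-ins written for this example; they are not Pruna's actual implementations, and the real collate function presumably does more (resizing, tensor conversion).

```python
# Hypothetical stand-ins: plain dicts instead of a Hugging Face dataset,
# and a simplified collate function. Pruna's real code may differ.


def rename_column(records, old, new):
    """Return copies of the records with key `old` renamed to `new`."""
    return [
        {(new if key == old else key): value for key, value in record.items()}
        for record in records
    ]


def toy_image_collate(batch):
    """Simplified collate that reads the 'image' key, as described above."""
    images = [sample["image"] for sample in batch]
    labels = [sample["label"] for sample in batch]
    return images, labels


# CIFAR-10 style records that store the picture under 'img'.
raw = [{"img": f"picture_{i}", "label": i % 10} for i in range(5)]

# Before the rename, the collate function cannot find the 'image' key.
try:
    toy_image_collate(raw)
    failed_before_rename = False
except KeyError:
    failed_before_rename = True

# After the rename, the same records collate without errors.
renamed = rename_column(raw, "img", "image")
images, labels = toy_image_collate(renamed)
```

The rename only changes the column name; the samples and labels themselves are untouched.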
@begumcig changed the title from "Add TinyCIFAR10 dataset for lightweight experiments" to "feat: add TinyCIFAR10 dataset for lightweight experiments" on Oct 30, 2025
train_ds = train_ds.rename_column("img", "image")
test_ds = test_ds.rename_column("img", "image")

tiny_train = train_ds.select(range(600))

Why are we taking just this specific smaller subset? Can't we generalise this approach across all datasets and create general logic for getting tiny versions? Perhaps to be tackled in a separate PR?

Member Author

Yes, this makes a lot of sense actually!

"image_classification_collate",
{"img_size": 32},
),
"TinyCIFAR10": (setup_tiny_cifar10_dataset, "image_classification_collate", {"img_size": 32}),

I can see us reusing something like get_tiny(setup_cifar10_dataset).
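
A rough sketch of what such a wrapper could look like. This is only an illustration: `get_tiny` is the name floated in the comment above, not an existing Pruna function, and the sketch assumes each setup function takes no arguments and returns a `(train, val, test)` tuple of sliceable sequences. The default sizes mirror the 600/200/200 split from the commit message, and `fake_setup_cifar10` is a made-up stand-in so the example runs without downloading anything.

```python
from typing import Callable, Sequence, Tuple

Splits = Tuple[Sequence, Sequence, Sequence]


def get_tiny(
    setup_fn: Callable[[], Splits],
    n_train: int = 600,
    n_val: int = 200,
    n_test: int = 200,
) -> Callable[[], Splits]:
    """Wrap a dataset setup function so it returns truncated splits.

    Hypothetical helper: assumes `setup_fn` returns (train, val, test).
    """

    def tiny_setup() -> Splits:
        train, val, test = setup_fn()
        # Plain slicing works for lists; a Hugging Face dataset would use
        # `.select(range(n))` instead, as in the diff above.
        return train[:n_train], val[:n_val], test[:n_test]

    return tiny_setup


def fake_setup_cifar10() -> Splits:
    """Made-up stand-in for a real setup function (no download needed)."""
    return list(range(5000)), list(range(500)), list(range(1000))


setup_tiny_cifar10 = get_tiny(fake_setup_cifar10)
tiny_train, tiny_val, tiny_test = setup_tiny_cifar10()
```

Because the wrapper only depends on the return shape, the same call would work for any setup function that follows that convention.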

Comment on lines 81 to 83
"TinyCIFAR10": (partial(setup_cifar10_dataset, fraction=0.1), "image_classification_collate", {"img_size": 32}),
"TinyMNIST": (partial(setup_mnist_dataset, fraction=0.1), "image_classification_collate", {"img_size": 28}),
"TinyImageNet": (partial(setup_imagenet_dataset, fraction=0.1), "image_classification_collate", {"img_size": 224}),

Awesome. I'm just not 100% sure whether a fraction of each of these datasets is small enough, and is it clear how many samples we get now? We could also allow a range or a number. Not sure if that would be better; otherwise we can keep it like this.

Member Author

I completely see your point here. I only used fractions since we have limit_datasets in PrunaDataModule, which allows us to give a number to limit the dataset. If you still think having a number rather than a fraction here makes more sense, I am happy to change it. What do you think?

I think setting it to fixed numbers is nicer, as it gives us more control over, and awareness of, the number of samples.

Member Author

You are right, I have also added this feature 🧡🧡
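
For readers following along, one way to support both a fraction and a fixed count is sketched below. This is not the code that was merged: the `fraction` keyword appears in the diff above, but `n_samples`, `subset_size`, and `fake_setup` are hypothetical names chosen for this example, and the total of 50,000 is just an illustrative dataset size.

```python
from functools import partial
from typing import List, Optional


def subset_size(
    total: int,
    fraction: Optional[float] = None,
    n_samples: Optional[int] = None,
) -> int:
    """Resolve how many samples to keep from a split of size `total`.

    Exactly one of `fraction` or `n_samples` must be given.
    """
    if (fraction is None) == (n_samples is None):
        raise ValueError("Pass exactly one of `fraction` or `n_samples`.")
    if fraction is not None:
        if not 0 < fraction <= 1:
            raise ValueError("`fraction` must be in the interval (0, 1].")
        return max(1, round(total * fraction))
    if n_samples <= 0:
        raise ValueError("`n_samples` must be positive.")
    # Never ask for more samples than the split contains.
    return min(n_samples, total)


def fake_setup(
    total: int,
    fraction: Optional[float] = None,
    n_samples: Optional[int] = None,
) -> List[int]:
    """Made-up setup function that returns a truncated list of indices."""
    size = subset_size(total, fraction=fraction, n_samples=n_samples)
    return list(range(total))[:size]


# Registry-style entries built with functools.partial, as in the diff above.
by_fraction = partial(fake_setup, 50_000, fraction=0.1)
by_count = partial(fake_setup, 50_000, n_samples=600)
```

With a fraction, the resulting size depends on the split it is applied to (5,000 here); with a fixed count it is the same number wherever the entry is registered, which is the control the review asks for.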

Member

@davidberenstein1957 davidberenstein1957 left a comment

Looks good, one minor remark, but feel free to merge after.

@begumcig changed the title from "feat: add TinyCIFAR10 dataset for lightweight experiments" to "feat: add tiny datasets for lightweight experiments" on Oct 31, 2025

4 participants