-
Notifications
You must be signed in to change notification settings - Fork 68
feat: add tiny datasets for lightweight experiments #422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add setup_tiny_cifar10_dataset() function in datasets/image.py - Register TinyCIFAR10 in base_datasets with image_classification_collate - Add test case for TinyCIFAR10 in test_datamodule.py - Dataset contains <1,000 samples (600 train + ~200 val + 200 test) - Follows same pattern as existing CIFAR10 implementation Resolves #358
- Add comprehensive docstrings explaining 'img' to 'image' column rename - Clarify compatibility requirement with image_classification_collate function - Document expected output schema with column names and types - Explain this is NOT a breaking change but a necessary compatibility fix The column rename ensures CIFAR-10 datasets work seamlessly with Pruna's image_classification_collate function which expects 'image' column name.
src/pruna/data/datasets/image.py
Outdated
| train_ds = train_ds.rename_column("img", "image") | ||
| test_ds = test_ds.rename_column("img", "image") | ||
|
|
||
| tiny_train = train_ds.select(range(600)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we just getting a this specific smaller subset? Can't we generalise this approach across all datasets and create general logic for getting tiny versions? perhaps to be tackled in a seperate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this makes a lot of sense actually!
src/pruna/data/__init__.py
Outdated
| "image_classification_collate", | ||
| {"img_size": 32}, | ||
| ), | ||
| "TinyCIFAR10": (setup_tiny_cifar10_dataset, "image_classification_collate", {"img_size": 32}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see us re-using somethign like get_tiny(setup_cifar10_dataset) or something.
src/pruna/data/__init__.py
Outdated
| "TinyCIFAR10": (partial(setup_cifar10_dataset, fraction=0.1), "image_classification_collate", {"img_size": 32}), | ||
| "TinyMNIST": (partial(setup_mnist_dataset, fraction=0.1), "image_classification_collate", {"img_size": 28}), | ||
| "TinyImageNet": (partial(setup_imagenet_dataset, fraction=0.1), "image_classification_collate", {"img_size": 224}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Just not 100% sure if a fraction for each of these datasets is small enough, and it is clear how many samples we get now? We could also allow a range/number or something. Not sure if that would be better, but otherwise we can keep it like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely see your point here, I only did fractions since we have limit_datasets in PrunaDataModule that allows us to give a number to limit the dataset. If you still think also having a number rather than a fraction here makes more sense I am happy to change it, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think setting it to fixed numbers is nicer as we have more control and awareness surrounding the number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I have also added this feature 🧡🧡
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, one minal remark but feel free to merge after.
Description
This PR introduces the Tiny CIFAR dataset.
The core implementation was contributed by kris70lesgo in PR #368 and this branch brings their work into the main PrunaAI repository. I made minor adjustments (styling, tests) to align with our current codebase and standards.
Full credit for the original implementation goes to @kris70lesgo 💜💜💜 Thanks a lot for your amazing contribution!
Related Issue
Fixes #(issue number)
Type of Change
How Has This Been Tested?
Checklist
Additional Notes