Skip to content

Add custom fingerprint support to from_generator #7533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

simonreise
Copy link

This PR adds dataset_id_suffix parameter to 'Dataset.from_generator' function.

Dataset.from_generator function passes all of its arguments to BuilderConfig.create_config_id, including generator function itself. BuilderConfig.create_config_id function tries to hash all the args, which can take a large amount of time or even cause MemoryError if the dataset processed in a generator function is large enough.

This PR allows user to pass a custom fingerprint (dataset_id_suffix) to be used as a suffix in a dataset name instead of the one generated by hashing the args.

This PR is a possible solution of #7513

@lhoestq
Copy link
Member

lhoestq commented Apr 24, 2025

This is great !

What do you think of passing config_id= directly to the builder instead of just the suffix ? This would be a power user argument though, or for internal use. And in from_generator the new argument can be fingerprint= as in Dataset.__init__()

The config_id can be defined using something like config_id = "default-fingerprint=" + fingerprint

I feel ike this could make the Dataset API more coherent if we avoid introducing a new argument while we can juste use fingerprint=

@simonreise simonreise force-pushed the user-fingerprint-for-generator branch from 0d4d22a to cafae09 Compare May 4, 2025 19:07
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more comments

@@ -141,6 +141,7 @@ def create_config_id(
self,
config_kwargs: dict,
custom_features: Optional[Features] = None,
fingerprint: Optional[str] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to modify create_config_id() imo, you can just use the user-defined config_id in the __init__

and in src/datasets/io/generator.py you can pass Generator(..., config_id=fingerpring)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I renamed fingerprint to config_id in Generator and DatasetBuilder. I removed the changes from create_config_id, but now I set config_id not in __init__, but in _create_builder_config, because we anyway have to stop this function from generating config_id from hash if we have custom config_id, and we also can use builder_config.name as default name in config_id = "default-fingerprint=" + fingerprint. I can move it to init, if needed.

@simonreise simonreise force-pushed the user-fingerprint-for-generator branch from cafae09 to 726e16d Compare May 8, 2025 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants