-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Add custom fingerprint support to from_generator
#7533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add custom fingerprint support to from_generator
#7533
Conversation
This is great ! What do you think of passing The I feel ike this could make the Dataset API more coherent if we avoid introducing a new argument while we can juste use |
0d4d22a
to
cafae09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more comments
src/datasets/builder.py
Outdated
@@ -141,6 +141,7 @@ def create_config_id( | |||
self, | |||
config_kwargs: dict, | |||
custom_features: Optional[Features] = None, | |||
fingerprint: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to modify create_config_id() imo, you can just use the user-defined config_id in the __init__
and in src/datasets/io/generator.py you can pass Generator(..., config_id=fingerpring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I renamed fingerprint
to config_id
in Generator and DatasetBuilder. I removed the changes from create_config_id
, but now I set config_id not in __init__
, but in _create_builder_config
, because we anyway have to stop this function from generating config_id from hash if we have custom config_id, and we also can use builder_config.name
as default name in config_id = "default-fingerprint=" + fingerprint
. I can move it to init, if needed.
cafae09
to
726e16d
Compare
This PR adds
dataset_id_suffix
parameter to 'Dataset.from_generator' function.Dataset.from_generator
function passes all of its arguments toBuilderConfig.create_config_id
, including generator function itself.BuilderConfig.create_config_id
function tries to hash all the args, which can take a large amount of time or even cause MemoryError if the dataset processed in a generator function is large enough.This PR allows user to pass a custom fingerprint (
dataset_id_suffix
) to be used as a suffix in a dataset name instead of the one generated by hashing the args.This PR is a possible solution of #7513