Bug: HuggingFace Dataset Automatically Expand Dictionary Keys #325

Open
richardzhuang0412 opened this issue Jan 8, 2025 · 0 comments

While creating demos that use curator to fine-tune function-calling models, I found that a Hugging Face Dataset automatically normalizes dictionary entries. Specifically, it fills in missing keys with None so that every entry in the column has the same set of keys, I guess so that the whole column shares one schema.

Function-Calling Example:

from datasets import Dataset

ls = [{'name': 'land_drone', 'arguments': {'location': 'current'}},
 {'name': 'land_drone', 'arguments': {'location': 'home_base'}},
 {'name': 'land_drone', 'arguments': {'location': 'custom'}},
 {'name': 'control_camera', 'arguments': {'mode': 'photo'}},
 {'name': 'control_camera', 'arguments': {'mode': 'video'}},
 {'name': 'control_camera', 'arguments': {'mode': 'panorama'}},
 {'name': 'set_drone_lighting', 'arguments': {'mode': 'on'}},
 {'name': 'set_drone_lighting', 'arguments': {'mode': 'off'}},
 {'name': 'set_drone_lighting', 'arguments': {'mode': 'blink'}},
 {'name': 'set_drone_lighting', 'arguments': {'mode': 'sos'}},
 {'name': 'return_to_home', 'arguments': {}},
 {'name': 'set_battery_saver_mode', 'arguments': {'status': 'on'}},
 {'name': 'set_battery_saver_mode', 'arguments': {'status': 'off'}},
 {'name': 'set_obstacle_avoidance', 'arguments': {'mode': 'on'}},
 {'name': 'set_obstacle_avoidance', 'arguments': {'mode': 'off'}},
 {'name': 'set_follow_me_mode', 'arguments': {'status': 'on'}},
 {'name': 'set_follow_me_mode', 'arguments': {'status': 'off'}},
 {'name': 'calibrate_sensors', 'arguments': {}},
 {'name': 'set_autopilot', 'arguments': {'status': 'on'}},
 {'name': 'set_autopilot', 'arguments': {'status': 'off'}},
 {'name': 'configure_led_display', 'arguments': {'pattern': 'solid'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'solid', 'color': 'red'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'solid', 'color': 'blue'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'solid', 'color': 'green'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'solid', 'color': 'yellow'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'solid', 'color': 'white'}},
 {'name': 'configure_led_display', 'arguments': {'pattern': 'blink'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'blink', 'color': 'red'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'blink', 'color': 'blue'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'blink', 'color': 'green'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'blink', 'color': 'yellow'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'blink', 'color': 'white'}},
 {'name': 'configure_led_display', 'arguments': {'pattern': 'pulse'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'pulse', 'color': 'red'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'pulse', 'color': 'blue'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'pulse', 'color': 'green'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'pulse', 'color': 'yellow'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'pulse', 'color': 'white'}},
 {'name': 'configure_led_display', 'arguments': {'pattern': 'rainbow'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'rainbow', 'color': 'red'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'rainbow', 'color': 'blue'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'rainbow', 'color': 'green'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'rainbow', 'color': 'yellow'}},
 {'name': 'configure_led_display',
  'arguments': {'pattern': 'rainbow', 'color': 'white'}},
 {'name': 'reject_request', 'arguments': {}}
 ]

ls_str = [str(item) for item in ls]

ds = Dataset.from_dict({"function_call": ls})
ds_str = Dataset.from_dict({"function_call": ls_str})

print(ds[0])
print(ds_str[0])

Output:

{'function_call': {'arguments': {'color': None, 'location': 'current', 'mode': None, 'pattern': None, 'status': None}, 'name': 'land_drone'}}
{'function_call': "{'name': 'land_drone', 'arguments': {'location': 'current'}}"}
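
One way to undo the expansion after reading a row is to drop every key whose value is None. The snippet below is a minimal sketch of that idea, not something curator or `datasets` provides: `drop_none_values` is a hypothetical helper, and it assumes no function ever takes None as a real argument value, since an explicit None cannot be told apart from an imputed one.

```python
def drop_none_values(value):
    """Recursively remove dictionary keys whose value is None.

    Hypothetical helper: assumes None never appears as a genuine value.
    """
    if isinstance(value, dict):
        return {k: drop_none_values(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none_values(v) for v in value]
    return value


# The first row exactly as printed in the output above.
row = {
    "function_call": {
        "arguments": {
            "color": None,
            "location": "current",
            "mode": None,
            "pattern": None,
            "status": None,
        },
        "name": "land_drone",
    }
}

cleaned = drop_none_values(row)
print(cleaned)
# {'function_call': {'arguments': {'location': 'current'}, 'name': 'land_drone'}}
```

A call with no arguments, such as return_to_home, comes back as an empty arguments dictionary under this approach, which matches the original input.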

Since curator works with Dataset objects throughout, this could be a problem in function-calling settings: the arguments dictionary of each call gets expanded with argument keys from all the other, unrelated functions, which can corrupt the fine-tuning data.
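
A possible workaround, sketched below under my own assumptions, is to store only the variable-shaped part (the arguments) as a JSON string and parse it back when the row is consumed. `encode_call` and `decode_call` are hypothetical helper names, not part of curator or `datasets`. Compared with the `str(item)` variant in the example above, `json.dumps` output can be parsed back with `json.loads`.

```python
import json


def encode_call(call):
    """Keep 'name' as-is and serialize 'arguments' to a JSON string."""
    return {
        "name": call["name"],
        "arguments": json.dumps(call["arguments"], sort_keys=True),
    }


def decode_call(call):
    """Inverse of encode_call: parse the JSON string back into a dict."""
    return {"name": call["name"], "arguments": json.loads(call["arguments"])}


# A few entries taken from the list in the example above.
calls = [
    {"name": "land_drone", "arguments": {"location": "current"}},
    {"name": "return_to_home", "arguments": {}},
    {"name": "configure_led_display", "arguments": {"pattern": "solid", "color": "red"}},
]

encoded = [encode_call(c) for c in calls]
decoded = [decode_call(c) for c in encoded]

print(encoded[0])
print(decoded == calls)
```

With this encoding every entry has the same two string-valued keys, so passing `encoded` to `Dataset.from_dict({"function_call": encoded})` should leave nothing for the schema to expand. I have not verified this inside a curator pipeline.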

@richardzhuang0412 changed the title from "Bug: HuggingFace Dataset Automatically" to "Bug: HuggingFace Dataset Automatically Expand Dictionary Keys" on Jan 8, 2025