Split up `generate_data` and add a `mix_datasets` top level API #443

bbrowning · 2024-12-10T21:26:22Z

This separates out instructlab.sdg.generate_data into separate functions, where generate_data just calls into these separate functions instead of containing all the logic in itself directly.

This enables us to expose a new top-level API, mix_datasets, to allow users to mix datasets. It also lays the groundwork for how we might split some of the pre-processing and post-processing out into separate modules or other repositories, as now there's a clean separation between the various concerns that were previously all intertwined within generate_data.

bbrowning · 2024-12-10T21:33:07Z

@jwm4 Does something like this help with some of your needs? A top-level API to only do the preprocessing steps but not any data generation?

cdoern · 2025-01-03T02:49:33Z

This adds a simple CLI usable within SDG itself to do all taxonomy preprocessing via:
python -m instructlab.sdg.cli.taxonomy_to_samples --taxonomy-path /path/to/my/taxonomy/ --output-dir /path/to/my/output/samples --log-level DEBUG

Does this confuse the user story a bit of what we expect to be a "user facing" entry point to InstructLab? or is this CLI just meant to be a sort of dev environment for quick use? If its the later, then this makes sense.

However, I do think it would make sense to instead focus on adding this CLI functionality into an ilab data process command. That could be as simple now as adding the proper plumbing to call whatever whatever API is being invoked via the new CLI being introduced here.

With changes like this that expose a new user entry point we risk divergence and decreased functionality in ilab due to new usage patterns that the dev team might more frequently use rather than what everyday users will be running

cdoern

one comment related to the above

src/instructlab/sdg/cli/preprocess_taxonomy.py

bbrowning · 2025-01-10T16:09:55Z

I believe this is ready for larger review. Some particular things I'd like to call out for reviewers to think about:

The only other top-level API we expose here via __init__.py is mix_datasets. Does the Python API of mix_datasets look reasonable? This is one we'd want to support for advanced users that need to adjust how generated data gets mixed.

The old monolithic generate_data was split into preprocess_taxonomy, generate_taxonomy, generate_taxonomy_eval, postprocess_taxonomy, and mix_datasets. Are these a reasonable level of granularity, at least in a first pass of splitting things up? For backwards-compatibility we retained a generate_data method that just calls all of the above in turn to keep the same behavior as before. None of these except mix_datasets are exposed as top-level importable functions from instructlab-sdg. So, we can iterate on naming and further splitting as needed in future releases unless there's strong disagreement here.

We now write more intermediate files to disk between stages, including additional metadata (ie columns) in these jsonl files to keep track of things like whether a sample generated from the taxonomy was a knowledge, freeform, or grounded skill as we use that to differentiate which pipeline to run. Are the leaf_node_type and leaf_node_path column names reasonable to write into our sample files for this purpose?

Lastly, we now do all the preprocessing for all leaf nodes before moving on to generation. We do all generation for all leaf nodes before moving on to postprocessing. Previously, we operated on a single leaf node at a time end-to-end. In other words, previously we'd preprocess, generate, and postprocess leaf node A. Then we'd do the same for leaf node B. Now, we preprocess leaf nodes A & B, then generate for A & B, and finally postprocess A & B. This was required for decoupling things, and sets us up for some future scenarios where we'll need to entirely finish one stage of our processing before moving on to the next because some stages may use different models, such as subset selection with an embedding model. Any concerns about moving to this approach?

Obviously I'd love feedback for any/all parts of the PR, but just calling out the above as some places to spend a bit more focused time during a review.

src/instructlab/sdg/generate_data.py

aakankshaduggal

Overall lgtm, great work here @bbrowning
I have minor comment, but we don't have to get it in with this PR.

bbrowning · 2025-01-15T18:16:57Z

I like the suggestion @aakankshaduggal , so incorporating that directly here. Thanks!

mergify · 2025-01-15T18:22:51Z

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This doesn't move things out into separate files yet, but it does split the existing functionality of `generate_date` into multiple discrete steps and changes `generate_date` to just call those steps. This is a step towards cleaner separation between the steps and creating top-level Python APIs for each discrete step for advanced use-cases that don't just want an entire single step generation pipeline. Signed-off-by: Ben Browning <[email protected]>

Instead of hardcoding this to always be 3, add a parameter with a default of 3 when converting our seed examples to the test output dataset. Co-authored-by: Aakanksha Duggal <[email protected]> Signed-off-by: Ben Browning <[email protected]>

This adds a new docs/examples/mix_datasets folders with a couple of example recipes, two sample datasets, and an example_mixing.py Python script to show how to mix datasets. This also adds a test_examples.py file that actually runs out examples, ensuring they work without error and generate the expected mixed datasets. Signed-off-by: Ben Browning <[email protected]>

bbrowning · 2025-01-16T16:02:06Z

Rebased this to fix a merge conflict, added a few examples of using the mixing API, along with a test to ensure those examples work as expected.

cdoern

approving from a CLI point of view. We can iterate on the public APIs and when they should move to core.

src/instructlab/sdg/generate_data.py

mergify bot added the testing Relates to testing label Dec 10, 2024

bbrowning force-pushed the separation-of-concerns branch from 193fc30 to a80a3f7 Compare December 10, 2024 23:26

bbrowning marked this pull request as draft December 11, 2024 14:13

bbrowning mentioned this pull request Dec 11, 2024

[EPIC] Refactor Generate Functionality into a Standalone Python API #412

Open

4 tasks

mergify bot added the ci-failure label Dec 11, 2024

bbrowning force-pushed the separation-of-concerns branch from 146b25a to b17b08d Compare December 11, 2024 21:36

mergify bot removed the ci-failure label Dec 11, 2024

bbrowning changed the title ~~Add a new instructlab.sdg.taxonomy_to_samples API~~ Separate generate_data into multiple supported top-level APIs and CLIs Dec 11, 2024

bbrowning changed the title ~~Separate generate_data into multiple supported top-level APIs and CLIs~~ Split generate_data into multiple top-level APIs and CLIs Dec 11, 2024

cdoern reviewed Jan 3, 2025

View reviewed changes

src/instructlab/sdg/cli/preprocess_taxonomy.py Outdated Show resolved Hide resolved

mergify bot added the ci-failure label Jan 6, 2025

bbrowning force-pushed the separation-of-concerns branch from 8703b9f to 7e40219 Compare January 6, 2025 19:42

mergify bot added ci-failure and removed ci-failure labels Jan 6, 2025

bbrowning changed the title ~~Split generate_data into multiple top-level APIs and CLIs~~ Split generate_data into multiple top-level APIs Jan 7, 2025

bbrowning changed the title ~~Split generate_data into multiple top-level APIs~~ Split up generate_data and add a mix_datasets top level API Jan 7, 2025

bbrowning force-pushed the separation-of-concerns branch from 1d8c077 to f38c79a Compare January 7, 2025 21:07

mergify bot added ci-failure and removed ci-failure labels Jan 7, 2025

bbrowning force-pushed the separation-of-concerns branch from f38c79a to f84902f Compare January 7, 2025 21:17

mergify bot removed the ci-failure label Jan 7, 2025

bbrowning marked this pull request as ready for review January 10, 2025 15:58

bbrowning requested a review from a team January 10, 2025 16:10

aakankshaduggal reviewed Jan 15, 2025

View reviewed changes

src/instructlab/sdg/generate_data.py Outdated Show resolved Hide resolved

aakankshaduggal reviewed Jan 15, 2025

View reviewed changes

src/instructlab/sdg/generate_data.py Outdated Show resolved Hide resolved

aakankshaduggal approved these changes Jan 15, 2025

View reviewed changes

mergify bot added the one-approval label Jan 15, 2025

mergify bot added the needs-rebase label Jan 15, 2025

bbrowning and others added 2 commits January 15, 2025 13:58

bbrowning force-pushed the separation-of-concerns branch from 69dd254 to 6c8544e Compare January 15, 2025 19:11

mergify bot added documentation Improvements or additions to documentation ci-failure and removed needs-rebase labels Jan 15, 2025

bbrowning force-pushed the separation-of-concerns branch from 5e32b29 to 1f9394d Compare January 16, 2025 14:07

mergify bot removed the ci-failure label Jan 16, 2025

cdoern approved these changes Jan 16, 2025

View reviewed changes

src/instructlab/sdg/generate_data.py Show resolved Hide resolved

mergify bot merged commit 695c651 into instructlab:main Jan 16, 2025
22 checks passed

mergify bot removed the one-approval label Jan 16, 2025

bbrowning deleted the separation-of-concerns branch January 16, 2025 20:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split up `generate_data` and add a `mix_datasets` top level API #443

Split up `generate_data` and add a `mix_datasets` top level API #443

bbrowning commented Dec 10, 2024 •

edited

Loading

bbrowning commented Dec 10, 2024

cdoern commented Jan 3, 2025 •

edited

Loading

cdoern left a comment

bbrowning commented Jan 10, 2025

aakankshaduggal left a comment

bbrowning commented Jan 15, 2025

mergify bot commented Jan 15, 2025

bbrowning commented Jan 16, 2025

cdoern left a comment

Split up generate_data and add a mix_datasets top level API #443

Split up generate_data and add a mix_datasets top level API #443

Conversation

bbrowning commented Dec 10, 2024 • edited Loading

bbrowning commented Dec 10, 2024

cdoern commented Jan 3, 2025 • edited Loading

cdoern left a comment

Choose a reason for hiding this comment

bbrowning commented Jan 10, 2025

aakankshaduggal left a comment

Choose a reason for hiding this comment

bbrowning commented Jan 15, 2025

mergify bot commented Jan 15, 2025

bbrowning commented Jan 16, 2025

cdoern left a comment

Choose a reason for hiding this comment

Split up `generate_data` and add a `mix_datasets` top level API #443

Split up `generate_data` and add a `mix_datasets` top level API #443

bbrowning commented Dec 10, 2024 •

edited

Loading

cdoern commented Jan 3, 2025 •

edited

Loading