Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split up generate_data and add a mix_datasets top level API #443

Merged
merged 3 commits into from
Jan 16, 2025

Conversation

bbrowning
Copy link
Contributor

@bbrowning bbrowning commented Dec 10, 2024

This separates out instructlab.sdg.generate_data into separate functions, where generate_data just calls into these separate functions instead of containing all the logic in itself directly.

This enables us to expose a new top-level API, mix_datasets, to allow users to mix datasets. It also lays the groundwork for how we might split some of the pre-processing and post-processing out into separate modules or other repositories, as now there's a clean separation between the various concerns that were previously all intertwined within generate_data.

@mergify mergify bot added the testing Relates to testing label Dec 10, 2024
@bbrowning
Copy link
Contributor Author

@jwm4 Does something like this help with some of your needs? A top-level API to only do the preprocessing steps but not any data generation?

@bbrowning bbrowning force-pushed the separation-of-concerns branch from 193fc30 to a80a3f7 Compare December 10, 2024 23:26
@bbrowning bbrowning marked this pull request as draft December 11, 2024 14:13
@mergify mergify bot added the ci-failure label Dec 11, 2024
@bbrowning bbrowning force-pushed the separation-of-concerns branch from 146b25a to b17b08d Compare December 11, 2024 21:36
@mergify mergify bot removed the ci-failure label Dec 11, 2024
@bbrowning bbrowning changed the title Add a new instructlab.sdg.taxonomy_to_samples API Separate generate_data into multiple supported top-level APIs and CLIs Dec 11, 2024
@bbrowning bbrowning changed the title Separate generate_data into multiple supported top-level APIs and CLIs Split generate_data into multiple top-level APIs and CLIs Dec 11, 2024
@cdoern
Copy link
Contributor

cdoern commented Jan 3, 2025

This adds a simple CLI usable within SDG itself to do all taxonomy preprocessing via:
python -m instructlab.sdg.cli.taxonomy_to_samples --taxonomy-path /path/to/my/taxonomy/ --output-dir /path/to/my/output/samples --log-level DEBUG

Does this confuse the user story a bit of what we expect to be a "user facing" entry point to InstructLab? or is this CLI just meant to be a sort of dev environment for quick use? If its the later, then this makes sense.

However, I do think it would make sense to instead focus on adding this CLI functionality into an ilab data process command. That could be as simple now as adding the proper plumbing to call whatever whatever API is being invoked via the new CLI being introduced here.

With changes like this that expose a new user entry point we risk divergence and decreased functionality in ilab due to new usage patterns that the dev team might more frequently use rather than what everyday users will be running

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment related to the above

src/instructlab/sdg/cli/preprocess_taxonomy.py Outdated Show resolved Hide resolved
@mergify mergify bot added the ci-failure label Jan 6, 2025
@bbrowning bbrowning force-pushed the separation-of-concerns branch from 8703b9f to 7e40219 Compare January 6, 2025 19:42
@mergify mergify bot added ci-failure and removed ci-failure labels Jan 6, 2025
@bbrowning bbrowning changed the title Split generate_data into multiple top-level APIs and CLIs Split generate_data into multiple top-level APIs Jan 7, 2025
@bbrowning bbrowning changed the title Split generate_data into multiple top-level APIs Split up generate_data and add a mix_datasets top level API Jan 7, 2025
@bbrowning bbrowning force-pushed the separation-of-concerns branch from 1d8c077 to f38c79a Compare January 7, 2025 21:07
@mergify mergify bot added ci-failure and removed ci-failure labels Jan 7, 2025
@bbrowning bbrowning force-pushed the separation-of-concerns branch from f38c79a to f84902f Compare January 7, 2025 21:17
@mergify mergify bot removed the ci-failure label Jan 7, 2025
@bbrowning bbrowning marked this pull request as ready for review January 10, 2025 15:58
@bbrowning
Copy link
Contributor Author

I believe this is ready for larger review. Some particular things I'd like to call out for reviewers to think about:

The only other top-level API we expose here via __init__.py is mix_datasets. Does the Python API of mix_datasets look reasonable? This is one we'd want to support for advanced users that need to adjust how generated data gets mixed.

The old monolithic generate_data was split into preprocess_taxonomy, generate_taxonomy, generate_taxonomy_eval, postprocess_taxonomy, and mix_datasets. Are these a reasonable level of granularity, at least in a first pass of splitting things up? For backwards-compatibility we retained a generate_data method that just calls all of the above in turn to keep the same behavior as before. None of these except mix_datasets are exposed as top-level importable functions from instructlab-sdg. So, we can iterate on naming and further splitting as needed in future releases unless there's strong disagreement here.

We now write more intermediate files to disk between stages, including additional metadata (ie columns) in these jsonl files to keep track of things like whether a sample generated from the taxonomy was a knowledge, freeform, or grounded skill as we use that to differentiate which pipeline to run. Are the leaf_node_type and leaf_node_path column names reasonable to write into our sample files for this purpose?

Lastly, we now do all the preprocessing for all leaf nodes before moving on to generation. We do all generation for all leaf nodes before moving on to postprocessing. Previously, we operated on a single leaf node at a time end-to-end. In other words, previously we'd preprocess, generate, and postprocess leaf node A. Then we'd do the same for leaf node B. Now, we preprocess leaf nodes A & B, then generate for A & B, and finally postprocess A & B. This was required for decoupling things, and sets us up for some future scenarios where we'll need to entirely finish one stage of our processing before moving on to the next because some stages may use different models, such as subset selection with an embedding model. Any concerns about moving to this approach?

Obviously I'd love feedback for any/all parts of the PR, but just calling out the above as some places to spend a bit more focused time during a review.

@bbrowning bbrowning requested a review from a team January 10, 2025 16:10
Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm, great work here @bbrowning
I have minor comment, but we don't have to get it in with this PR.

@mergify mergify bot added the one-approval label Jan 15, 2025
@bbrowning
Copy link
Contributor Author

I like the suggestion @aakankshaduggal , so incorporating that directly here. Thanks!

Copy link
Contributor

mergify bot commented Jan 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 15, 2025
bbrowning and others added 2 commits January 15, 2025 13:58
This doesn't move things out into separate files yet, but it does
split the existing functionality of `generate_date` into multiple
discrete steps and changes `generate_date` to just call those steps.

This is a step towards cleaner separation between the steps and
creating top-level Python APIs for each discrete step for advanced
use-cases that don't just want an entire single step generation pipeline.

Signed-off-by: Ben Browning <[email protected]>
Instead of hardcoding this to always be 3, add a parameter with a default of 3 when converting our seed examples to the test output dataset.

Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the separation-of-concerns branch from 69dd254 to 6c8544e Compare January 15, 2025 19:11
@mergify mergify bot added documentation Improvements or additions to documentation ci-failure and removed needs-rebase labels Jan 15, 2025
This adds a new docs/examples/mix_datasets folders with a couple of
example recipes, two sample datasets, and an example_mixing.py Python
script to show how to mix datasets.

This also adds a test_examples.py file that actually runs out
examples, ensuring they work without error and generate the expected
mixed datasets.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning force-pushed the separation-of-concerns branch from 5e32b29 to 1f9394d Compare January 16, 2025 14:07
@mergify mergify bot removed the ci-failure label Jan 16, 2025
@bbrowning
Copy link
Contributor Author

Rebased this to fix a merge conflict, added a few examples of using the mixing API, along with a test to ensure those examples work as expected.

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving from a CLI point of view. We can iterate on the public APIs and when they should move to core.

src/instructlab/sdg/generate_data.py Show resolved Hide resolved
@mergify mergify bot merged commit 695c651 into instructlab:main Jan 16, 2025
22 checks passed
@mergify mergify bot removed the one-approval label Jan 16, 2025
@bbrowning bbrowning deleted the separation-of-concerns branch January 16, 2025 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants