Split up `generate_data` and add a `mix_datasets` top-level API #443
Conversation
@jwm4 Does something like this help with some of your needs? A top-level API to only do the preprocessing steps but not any data generation?
Force-pushed from 193fc30 to a80a3f7
Force-pushed from 146b25a to b17b08d
(Title edited: `instructlab.sdg.taxonomy_to_samples` API → `generate_data` into multiple supported top-level APIs and CLIs → `generate_data` into multiple top-level APIs and CLIs)
Does this confuse the user story a bit regarding what we expect to be a "user-facing" entry point to InstructLab? Or is this CLI just meant to be a sort of dev environment for quick use? If it's the latter, then this makes sense. However, I do think it would make sense to instead focus on adding this CLI functionality into an
With changes like this that expose a new user entry point, we risk divergence and decreased functionality in
one comment related to the above
Force-pushed from 8703b9f to 7e40219
(Title edited: `generate_data` into multiple top-level APIs and CLIs → `generate_data` into multiple top-level APIs → Split up `generate_data` and add a `mix_datasets` top-level API)
Force-pushed from 1d8c077 to f38c79a
Force-pushed from f38c79a to f84902f
I believe this is ready for larger review. Some particular things I'd like to call out for reviewers to think about:

The only other top-level API we expose here via

The old monolithic

We now write more intermediate files to disk between stages, including additional metadata (i.e., columns) in these JSONL files to keep track of things like whether a sample generated from the taxonomy was a knowledge, freeform, or grounded skill, as we use that to differentiate which pipeline to run. Are the

Lastly, we now do all the preprocessing for all leaf nodes before moving on to generation, and all the generation for all leaf nodes before moving on to postprocessing. Previously, we operated on a single leaf node at a time, end-to-end. In other words, previously we'd preprocess, generate, and postprocess leaf node A, then do the same for leaf node B. Now, we preprocess leaf nodes A & B, then generate for A & B, and finally postprocess A & B. This was required for decoupling things, and it sets us up for some future scenarios where we'll need to entirely finish one stage of our processing before moving on to the next, because some stages may use different models, such as subset selection with an embedding model. Any concerns about moving to this approach?

Obviously I'd love feedback on any/all parts of the PR, but I'm just calling out the above as some places to spend a bit more focused time during a review.
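The per-sample routing described above could look roughly like the sketch below. Note that the `leaf_node_type` column name, its values, and the pipeline names are assumptions for illustration only, not the actual SDG schema:

```python
import json

# Hypothetical intermediate JSONL records; the "leaf_node_type" column and
# its values are illustrative stand-ins, not the real on-disk schema.
records = [
    {"instruction": "What year was the bridge built?", "leaf_node_type": "knowledge"},
    {"instruction": "Write a haiku about rain.", "leaf_node_type": "freeform_skill"},
    {"instruction": "Summarize the given paragraph.", "leaf_node_type": "grounded_skill"},
]

# Map each sample type to the pipeline that should process it (names assumed).
PIPELINES = {
    "knowledge": "knowledge_pipeline",
    "freeform_skill": "freeform_skills_pipeline",
    "grounded_skill": "grounded_skills_pipeline",
}

def route(sample: dict) -> str:
    """Pick a generation pipeline based on the sample's metadata column."""
    return PIPELINES[sample["leaf_node_type"]]

# Round-trip through JSONL, as the intermediate files on disk would.
jsonl = "\n".join(json.dumps(r) for r in records)
routed = [route(json.loads(line)) for line in jsonl.splitlines()]
print(routed)
```

The point of carrying the type in a column, rather than in filenames or in-memory state, is that any later stage can re-read the intermediate file and still know which pipeline each sample belongs to.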
Overall lgtm, great work here @bbrowning
I have a minor comment, but we don't have to get it in with this PR.
I like the suggestion, @aakankshaduggal, so I'm incorporating it directly here. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
This doesn't move things out into separate files yet, but it does split the existing functionality of `generate_data` into multiple discrete steps and changes `generate_data` to just call those steps. This is a step towards cleaner separation between the steps and creating top-level Python APIs for each discrete step, for advanced use cases that don't just want an entire single-step generation pipeline. Signed-off-by: Ben Browning <[email protected]>
Instead of hardcoding this to always be 3, add a parameter with a default of 3 when converting our seed examples to the test output dataset. Co-authored-by: Aakanksha Duggal <[email protected]> Signed-off-by: Ben Browning <[email protected]>
Force-pushed from 69dd254 to 6c8544e
This adds a new docs/examples/mix_datasets folder with a couple of example recipes, two sample datasets, and an example_mixing.py Python script to show how to mix datasets. This also adds a test_examples.py file that actually runs our examples, ensuring they work without error and generate the expected mixed datasets. Signed-off-by: Ben Browning <[email protected]>
Force-pushed from 5e32b29 to 1f9394d
Rebased this to fix a merge conflict, and added a few examples of using the mixing API, along with a test to ensure those examples work as expected.
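As a rough illustration of what recipe-driven mixing does conceptually (the recipe keys and sampling behavior here are assumptions for the sketch, not the actual `mix_datasets` recipe format):

```python
import random

# A hypothetical mixing recipe: each entry names a dataset and how many
# samples to draw from it. Key names are illustrative only.
recipe = {
    "datasets": [
        {"path": "dataset_1.jsonl", "sampling_size": 2},
        {"path": "dataset_2.jsonl", "sampling_size": 1},
    ]
}

# Stand-in datasets keyed by path, instead of reading real JSONL files.
datasets = {
    "dataset_1.jsonl": [{"id": f"d1-{i}"} for i in range(5)],
    "dataset_2.jsonl": [{"id": f"d2-{i}"} for i in range(3)],
}

def mix(recipe: dict, datasets: dict, seed: int = 0) -> list[dict]:
    """Draw the requested number of samples from each dataset, then shuffle
    so the mixed output isn't grouped by source dataset."""
    rng = random.Random(seed)
    mixed = []
    for entry in recipe["datasets"]:
        mixed.extend(rng.sample(datasets[entry["path"]], entry["sampling_size"]))
    rng.shuffle(mixed)
    return mixed

mixed = mix(recipe, datasets)
print(len(mixed))  # 3
```

The example tests in the PR presumably assert something similar: that the mixed output has the expected size and draws from each source dataset per the recipe.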
Approving from a CLI point of view. We can iterate on the public APIs and on when they should move to core.
This separates out `instructlab.sdg.generate_data` into separate functions, where `generate_data` just calls into these separate functions instead of containing all the logic in itself directly.

This enables us to expose a new top-level API, `mix_datasets`, to allow users to mix datasets. It also lays the groundwork for how we might split some of the pre-processing and post-processing out into separate modules or other repositories, as there is now a clean separation between the various concerns that were previously all intertwined within `generate_data`.
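Conceptually, the refactor turns one monolithic function into a thin composition of stage functions, roughly as below. The stage function names are placeholders, not the exact new APIs, and the real code hands intermediate JSONL files between stages rather than in-memory lists:

```python
# Placeholder stage functions standing in for the split-out steps.
def preprocess(leaf_nodes: list[str]) -> list[str]:
    return [f"samples({n})" for n in leaf_nodes]

def generate(samples: list[str]) -> list[str]:
    return [f"generated({s})" for s in samples]

def postprocess(generated: list[str]) -> list[str]:
    return [f"final({g})" for g in generated]

def generate_data(leaf_nodes: list[str]) -> list[str]:
    """The old monolith, now just a composition of stages. Each stage runs
    across ALL leaf nodes before the next begins: preprocess A & B, then
    generate for A & B, then postprocess A & B."""
    return postprocess(generate(preprocess(leaf_nodes)))

result = generate_data(["leaf_a", "leaf_b"])
print(result)
```

Because each stage is its own function with file-based inputs and outputs, any single stage (such as preprocessing or mixing) can be exposed as its own top-level API or CLI without dragging in the rest of the pipeline.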