WIP: multimodal support #227
base: main
Conversation
Thanks for the great work! 🚀
Couldn't look at everything yet, but here are some first comments; I'll continue tomorrow.
@@ -298,11 +305,7 @@ def run(self) -> None:
                raise ValueError(f"Both chosen and rejected loss masking spans must be specified if one is specified.")

        # route tokenize function
        if self._config.dataset.loss_masking_spans is not None:
            if self._config.dataset.loss_masking_spans not in dataset.column_names:
                raise ValueError(f"Dataset does not have spans field '{self._config.dataset.loss_masking_spans}'.")
Why remove this?
I moved everything into a single tokenize function, since the combinations were getting to be too many (images, loss spans, audio next).
Should we still keep the check on the column names?
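As a rough sketch, the validation could be kept inside the single unified tokenize entry point; the method and config field names below (e.g. _tokenize_document, dataset.images) are placeholders for illustration, not the actual code in this PR:

    # Hypothetical sketch: one tokenize entry point that still validates the
    # configured column names before dispatching on the available modalities.
    # Field and helper names (loss_masking_spans, images, _tokenize_document)
    # are illustrative and may not match the PR.
    def _tokenize(self, dataset):
        spans_field = self._config.dataset.loss_masking_spans
        images_field = self._config.dataset.images

        # Keep the explicit column-name checks even though routing is unified.
        if spans_field is not None and spans_field not in dataset.column_names:
            raise ValueError(f"Dataset does not have spans field '{spans_field}'.")
        if images_field is not None and images_field not in dataset.column_names:
            raise ValueError(f"Dataset does not have images field '{images_field}'.")

        # A single document-level tokenizer handles every combination of text,
        # loss-masking spans, and images (and audio later) instead of routing
        # to a separate tokenize function per combination.
        return dataset.map(self._tokenize_document, desc="Tokenizing")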
Co-authored-by: sohamparikh <[email protected]>
Co-authored-by: RaymondLi0 <[email protected]>
…M into soham/pixtral-support
@@ -548,6 +563,350 @@ def _get_mlp_converters(self, fast_llm_prefix: str, hf_prefix: str) -> list[Weig
    ]


class PixtralHuggingfaceCheckpointHandler(WeightAndBiasConverterMixin, HuggingfaceStateDictCheckpointHandler):
Is my understanding correct that this class handles conversion for the vision encoder (to a PixtralVisionModel model), whereas LlavaHuggingfaceCheckpointHandler is the class that handles conversion of the full model, and is the one to use to convert pixtral-12b?
Yes, exactly! It's inspired by the config.json format on HF, and more and more omni models seem to be converging on a similar layout.
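For reference, the nested layout that config.json uses for these Llava-style checkpoints looks roughly like the following; the values are placeholders, only the nesting is the point:

    # Rough shape of the HF config.json for a Llava-style multimodal checkpoint
    # such as pixtral-12b; values are placeholders, only the structure matters.
    llava_config = {
        "architectures": ["LlavaForConditionalGeneration"],
        "model_type": "llava",
        "text_config": {
            # language-model hyperparameters (covered by the full-model converter)
        },
        "vision_config": {
            # vision-encoder (PixtralVisionModel) hyperparameters
            # (covered by PixtralHuggingfaceCheckpointHandler)
            "model_type": "pixtral",
        },
    }

So the Pixtral handler maps the vision_config side, while the Llava handler stitches it together with the language-model conversion.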
✨ Description
Please provide a brief summary of the changes, relevant motivation, and context.
Include any related issue numbers or links to discussions, and explain why this change is necessary.
Closes #
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.