
[Callbacks] Consolidate Saving Methods #1168

Merged
merged 10 commits into main from kylesayrs/consolidate-saving on Feb 25, 2025

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Feb 18, 2025

Purpose

Background

All of the things that need to be done during saving:

  1. Save the model weights, potentially compressed
  2. Save the processor
  3. Update the recipe checkpoint
  4. Copy any necessary python files from the model cache
  5. Only save on the main process

After these changes, (1, 2, 3, 4) will be done within the save_pretrained function, and (5) will be the responsibility of the caller. (3) will be implemented by #1160 so as not to conflict with existing logic in pre_init.

All of the places where a model is saved are:

  • If an output dir is specified, at the end of the main function
  • Between stages of the stage runner
  • Between epochs of the HF Trainer
  • By the user after oneshot/training completes

After these changes, all of these locations will be replaced by a single save_checkpoint function, which calls save_pretrained to perform all the necessary steps.
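For illustration, a minimal sketch of what save_checkpoint could look like (the signature and argument names below are assumptions, not the final implementation):

```python3
def save_checkpoint(model, processor, save_path: str):
    """Hedged sketch: single entrypoint for saving a model and its processor."""
    # (1) save the model weights, potentially compressed, and (4) copy any
    # necessary python files from the model cache -- both handled inside the
    # wrapped save_pretrained; (3) recipe updates land separately via #1160
    model.save_pretrained(save_path)

    # (2) save the processor/tokenizer alongside the model
    if processor is not None:
        processor.save_pretrained(save_path)
```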

Changes

  • Remove save_model_and_recipe
    • Saving recipes is now done by save_pretrained function
  • Implement save_checkpoint
    • Single entrypoint for saving a model and its processor
    • Performs actions (1, 2, 4)
  • Replace all locations where a model is saved with save_checkpoint
    • All applicable callers now handle (5) by only saving on the main process (see the sketch after this list)
  • Remove support for modify_fsdp_model_save_pretrained and unwrap_and_export_model, to be added back in a future release
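As noted above, (5) stays with the caller. A hedged example of what a call site might look like (the Accelerate-based main-process check is purely illustrative, not necessarily how llm-compressor gates it):

```python3
from accelerate import PartialState

# only the main process writes the checkpoint (responsibility (5));
# save_checkpoint here refers to the sketch above
if PartialState().is_main_process:
    save_checkpoint(model, processor, save_path="./stage_checkpoint")
```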

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs added the `ready` (When a PR is ready for review) label Feb 18, 2025
@kylesayrs kylesayrs changed the title Consolidate Saving Methods [Callbacks] Consolidate Saving Methods Feb 19, 2025
@kylesayrs kylesayrs marked this pull request as ready for review February 19, 2025 00:23
@kylesayrs kylesayrs self-assigned this Feb 19, 2025
Collaborator

@horheynm horheynm left a comment


Having one central location to carry out saving logic sounds great!

Could you map out when the current saving logic is carried out and through which pathway, and how the new changes take over that logic? E.g., when does each of the different saving paths now get carried out?

We do currently support FSDP. Once the stage runner is removed, the assumption that oneshot pathways will not need FSDP support will be valid.

@kylesayrs
Collaborator Author

kylesayrs commented Feb 21, 2025

@horheynm All the answers to your questions are in the PR description. W.r.t. FSDP, it is not supported now but will be at a later date (soon).

horheynm previously approved these changes Feb 24, 2025
@kylesayrs
Collaborator Author

@dsikka This is ready for merge

@kylesayrs kylesayrs dismissed stale reviews from brian-dellabetta and horheynm via f2411ed February 24, 2025 21:37
@dsikka
Collaborator

dsikka commented Feb 25, 2025

Can you fix conflicts?

Collaborator

@horheynm horheynm left a comment


Nice!

Collaborator

@dsikka dsikka left a comment


LGTM.
One clarification question.

@dsikka dsikka merged commit 6e101b2 into main Feb 25, 2025
7 checks passed
@dsikka dsikka deleted the kylesayrs/consolidate-saving branch February 25, 2025 15:46
dsikka pushed a commit that referenced this pull request Feb 26, 2025
## Purpose ##
* Remove pre_initialize_structure to simplify codebase
* Fix recipe appending when appending a recipe to a model which
already has a recipe
* Remove misleading logging messages

```
2025-02-17T17:48:38.477750-0500 | _check_create_state | INFO - State created for compression lifecycle
2025-02-17T17:48:38.478670-0500 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2025-02-17T17:48:38.478836-0500 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
```

## Prerequisites ##
* #1168

## Follow-ups ##
* Remove double initialization

## Changes ##
The preinitialization step used to fulfill a few purposes:
* Construct the lifecycle state
  * This is now done by the dataclass directly
```python3
- state: Optional[State] = None
+ state: Optional[State] = field(default_factory=State)
```
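For context, a small self-contained example of the `field(default_factory=...)` pattern used above (the class bodies are illustrative placeholders, not the actual definitions):

```python3
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class State:
    model: Optional[object] = None  # placeholder field for illustration

@dataclass
class CompressionLifecycle:
    # each lifecycle instance now builds its own State eagerly,
    # removing the need for pre_initialize_structure to construct it
    state: Optional[State] = field(default_factory=State)
```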

* Populate state with model and recipe
  * This is now done (and has always been done) by `initialize`
* Some functions such as Trainer.init_model attempt to access the model
through the session before `initialize` is called. In these cases, we
can pass the model directly
```python3
trainer = Trainer(
-     model_init=get_session_model,
+     model_init=lambda: model,
```

* Prepend recipes to the recipe.yaml if the model has already been
compressed once
* Move this logic from preinitialization to the save_pretrained function
  * Consolidate all save pathways to use the same wrapped method

```python3
def save_pretrained_wrapper(...):
    update_and_save_recipe(model.name_or_path, save_directory)
```

* Provide a way for modifiers to influence the model after they have
already been applied
* This can still be enacted via recipe validation, but likely no
longer has a use case and shouldn't be done automatically; at most,
LLM Compressor should warn if the recipe configuration is invalid /
requires modification

* Create quantization modifier on GPTQ
  * This is now done within the `on_initialize` function
* In the future, this should be done by a high-level recipe validation
step
```python3
def on_initialize(...)
-     self.on_initialize_structure(state, **kwargs)
+     self._maybe_build_quant_modifier(state.model)
```

* Remove `EventType.order()` method which is unused

* Extend the `Recipe.simplify_recipe` class method to support strings

## Lifecycle ##
1. `create_session()` (doesn't do much and can be hidden behind
`initialize`)
2. `initialize(model=..., recipe=...)`
    1. Maybe `start` modifiers
3. `LifecycleCallback.event(...)`
    1. Maybe `start/end` modifiers
4. `finalize()`
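
A hedged sketch of stepping through this lifecycle from the caller's side; the import path and exact signatures are assumptions, and `model`, `recipe` are the caller's own objects:

```python3
from llmcompressor.core import create_session, initialize, finalize  # assumed import path

create_session()                         # 1. lightweight setup; could be hidden behind initialize
initialize(model=model, recipe=recipe)   # 2. may start modifiers

# 3. during training/calibration, LifecycleCallback.event(...) fires
#    and may start/end modifiers for each batch/step

finalize()                               # 4. finalize and tear down modifiers
```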

## Regression Evaluation ##
Main
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-Instruct-W4A16-G128,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.7482|±  |0.0122|
```

This branch
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-Instruct-W4A16-G128,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.7482|±  |0.0122|
```

---------

Signed-off-by: Kyle Sayers <[email protected]>