SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context #1114

Open · wants to merge 3 commits into main

Conversation

@kylesayrs (Collaborator) commented Jan 29, 2025

Purpose

Changes

  • Remove torch.cuda.empty_cache() in run_calibration_forward, which only affects the smoothquant and quantization modifiers (sparsegpt and wanda will soon use sequential pipelines instead)
  • Use calibration_forward_context in the smoothquant and quantization modifiers (see the sketch after this list)
  • Remove the smoothquant modifier's own use of torch.cuda.empty_cache()
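
For illustration, a minimal sketch of what the calibration loop looks like after these changes. `run_calibration` is a placeholder name (not the exact `run_calibration_forward` internals), and the import path for `calibration_forward_context` is assumed:

```python
import torch
import tqdm

# calibration_forward_context ships with llm-compressor; the import path below
# is an assumption for illustration purposes
from llmcompressor.utils.helpers import calibration_forward_context


def run_calibration(model: torch.nn.Module, dataloader) -> None:
    """Calibration forward pass without per-batch cache clearing."""
    with calibration_forward_context(model):  # among other things, disables the HF KV cache
        for batch in tqdm.tqdm(dataloader):
            model(**batch)
            # previously: torch.cuda.empty_cache() ran here after every batch,
            # an expensive synchronizing call that does not lower peak memory
```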

Testing

  • Performed memory analysis with and without torch.cuda.empty_cache() and calibration_forward_context, varying each independently

Smooth Quant

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

Quantization Modifier

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

It was also found that removing the empty_cache calls between operations reduced the runtime of the quantization modifier on llama3-8B by 78% (a rough reproduction sketch follows the timings below).

Before

```
512/512 [03:18<00:00,  2.58it/s]
Duration: 199.38174653053284
```

After

```
512/512 [00:42<00:00, 11.91it/s]
Duration: 44.374401807785034
```
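
For reference, a rough sketch of how this kind of timing comparison could be reproduced; `timed_calibration` and its arguments are hypothetical, not the harness used above:

```python
import time

import torch
import tqdm


def timed_calibration(model, dataloader, empty_cache_per_batch: bool) -> float:
    """Return wall-clock seconds for one pass over the calibration data."""
    start = time.time()
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader):
            model(**batch)
            if empty_cache_per_batch:
                # returns cached blocks to the CUDA driver; expensive when done per batch
                torch.cuda.empty_cache()
    torch.cuda.synchronize()  # ensure all queued kernels are included in the timing
    return time.time() - start
```

Running it once with `empty_cache_per_batch=True` and once with `False` corresponds to the Before/After numbers above.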

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs self-assigned this Jan 29, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jan 29, 2025
rahul-tuli previously approved these changes Jan 29, 2025

@rahul-tuli (Collaborator) left a comment


Did we run smoothquant with this change? I have certainly come across cases where we run into OOM without this line (even though I know this shouldn't alleviate the issue), and I also saw that error go away when the CUDA_LAUNCH_BLOCKING env variable was set. I'm good with this change as long as you've verified a smoothquant run! Thanks for the investigation.

@kylesayrs (Collaborator, Author)

@rahul-tuli That's a good enough reason to wait until some regression tests are finished. We should figure out why the OOM occurs and potentially account for it in the device map / fix the memory leaks.

@kylesayrs kylesayrs marked this pull request as draft January 29, 2025 17:01
@kylesayrs kylesayrs removed the ready When a PR is ready for review label Jan 29, 2025
@dsikka (Collaborator) left a comment


This method is used by more than just smoothquant, so if we're seeing the same effects Rahul described, we would need regression testing for modifiers beyond smoothquant before making this change. I would update the PR title/description to reflect this.

Smoothquant has its own empty_cache call, which is not targeted by this PR.

@kylesayrs (Collaborator, Author)

Yep, the title is a bit of a misnomer; it's more in reference to "fixing smoothquant" rather than a smoothquant-specific implementation.

@kylesayrs kylesayrs changed the title [Smoothquant] Remove empty cache call Remove empty cache call from calibration forward Jan 30, 2025
@kylesayrs (Collaborator, Author) commented Feb 5, 2025

Here are the results of memory profiling SmoothQuant under various implementations. Constants are the model (llama3.2-1b-instruct) and the batch size (16x2048).

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

| Implementation | Peak memory (bytes) |
| --- | --- |
| Standard w/ calib_context | 10047127552 |
| Smoothquant | 10584129536 |
| SQ w/o empty_cache | 10584129536 |
| SQ w/o empty_cache, w/ calib_context | 10047258624 |

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
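
The profiling harness itself isn't included in this PR; for reference, peak allocations like the ones tabulated above can be read from PyTorch's built-in counters. A minimal sketch building on the snippet above (whether the reported values correspond to allocated or reserved bytes is an assumption):

```python
import torch

torch.cuda.reset_peak_memory_stats()

with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)

print("peak allocated bytes:", torch.cuda.max_memory_allocated())
print("peak reserved bytes: ", torch.cuda.max_memory_reserved())
```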

@kylesayrs (Collaborator, Author) commented Feb 5, 2025

Here are the results of memory profiling the Quantization Modifier under various implementations. Constants are the model (llama3.2-1b-instruct) and the batch size (16x2048).

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

| Implementation | Peak memory (bytes) |
| --- | --- |
| Standard w/ calib_context | 10047127552 |
| Quantization Modifier | 11243553901 |
| QM w/o empty_cache | 11243553901 |
| QM w/o empty_cache, w/ calib_context | 10047128685 |

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```

@kylesayrs (Collaborator, Author)

From this brief analysis, I believe it's safe to conclude that

  1. Removing torch.cuda.empty_cache does not lead to increased peak memory usage
  2. Adding calibration_forward_context does decrease peak memory usage, specifically due to disabling the HF KV cache (see the sketch below)
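
For context, a rough approximation of what such a context manager does, assuming it combines `torch.no_grad()` with disabling the HF KV cache; this is an illustrative sketch, not the actual llm-compressor implementation:

```python
import contextlib

import torch


@contextlib.contextmanager
def approx_calibration_forward_context(model):
    """Approximation: run forward passes without gradients or the HF KV cache."""
    original_use_cache = model.config.use_cache
    model.config.use_cache = False  # stops KV tensors from accumulating during calibration
    try:
        with torch.no_grad():
            yield
    finally:
        model.config.use_cache = original_use_cache
```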

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs added the ready When a PR is ready for review label Feb 5, 2025
@kylesayrs kylesayrs marked this pull request as ready for review February 5, 2025 19:58
@kylesayrs kylesayrs changed the title Remove empty cache call from calibration forward SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context Feb 5, 2025
@dsikka (Collaborator) left a comment


Can you resolve conflicts?

Labels: ready (When a PR is ready for review)

Successfully merging this pull request may close these issues:

  • Error when quantizing LLama 3.3 70b to FP8
  • SmoothQuant failing on multi-gpus

3 participants