SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context #1114

Open · wants to merge 3 commits into main

Conversation

@kylesayrs (Collaborator) commented Jan 29, 2025

Purpose

Changes

  • Remove torch.cuda.empty_cache() in run_calibration_forward, which only affects the smoothquant and quantization modifiers (sparsegpt and wanda will soon use sequential pipelines instead)
  • Use calibration_forward_context in the smoothquant and quantization modifiers (see the sketch after this list)
  • Remove the smoothquant modifier's own use of torch.cuda.empty_cache()
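
For illustration, a minimal sketch of what the calibration loop looks like after these changes. `run_calibration` is a placeholder name (not the exact `run_calibration_forward` internals), and the import path for `calibration_forward_context` is assumed:

```python
import torch
import tqdm

# calibration_forward_context ships with llm-compressor; the import path below
# is an assumption for illustration purposes
from llmcompressor.utils.helpers import calibration_forward_context


def run_calibration(model: torch.nn.Module, dataloader) -> None:
    """Calibration forward pass without per-batch cache clearing."""
    with calibration_forward_context(model):  # among other things, disables the HF KV cache
        for batch in tqdm.tqdm(dataloader):
            model(**batch)
            # previously: torch.cuda.empty_cache() ran here after every batch,
            # an expensive synchronizing call that does not lower peak memory
```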

Testing

  • Performed memory analysis with and without torch.cuda.empty_cache() and calibration_forward_context, varying each independently

Smooth Quant

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

Quantization Modifier

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

It was also found that removing the empty_cache calls between operations reduced the runtime of the quantization modifier on llama3-8B by 78% (a rough reproduction sketch follows the timings below).

Before

```
512/512 [03:18<00:00,  2.58it/s]
Duration: 199.38174653053284
```

After

```
512/512 [00:42<00:00, 11.91it/s]
Duration: 44.374401807785034
```
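
For reference, a rough sketch of how this kind of timing comparison could be reproduced; `timed_calibration` and its arguments are hypothetical, not the harness used above:

```python
import time

import torch
import tqdm


def timed_calibration(model, dataloader, empty_cache_per_batch: bool) -> float:
    """Return wall-clock seconds for one pass over the calibration data."""
    start = time.time()
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader):
            model(**batch)
            if empty_cache_per_batch:
                # returns cached blocks to the CUDA driver; expensive when done per batch
                torch.cuda.empty_cache()
    torch.cuda.synchronize()  # ensure all queued kernels are included in the timing
    return time.time() - start
```

Running it once with `empty_cache_per_batch=True` and once with `False` corresponds to the Before/After numbers above.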

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs self-assigned this Jan 29, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jan 29, 2025
rahul-tuli previously approved these changes Jan 29, 2025

@rahul-tuli (Collaborator) left a comment


Did we run smoothquant with this change? I have certainly come across cases where we run into OOM without this line (even though I know this shouldn't alleviate the issue), and I also saw that error go away when the CUDA_LAUNCH_BLOCKING env variable was set. I'm good with this change as long as you've verified a smoothquant run! Thanks for the investigation.

@kylesayrs (Collaborator, Author)

@rahul-tuli That's a good enough reason to wait until some regression tests are finished. We should figure out why the OOM occurs and potentially account for it in the device map / fix the memory leaks.

@kylesayrs kylesayrs marked this pull request as draft January 29, 2025 17:01
@kylesayrs kylesayrs removed the ready When a PR is ready for review label Jan 29, 2025
@dsikka (Collaborator) left a comment


This method is used by more than just smoothquant, so if we're seeing the same effects Rahul described, we would need regression testing for modifiers beyond smoothquant before making this change. I would update the PR title/description to reflect this.

Smoothquant has its own empty_cache call, which is not targeted by this PR.

@kylesayrs (Collaborator, Author)

Yep, the title is a bit of a misnomer; it's more in reference to "fixing smoothquant" rather than a smoothquant-specific implementation.

@kylesayrs kylesayrs changed the title [Smoothquant] Remove empty cache call Remove empty cache call from calibration forward Jan 30, 2025
@kylesayrs (Collaborator, Author) commented Feb 5, 2025

Here are the results of memory profiling SmoothQuant under various implementations. Constants are the model (llama3.2-1b-instruct) and the batch size (16x2048).

![20c0e104-2353-4a09-9556-f953075205d2](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

| Implementation | Peak memory (bytes) |
| --- | --- |
| Standard w/ calib_context | 10047127552 |
| Smoothquant | 10584129536 |
| SQ w/o empty_cache | 10584129536 |
| SQ w/o empty_cache, w/ calib_context | 10047258624 |

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
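
The profiling harness itself isn't included in this PR; for reference, peak allocations like the ones tabulated above can be read from PyTorch's built-in counters. A minimal sketch building on the snippet above (whether the reported values correspond to allocated or reserved bytes is an assumption):

```python
import torch

torch.cuda.reset_peak_memory_stats()

with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)

print("peak allocated bytes:", torch.cuda.max_memory_allocated())
print("peak reserved bytes: ", torch.cuda.max_memory_reserved())
```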

@kylesayrs (Collaborator, Author) commented Feb 5, 2025

Here are the results of memory profiling the Quantization Modifier under various implementations. Constants are the model (llama3.2-1b-instruct) and the batch size (16x2048).

![0a0451e2-108e-40fb-be5c-e9619928ab67](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

| Implementation | Peak memory (bytes) |
| --- | --- |
| Standard w/ calib_context | 10047127552 |
| Quantization Modifier | 11243553901 |
| QM w/o empty_cache | 11243553901 |
| QM w/o empty_cache, w/ calib_context | 10047128685 |

Standard w/ calib_context:

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```

@kylesayrs (Collaborator, Author)

From this brief analysis, I believe it's safe to conclude that

  1. Removing torch.cuda.empty_cache does not lead to increased peak memory usage
  2. Adding calibration_forward_context does decrease peak memory usage, specifically due to disabling the HF KV cache (see the sketch below)
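
For context, a rough approximation of what such a context manager does, assuming it combines `torch.no_grad()` with disabling the HF KV cache; this is an illustrative sketch, not the actual llm-compressor implementation:

```python
import contextlib

import torch


@contextlib.contextmanager
def approx_calibration_forward_context(model):
    """Approximation: run forward passes without gradients or the HF KV cache."""
    original_use_cache = model.config.use_cache
    model.config.use_cache = False  # stops KV tensors from accumulating during calibration
    try:
        with torch.no_grad():
            yield
    finally:
        model.config.use_cache = original_use_cache
```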

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs added the ready When a PR is ready for review label Feb 5, 2025
@kylesayrs kylesayrs marked this pull request as ready for review February 5, 2025 19:58
@kylesayrs kylesayrs changed the title Remove empty cache call from calibration forward SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context Feb 5, 2025
@dsikka (Collaborator) left a comment


Can you resolve conflicts?

Labels: ready (When a PR is ready for review)

Successfully merging this pull request may close these issues:

  • Error when quantizing LLama 3.3 70b to FP8
  • SmoothQuant failing on multi-gpus

3 participants