SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context #1114

base: main
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Did we run smoothquant with this change? I have certainly come across cases where we run into OOM without this line (even though I know this shouldn't alleviate the issue); I also saw that error go away when the CUDA_LAUNCH_BLOCKING env variable was set. I'm good with this change as long as you've verified a smoothquant run! Thanks for the investigation.
@rahul-tuli That's a good enough reason to wait until some regression tests are finished. We should figure out why OOM occurs and potentially add that to the device map / fix memory leaks.
This method is used by more than just smoothquant, so we would need to do regression testing for other modifiers before making this change, if we're seeing the same effects as what Rahul has described. I would update the PR title/description to reflect this.
smoothquant has its own empty cache call which is not targeted by this PR
Yep, the title is a bit of a misnomer; it's more a reference to "fixing smoothquant" than to the smoothquant-specific implementation.
Here are the results of memory profiling SmoothQuant under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch size (16x2048).

![graph](https://github.com/user-attachments/assets/a6727da5-8350-449b-82b6-eff8f6d3d592)

Standard w/ calib_context: 10047127552

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
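For reference, here is a minimal sketch of how a peak-memory figure like the one above can be collected. This is an assumed methodology, not necessarily the exact harness used here; `model`, `dataloader`, and `calibration_forward_context` are as in the snippet above.

```python
import torch
import tqdm

# Reset the allocator's high-water mark before the calibration pass
torch.cuda.reset_peak_memory_stats()

with calibration_forward_context(model):  # the helper under discussion in this PR
    for batch in tqdm.tqdm(dataloader):
        model(**batch)

# Peak bytes held by live tensors during the pass (e.g. 10047127552 above)
print(torch.cuda.max_memory_allocated())
```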
Here are the results of memory profiling Quantization Modifier under various implementations. Constants are the model used (llama3.2-1b-instruct) and the batch size (16x2048).

![graph](https://github.com/user-attachments/assets/325c2124-734f-40eb-ac3b-77debf45389e)

Standard w/ calib_context: 10047127552

```python
# standard forward pass
with calibration_forward_context(model):
    for batch in tqdm.tqdm(dataloader):
        model(**batch)
```
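For readers unfamiliar with the helper, the following is a conceptual stand-in for what a calibration context of this kind provides; it is a sketch, not llm-compressor's actual implementation.

```python
import contextlib
import torch

@contextlib.contextmanager
def calibration_context_sketch(model: torch.nn.Module):
    """Conceptual stand-in for a calibration forward context: run forward
    passes in eval mode with gradient tracking disabled, then restore the
    model's original training mode."""
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            yield
    finally:
        model.train(was_training)
```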
From this brief analysis, I believe it's safe to conclude that removing these `empty_cache` calls does not meaningfully affect peak memory usage.
Signed-off-by: Kyle Sayers <[email protected]>
Remove torch.cuda.empty_cache, use calibration_forward_context
Can you resolve conflicts?
Purpose

The `torch.cuda.empty_cache()` kernel sometimes fails to launch. Given that `empty_cache` does not actually free memory that wouldn't already have been freed by the Python garbage collector + PyTorch caching allocator, it should be safe to remove this call.
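As an illustration of that claim (not part of this PR's diff), the following uses only standard PyTorch APIs to show that `empty_cache()` cannot release memory still referenced from Python:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes held by the caching allocator

torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # unchanged: x is still referenced

del x                                  # drop the Python reference first
torch.cuda.empty_cache()               # now the cached block can be returned
print(torch.cuda.memory_reserved())    # reserved memory shrinks
```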
Changes

- Remove `torch.cuda.empty_cache()` in `run_calibration_forward`, which only affects the smoothquant and quantization modifiers (sparsegpt and wanda will soon use sequential pipelines instead)
- Use `calibration_forward_context` in the smoothquant and quantization modifiers
- Remove the `torch.cuda.empty_cache()` call made by the smoothquant modifier
Testing

- Tested removing `torch.cuda.empty_cache` and adding `calibration_forward_context` independently

Smooth Quant

Quantization Modifier
It was also found that removing the `empty_cache` calls in between each operation reduced the runtime of Quantization Modifier on llama3-8B by 78%.

Before

After
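As a rough illustration of why per-operation `empty_cache()` calls are expensive (each call synchronizes with the CUDA driver), here is a hypothetical micro-benchmark, not the benchmark used for the numbers above:

```python
import time
import torch

def run(ops: int, clear_cache: bool) -> float:
    """Time a loop of GPU ops, optionally calling empty_cache() after each."""
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(ops):
        x = x @ x.T / x.norm()  # stand-in for one calibration operation
        if clear_cache:
            torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"with empty_cache:    {run(100, clear_cache=True):.3f}s")
print(f"without empty_cache: {run(100, clear_cache=False):.3f}s")
```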