
Contributor

@bayo-ibm bayo-ibm commented May 19, 2025

Description of the change

This PR enables faster loading of an already-quantized model by calling only the functions and sub-functions needed to load the model and skipping those needed only for quantization. An inference argument was added to the fms_mo arguments to activate this path.
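
As a rough illustration of the idea (not the fms_mo implementation), the sketch below shows an inference flag short-circuiting quantization-only stages. The function name `prepare_model`, the stage names, and the `PrepLog` helper are all hypothetical and exist only to make the control flow visible.

```python
from dataclasses import dataclass, field


@dataclass
class PrepLog:
    """Records which (hypothetical) stages ran, so the two paths can be compared."""

    steps: list = field(default_factory=list)


def prepare_model(checkpoint: str, inference: bool = False) -> PrepLog:
    """Hypothetical prep routine; names and stages are assumptions, not the fms_mo API."""
    log = PrepLog()
    # Needed on both paths: the quantized modules must exist before weights load.
    log.steps.append("build_quantized_modules")
    if not inference:
        # Quantization-only work, skipped when the checkpoint is already quantized.
        log.steps.append("collect_calibration_data")
        log.steps.append("run_calibration")
    log.steps.append(f"load_weights:{checkpoint}")
    return log
```

With `inference=True` only the build and load stages are recorded, which is where the loading speed-up would come from under these assumptions.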

We need two new functions, which look like:

  1. fp8_model_load( <an fp8 checkpoint produced by llm-compressor> ): loads an existing fp8 checkpoint into fms-mo and configures the proper quantizers when possible. This may require parsing the quantization block in config.json or an equivalent file.
  2. fp8_model_save( <qmodel from fms-mo> ): saves a compatible fp8 checkpoint that can be consumed by vllm or aiu-compiler.
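
The config-parsing step mentioned in item 1 could look roughly like the sketch below. It is a minimal example that assumes the checkpoint directory contains a config.json and that the quantization block sits under a top-level `quantization_config` key; both the helper name `read_quantization_block` and that default key are assumptions for illustration, not part of this PR.

```python
import json
from pathlib import Path
from typing import Optional


def read_quantization_block(
    ckpt_dir: str, key: str = "quantization_config"
) -> Optional[dict]:
    """Return the quantization block from <ckpt_dir>/config.json, or None if absent.

    Hypothetical helper: the key name is an assumption and may differ per exporter.
    """
    config_path = Path(ckpt_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    block = config.get(key)
    # Only a dict-valued block is useful for configuring quantizers.
    return block if isinstance(block, dict) else None
```

Returning None rather than raising lets a loader fall back to its default quantizer settings when the checkpoint carries no quantization metadata.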

Related issue number

None

How to verify the PR

The PR was validated by running Direct Quantization with SmoothQuant and passing the inference argument along with the rest of the arguments. Validation was done both with and without Qbmm.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@bayo-ibm
Contributor Author

I added the new feature that allows fast model loading for inference.

@BrandonGroth BrandonGroth changed the title from "Bayo local" to "feat: Fast model loading for inference" May 20, 2025
@github-actions github-actions bot added the feat label May 20, 2025
fms_mo/prep.py Outdated
"""Check if model is already quantized - do not want to quantize twice if so"""
return any(isinstance(m, quantized_modules) for m in model.modules())

def swap_qbmm(model, qcfg):
Collaborator

@BrandonGroth BrandonGroth May 20, 2025

Need to add a docstring and type annotations to the function args.

Contributor Author

done

"""Read config in json format, work together with qconfig_save"""
config = get_recipe(fname)

Collaborator

Dead spacing here. Delete it.

Contributor Author

corrected

Collaborator

@BrandonGroth BrandonGroth left a comment

A few more nitpicks.

Also, please run the following and fix anything that lint or spellcheck flags. "tox -e fix" will automatically change files; you just have to add and commit them. If multiple changes are needed, package them up in one commit if possible.

tox -e fix
tox -e lint
tox -e spellcheck

Signed-off-by: omobayode.fagbohungbe <[email protected]>

2 participants