Add support for loading models using the Run:AI Model Streamer #317
Description
Component
Helm Chart
Desired use case or feature
Currently the llm-d Helm chart supports two protocols for `sampleApplication.model.modelArtifactURI` (examples below):
- `hf://`: This pulls the model just-in-time when vLLM starts up.
- `pvc://`: This pulls the model from a PVC. Optionally, a model can be downloaded with a transfer job from HuggingFace and stored in the specified PVC.
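For reference, a minimal sketch of how these two forms appear in a values file today (the `<...>` segments are placeholders; only the parameter path is taken from the chart):

```yaml
# Illustrative values.yaml fragment; <...> segments are placeholders.
sampleApplication:
  model:
    modelArtifactURI: hf://<org>/<model-name>            # pulled just-in-time when vLLM starts up
    # or, to load from an existing PVC:
    # modelArtifactURI: pvc://<pvc-name>/<path-to-model>
```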
vLLM also supports streaming a model directly from object storage with higher concurrency via `--load-format runai_streamer` (docs). This allows loading from an object storage backend or filesystem, rather than from HuggingFace or a PVC with the default loader.
Proposed solution
In order to use the model streamer, vLLM needs additional command line arguments (see the sketch below):
- `--load-format runai_streamer`
- Model Name: Can be specified either through the `--model` argument or directly as a served model name (eg: `--model=s3://<path-to-model>` or `vllm serve s3://<path-to-model>`).
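As a point of reference, a minimal sketch of a vLLM container spec carrying these flags (the `command`/`args` split, image tag, and `s3://` path are illustrative assumptions, not taken from the chart):

```yaml
# Illustrative Kubernetes container snippet; the image tag and S3 path are placeholders.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command: ["vllm", "serve"]
    args:
      - "s3://<path-to-model>"      # the model name is given directly as the object storage path
      - "--load-format"
      - "runai_streamer"
```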
I propose adding an optional `.modelService.vllm.loadFormat` parameter to the Helm chart. When it is set to `runai_streamer`, relax the "Protocol" constraint (i.e. skip the model source check when the `runai_streamer` vLLM load format is specified). The `loadFormat` value would also be passed to vLLM as the `--load-format` command line argument.
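A hedged sketch of what the resulting values could look like (the `loadFormat` key is the proposal above; the `s3://` path is a placeholder):

```yaml
# Hypothetical values.yaml fragment for the proposed parameter.
modelService:
  vllm:
    loadFormat: runai_streamer                 # optional; adds --load-format runai_streamer to the vLLM args
sampleApplication:
  model:
    modelArtifactURI: s3://<path-to-model>     # protocol check relaxed when loadFormat is runai_streamer
```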
- When `pvc://` is specified as the protocol, the `pvc://` protocol continues to be used: the model streamer can reference the mounted path just as the default loader does today.
- When the protocol is not recognized (eg: `s3://`), the `modelArtifactURI` will be used as the model name, passing through `s3://<path-to-model>` as the served model argument to vLLM, as is done today for the PVC case (the PVC protocol path suffix is passed in as the `.ModelPath`). See the templating sketch below.
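To make this behavior concrete, a rough sketch of the kind of templating it implies (the value paths, `default` handling, and `.ModelPath` usage are illustrative assumptions about the chart internals, not its actual templates):

```yaml
# Hedged Helm template sketch; not taken from the llm-d chart's real templates.
{{- $uri := .Values.sampleApplication.model.modelArtifactURI }}
{{- if eq (.Values.modelService.vllm.loadFormat | default "") "runai_streamer" }}
args:
  {{- if hasPrefix "pvc://" $uri }}
  - {{ .ModelPath }}   # mounted PVC path, resolved the same way as with the default loader
  {{- else }}
  - {{ $uri }}         # unrecognized protocol (eg: s3://) is passed through as the model name
  {{- end }}
  - "--load-format"
  - "runai_streamer"
{{- end }}
```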
Additionally, loading can be tuned with the `--model-loader-extra-config` parameter or with environment variables passed to vLLM. Command line args can be passed in through `.sampleApplication.decode.extraArgs` or `.sampleApplication.prefill.extraArgs` today, but there may be a better way to pass these parameters consistently to all instances of vLLM (eg: `.modelService.vllm.extraArgs` and `.modelService.vllm.extraEnvVars` parameters).
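For illustration, a hedged sketch of how tuning could be expressed today versus with the hypothetical shared keys (the concurrency value and endpoint env var are examples only; only `extraArgs` under decode/prefill is known to exist today):

```yaml
# Today: repeat the tuning flags per role via the existing extraArgs.
sampleApplication:
  decode:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 16}'                  # example streamer tuning value; illustrative only
  prefill:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 16}'

# Possible alternative: set once for all vLLM instances (hypothetical keys from this issue).
# modelService:
#   vllm:
#     extraArgs:
#       - "--model-loader-extra-config"
#       - '{"concurrency": 16}'
#     extraEnvVars:
#       - name: AWS_ENDPOINT_URL               # example: may be needed for S3-compatible object storage
#         value: https://<endpoint>
```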
Alternatives
Another option could be to add a new `runai_streamer` "Protocol" to the `modelArtifactURI` chart parameter. This could encode either the object storage URI or the filesystem path.
- If an object storage system is used, the `runai_streamer` protocol would be unwrapped to identify the underlying model protocol. For example, `runai_streamer://s3://<path_to_model>` would allow the suffix `s3://<path_to_model>` to be used as the model name for the inference server.
- If a local filesystem is used, this complicates things, as the user may want to specify a PVC, so this may require wrapping protocols (eg: `runai_streamer://pvc://<pvc_name>/<path_to_model>`); both forms are illustrated below.
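A brief illustration of what this alternative would look like in the values file (same placeholders as above):

```yaml
# Hypothetical wrapped-protocol form for the alternative discussed above.
sampleApplication:
  model:
    # object storage: the suffix after runai_streamer:// becomes the served model name
    modelArtifactURI: runai_streamer://s3://<path_to_model>
    # PVC-backed filesystem: requires nesting two protocols
    # modelArtifactURI: runai_streamer://pvc://<pvc_name>/<path_to_model>
```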
I think this option is less intuitive for the end user, as it could lead to a complex `modelArtifactURI` and to more challenging un-nesting logic in the llm-d launcher script.
Additional context or screenshots
No response