
Download model from Object Storage #69

Open
nitin302 opened this issue Feb 6, 2025 · 8 comments
Labels
feature request New feature or request

Comments

@nitin302

nitin302 commented Feb 6, 2025

Need a solution to download a model from AWS storage, Azure Blob Storage or MinIO.

Creating this issue as requested here - #67 (comment)

@ApostaC added the feature request label Feb 6, 2025
@ApostaC
Collaborator

ApostaC commented Feb 6, 2025

Thanks for helping us organize the feature requests! Will work on this soon!

@xqe2011

xqe2011 commented Feb 7, 2025

Looking forward to this feature too. I am using the RunAI Model Streamer to load models directly from S3 now.

@noa-neria

noa-neria commented Feb 9, 2025

Hi from the RunAI team,

Happy to confirm that you can load models directly from object storage in the production stack by adding the necessary flags and credentials to your configuration file.

Using the following configuration file, we deployed vLLM with the RunAI Model Streamer to read the model from S3.
Any S3-compatible object storage, such as GCS, MinIO, etc., is also supported with additional flags, as explained here

servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "v0.7.1"
    modelURL: "s3://core-llm/Llama-3-8b/"
    replicaCount: 1
    env:
    - name: AWS_ACCESS_KEY_ID
      value: "Your_key_here"
    - name: AWS_SECRET_ACCESS_KEY
      value: "Your_secret_here"
    - name: RUNAI_STREAMER_MEMORY_LIMIT
      value: "8388608000"

    pvcStorage: "5Gi"

    requestCPU: 1
    requestMemory: "30Gi"
    requestGPU: 1

    vllmConfig:
      extraArgs: ["--load-format", "runai_streamer"]
    hf_token: Your_token_here

RUNAI_STREAMER_MEMORY_LIMIT is an optional memory limit in bytes. If not specified, the RunAI streamer will allocate CPU memory equal to the total size of the model weights.
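
For sizing reference, a minimal sketch of the same env entry with the arithmetic spelled out (the figures in the comments are illustrative assumptions, not recommendations):

env:
# Cap the streamer's CPU staging buffer: 8388608000 bytes = 8000 MiB (~7.8 GiB).
# Without a limit the streamer may buffer up to the full weight size
# (roughly 16 GB for an 8B-parameter model in BF16, at 2 bytes per parameter),
# so requestMemory must leave room for it.
- name: RUNAI_STREAMER_MEMORY_LIMIT
  value: "8388608000"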

The RunAI streamer is an open source project integrated into vLLM.

The streamer provides direct, fast streaming of model weights from Safetensors files (from either a file system or object storage), saturating the storage bandwidth with parallel reads. Benchmarks can be found here
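
As a sketch of tuning those parallel reads through the chart (assuming extraArgs is passed straight through to vLLM; the concurrency value of 16 is an illustrative assumption):

vllmConfig:
  extraArgs:
  - "--load-format"
  - "runai_streamer"
  # Number of concurrent read threads used by the streamer, set via
  # vLLM's --model-loader-extra-config; 16 is an illustrative value.
  - "--model-loader-extra-config"
  - '{"concurrency": 16}'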

@xqe2011
Copy link

xqe2011 commented Feb 9, 2025

@noa-neria Thank you. I have tested this method, but the RunAI Streamer has some limitations when loading from S3. For example, it can't load models that need --trust-remote-code (e.g. DeepSeek-R1-Qwen-32B). Additionally, this method bypasses the Linux file cache, so every load transfers the weights from the remote again, which takes more time. We eventually fell back to the vLLM charts.

@noa-neria

@xqe2011 we appreciate your feedback!

The --trust-remote-code flag is fully supported (it is unrelated to the streamer) and can be passed in extraArgs.

ModelScope support has not yet been added to the RunAI loader, and we are working to fix it.
However, this is not related to downloading from object storage. Like the default loader in vLLM, the RunAI loader should download from the repository when the path is not local or S3, but currently we only download from HuggingFace.
That is not a problem if all the model files are already in object storage.

You have also mentioned the performance issue when the model is distributed across several devices and nodes.
In that case each vLLM process reads the full model from object storage, which is not efficient.
We are now working to support loading in sharded mode, which will be most efficient since the entire model will be loaded only once.

@xqe2011

xqe2011 commented Feb 11, 2025

@noa-neria Well, I know why... We are using the fs backend of MinIO, so the s3:// URL must end with /. If it doesn't, the ListObjectsV2 API will not return the files.

@noa-neria

@xqe2011 we have an issue with the configuration for S3-compatible storage such as MinIO. It will be fixed soon; as a workaround, pass the endpoint URL via two environment variables:
AWS_ENDPOINT_URL=endpoint_url RUNAI_STREAMER_S3_ENDPOINT=endpoint_url

A URL that ends with / is supported
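
For reference, a minimal sketch of that workaround in the configuration file above (the MinIO endpoint URL is a hypothetical placeholder for your own deployment):

env:
- name: AWS_ACCESS_KEY_ID
  value: "Your_key_here"
- name: AWS_SECRET_ACCESS_KEY
  value: "Your_secret_here"
# Workaround for S3-compatible stores such as MinIO: point both the
# AWS SDK and the streamer at the same custom endpoint.
- name: AWS_ENDPOINT_URL
  value: "http://minio.example.svc:9000"   # placeholder endpoint
- name: RUNAI_STREAMER_S3_ENDPOINT
  value: "http://minio.example.svc:9000"   # placeholder endpoint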

@gaocegege
Collaborator

That’s awesome! It not only supports object storage but also speeds things up. Maybe we should add some info about it in the tutorial.
