Adding a llama.cpp LLM Component #1052

Open

wants to merge 43 commits into main
Changes from 11 commits

Commits (43)
397f7b8
First commit of llamacpp Opea component
edlee123 Dec 20, 2024
cb4f5e5
Removed unneeded requirements file
edlee123 Dec 20, 2024
df3d943
Merge branch 'main' into llamacpp
edlee123 Dec 20, 2024
8893f38
Merge branch 'main' into llamacpp
edlee123 Dec 28, 2024
2a48bae
Pin the llama.cpp server version, and fix small typo
edlee123 Jan 6, 2025
644ecce
Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp
edlee123 Jan 6, 2025
4e82152
Update README.md to describe hardware support, and provide reference.
edlee123 Jan 6, 2025
baf381d
Updated docker_compose_llm.yaml so that the llamacpp-server so the pu…
edlee123 Jan 6, 2025
7bab970
Merge branch 'main' into llamacpp
edlee123 Jan 6, 2025
e4f4b70
Merge branch 'main' into llamacpp
edlee123 Jan 7, 2025
9d7539d
Small adjustments to README.md
edlee123 Jan 7, 2025
2cf25e5
Merge branch 'main' into llamacpp
edlee123 Jan 8, 2025
fd15ee7
This removes unneeded dependencies in the Dockerfile, unneeded entryp…
edlee123 Jan 10, 2025
666196c
Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp
edlee123 Jan 10, 2025
104527a
Merge branch 'main' into llamacpp
edlee123 Jan 10, 2025
c931902
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 10, 2025
6b98403
Merge branch 'main' into llamacpp
edlee123 Jan 24, 2025
240d3d1
Merge branch 'main' into llamacpp
edlee123 Feb 3, 2025
91e0fd4
Merge branch 'main' into llamacpp
edlee123 Feb 14, 2025
a75d28d
Refactored llama cpp and text-generation README_llamacpp.md
edlee123 Feb 14, 2025
830da58
Delete unrefactored files
edlee123 Feb 14, 2025
8d058bb
Adding llama.cpp backend include in the compose_text-genearation.yaml
edlee123 Feb 14, 2025
a0294a5
Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp
edlee123 Feb 14, 2025
a6740b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 14, 2025
d0e27bf
Fix service name
edlee123 Feb 21, 2025
91324af
Revise llamacpp, using smaller Qwen model and remove unnecessary curl…
edlee123 Feb 21, 2025
f295e29
Update llamacpp thirdparty readme to use smaller model
edlee123 Feb 21, 2025
480cb69
Fix healthcheck in llamacpp deployment compose.yaml
edlee123 Feb 21, 2025
2c9f877
Wrote a test and tested for llamacpp text gen service
edlee123 Feb 21, 2025
f3147f1
Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp
edlee123 Feb 21, 2025
7310d6a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 21, 2025
80ed9b0
Merge branch 'main' into llamacpp
edlee123 Feb 21, 2025
efde309
Increase the llamacpp-server wait time
edlee123 Feb 21, 2025
1a7db52
Merge branch 'llamacpp' of github.com:edlee123/GenAIComps into llamacpp
edlee123 Feb 21, 2025
c474a64
Fixed typos on http environment variables, and volumes
edlee123 Feb 21, 2025
712f575
Splitting the llama.cpp test to use compose up on the llama.cpp third…
edlee123 Feb 21, 2025
68cc00f
add alternate command to stop and remove docker containers from previ…
edlee123 Feb 22, 2025
2dd2064
Modifying tear down of stop_docker in llamacpp tests to try to remove…
edlee123 Feb 22, 2025
dbff6fc
Adding some logs output to debug llamacpp test
edlee123 Feb 22, 2025
f184897
Found model path bug and fixed it to run llama.cpp test
edlee123 Feb 22, 2025
ea4ea38
Adjusted LLM_ENDPOINT env variable
edlee123 Feb 22, 2025
01fca03
Cleaned up test file
edlee123 Feb 22, 2025
dfd5057
Adjust host_ip env variable in scope of start_service
edlee123 Feb 22, 2025
27 changes: 27 additions & 0 deletions comps/llms/text-generation/llamacpp/Dockerfile
@@ -0,0 +1,27 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
curl \
libgl1-mesa-glx \
libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

# Assumes we're building from the GenAIComps directory.
COPY ../../../comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
pip install --no-cache-dir -r /home/user/comps/llms/text-generation/llamacpp/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/text-generation/llamacpp/

ENTRYPOINT ["bash", "entrypoint.sh"]
88 changes: 88 additions & 0 deletions comps/llms/text-generation/llamacpp/README.md
@@ -0,0 +1,88 @@
# Introduction

[llama.cpp](https://github.com/ggerganov/llama.cpp) provides inference in pure C/C++, and enables "LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud".

This OPEA component wraps the llama.cpp server so that it can interface with other OPEA components or be used to build OPEA Megaservices.

llama.cpp supports a wide range of [hardware](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#supported-backends); this OPEA component has only been tested on CPU.

To use a CUDA server, please refer to [this llama.cpp reference](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#docker) and modify docker_compose_llm.yaml accordingly.

## TLDR

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up
```

Please note that it is instructive to run and validate the llama.cpp server and the OPEA component individually, as described below.

## 1. Run the llama.cpp server

```bash
cd GenAIComps
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up llamacpp-server --force-recreate
```

Notes:

i) If you prefer to run the above in the background without screen output, use `up -d`. The `--force-recreate` flag clears the cache.

ii) To tear down the llama.cpp server and remove the container:

`docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml down llamacpp-server`

iii) To configure [llama.cpp settings](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md), specify them in the docker_compose_llm.yaml file.

#### Verify the llama.cpp Service:

```bash
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```
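
Alternatively, the server's OpenAI-compatible API can be exercised from Python. This is a minimal sketch, assuming the `openai` package is installed and the server from docker_compose_llm.yaml is reachable on localhost:8080:

```python
# Sketch: query the llama.cpp server through its OpenAI-compatible endpoint.
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    # llama.cpp serves the single model it was started with, so the name is not used for routing.
    model="phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Building a website can be done in 10 simple steps:"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```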

## 2. Run the llama.cpp OPEA Service

This is essentially a wrapper component around the llama.cpp server. OPEA standardizes and validates LLM inputs using the LLMParamsDoc class (see llm.py).
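
For illustration, a minimal sketch of the standardized input, using the LLMParamsDoc fields referenced in llm.py (any omitted fields fall back to their defaults):

```python
# Sketch: construct the standardized request that the wrapper validates.
from comps import LLMParamsDoc

params = LLMParamsDoc(
    query="What is Deep Learning?",
    max_tokens=32,
    top_p=0.95,
    temperature=0.01,
    streaming=False,
)
print(params)
```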

### 2.1 Build and run the llama.cpp OPEA service:

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml up llamacpp-opea-llm
```

Equivalently, the above can be achieved with `docker build` and `docker run` using the Dockerfile. Build:

```bash
cd GenAIComps/
docker build --no-cache -t opea/llm-llamacpp:latest \
--build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/llamacpp/Dockerfile .
```

And run:

```bash
docker run --network host -e http_proxy=$http_proxy -e https_proxy=$https_proxy \
opea/llm-llamacpp:latest
```

### 2.2 Consume the llama.cpp Microservice:

```bash
curl http://127.0.0.1:9000/v1/chat/completions -X POST \
-d '{"query":"What is Deep Learning?","max_tokens":32,"top_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'
```
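
Or equivalently from Python, as a minimal sketch assuming the service is reachable on localhost:9000 and the `requests` package is installed:

```python
# Sketch: call the OPEA llama.cpp microservice with the same payload as the curl command above.
import requests

payload = {
    "query": "What is Deep Learning?",
    "max_tokens": 32,
    "top_p": 0.95,
    "temperature": 0.01,
    "repetition_penalty": 1.03,
    "streaming": False,
}
response = requests.post(
    "http://127.0.0.1:9000/v1/chat/completions",
    json=payload,
    timeout=120,
)
print(response.json())
```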

### Notes

Tearing down services and removing containers:

```bash
cd GenAIComps/
docker compose -f comps/llms/text-generation/llamacpp/docker_compose_llm.yaml down
```
2 changes: 2 additions & 0 deletions comps/llms/text-generation/llamacpp/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
39 changes: 39 additions & 0 deletions comps/llms/text-generation/llamacpp/docker_compose_llm.yaml
@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
  llamacpp-server:
    image: ghcr.io/ggerganov/llama.cpp:server-b4419
    ports:
      - 8080:8080
    environment:
      # Refer to settings here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
      # Llama.cpp uses the .gguf format, and Hugging Face offers many .gguf models.
      LLAMA_ARG_MODEL_URL: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080

  llamacpp-opea-llm:
    image: opea/llm-llamacpp:latest
    build:
      # Set the build context so that the Dockerfile can COPY comps.
      # The context is the GenAIComps root, four levels up from docker_compose_llm.yaml.
      context: ../../../../
      dockerfile: ./comps/llms/text-generation/llamacpp/Dockerfile
    depends_on:
      - llamacpp-server
    ports:
      - "9000:9000"
    network_mode: "host" # equivalent to: docker run --network host ...
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      # LLAMACPP_ENDPOINT: ${LLAMACPP_ENDPOINT}
    restart: unless-stopped

networks:
  default:
    driver: bridge
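
The llama.cpp server downloads the model on first start, which can take a while. A small readiness check such as the sketch below can be handy before exercising the OPEA wrapper; it assumes the `/health` endpoint described in the llama.cpp server README:

```python
# Sketch: poll the llamacpp-server health endpoint until the model is loaded.
# Assumes the llama.cpp server's /health endpoint returns HTTP 200 when ready.
import time

import requests


def wait_for_llamacpp(url: str = "http://localhost:8080/health", timeout_s: int = 600) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("llama.cpp server is ready")
                return
        except requests.RequestException:
            pass  # server not accepting connections yet; keep polling
        time.sleep(5)
    raise TimeoutError(f"llama.cpp server not ready after {timeout_s}s")


if __name__ == "__main__":
    wait_for_llamacpp()
```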
8 changes: 8 additions & 0 deletions comps/llms/text-generation/llamacpp/entrypoint.sh
@@ -0,0 +1,8 @@
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
65 changes: 65 additions & 0 deletions comps/llms/text-generation/llamacpp/llm.py
@@ -0,0 +1,65 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

import openai
from fastapi.responses import StreamingResponse

from comps import CustomLogger, LLMParamsDoc, ServiceType, opea_microservices, register_microservice

logger = CustomLogger("llm_llamacpp")
logflag = os.getenv("LOGFLAG", False)
llamacpp_endpoint = os.getenv("LLAMACPP_ENDPOINT", "http://localhost:8080/")


# OPEA microservice wrapper of llama.cpp
# llama.cpp server uses openai API format: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
@register_microservice(
    name="opea_service@llm_llamacpp",
    service_type=ServiceType.LLM,
    endpoint="/v1/chat/completions",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
        logger.info(llamacpp_endpoint)

    client = openai.OpenAI(
        base_url=llamacpp_endpoint, api_key="sk-no-key-required"  # "http://<Your api-server IP>:port"
    )

    # Llama.cpp works with openai API format
    # The openai api doesn't have top_k parameter
    # https://community.openai.com/t/which-openai-gpt-models-if-any-allow-specifying-top-k/777982/2
    chat_completion = client.chat.completions.create(
        model=input.model,
        messages=[{"role": "user", "content": input.query}],
        max_tokens=input.max_tokens,
        temperature=input.temperature,
        top_p=input.top_p,
        frequency_penalty=input.frequency_penalty,
        presence_penalty=input.presence_penalty,
        stream=input.streaming,
    )

    if input.streaming:

        def stream_generator():
            for c in chat_completion:
                if logflag:
                    logger.info(c)
                yield f"data: {c.model_dump_json()}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        if logflag:
            logger.info(chat_completion)
        return chat_completion


if __name__ == "__main__":
    opea_microservices["opea_service@llm_llamacpp"].start()
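
For completeness, a minimal sketch of how a client might consume the streaming branch above, assuming the microservice is running on localhost:9000 and `requests` is installed:

```python
# Sketch: read the SSE stream produced by the streaming branch of llm_generate.
import json

import requests

payload = {"query": "What is Deep Learning?", "max_tokens": 64, "streaming": True}
with requests.post("http://127.0.0.1:9000/v1/chat/completions", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: ") :]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk is an OpenAI-format streaming chunk forwarded by the wrapper.
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)
```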
12 changes: 12 additions & 0 deletions comps/llms/text-generation/llamacpp/requirements.txt
@@ -0,0 +1,12 @@
aiohttp
docarray[full]
fastapi
huggingface_hub
openai
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn