diff --git a/LLMs/vllm/README.md b/LLMs/vllm/README.md new file mode 100644 index 0000000..7e0a627 --- /dev/null +++ b/LLMs/vllm/README.md @@ -0,0 +1,15 @@ +# vLLM for serving LLMs + +## Sample codes + +- [CPU offload](./samples/cpu_offload.py) +- [LoRA with Quantization](./samples/lora_with_quantization.py) +- [Multi-LoRA](./samples/multilora_inference.py) +- Offline inference + - [Audio-Language Inference](./samples/offline_inference_audio_language.py) + - [Tensor Parallel Inference](./samples/offline_inference_distributed.py) + - [LLM2Vec Embeddings](./samples/offline_inference_embedding.py) + - [Run Pixtral](./samples/offline_inference_pixtral.py) + - [Run with Profiler](./samples/offline_inference_with_profiler.py) + - [Speculative Decoding](./samples/offline_inference_speculator.py) + - [Vision-Language Multi-Image Inference](./samples/offline_inference_vision_language_multi_image.py) diff --git a/LLMs/vllm/logging_configuration.md b/LLMs/vllm/logging_configuration.md new file mode 100644 index 0000000..0d278b0 --- /dev/null +++ b/LLMs/vllm/logging_configuration.md @@ -0,0 +1,172 @@ +# Logging Configuration + +vLLM leverages Python's `logging.config.dictConfig` functionality to enable +robust and flexible configuration of the various loggers used by vLLM. + +vLLM offers two environment variables that can be used to accommodate a range +of logging configurations, from simple and inflexible to +more complex and more flexible. + +- No vLLM logging (simple and inflexible) + - Set `VLLM_CONFIGURE_LOGGING=0` (leaving `VLLM_LOGGING_CONFIG_PATH` unset) +- vLLM's default logging configuration (simple and inflexible) + - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` +- Fine-grained custom logging configuration (more complex, more flexible) + - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and + set `VLLM_LOGGING_CONFIG_PATH=<path-to-logging-config.json>` + + +## Logging Configuration Environment Variables + +### `VLLM_CONFIGURE_LOGGING` + +`VLLM_CONFIGURE_LOGGING` controls whether or not vLLM takes any action to +configure the loggers used by vLLM. This functionality is enabled by default, +but can be disabled by setting `VLLM_CONFIGURE_LOGGING=0` when running vLLM. + +If `VLLM_CONFIGURE_LOGGING` is enabled and no value is given for +`VLLM_LOGGING_CONFIG_PATH`, vLLM will use built-in default configuration to +configure the root vLLM logger. By default, no other vLLM loggers are +configured and, as such, all vLLM loggers defer to the root vLLM logger to make +all logging decisions. + +If `VLLM_CONFIGURE_LOGGING` is disabled and a value is given for +`VLLM_LOGGING_CONFIG_PATH`, an error will occur while starting vLLM. + +### `VLLM_LOGGING_CONFIG_PATH` + +`VLLM_LOGGING_CONFIG_PATH` allows users to specify a path to a JSON file of +alternative, custom logging configuration that will be used instead of vLLM's +built-in default logging configuration. The logging configuration should be +provided in JSON format following the schema specified by Python's [logging +configuration dictionary +schema](https://docs.python.org/3/library/logging.config.html#dictionary-schema-details). + +If `VLLM_LOGGING_CONFIG_PATH` is specified, but `VLLM_CONFIGURE_LOGGING` is +disabled, an error will occur while starting vLLM. + + +## Examples + +### Example 1: Customize vLLM root logger + +For this example, we will customize the vLLM root logger to use +[`python-json-logger`](https://github.com/madzak/python-json-logger) to log to +STDOUT of the console in JSON format with a log level of `INFO`.
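Since the file referenced by `VLLM_LOGGING_CONFIG_PATH` must follow Python's standard `logging.config.dictConfig` schema, the same configuration can be exercised directly in a plain Python session, which is a convenient way to sanity-check formatters and handlers before handing the file to vLLM. A minimal sketch (illustrative only; it assumes `python-json-logger` is installed and simply uses a logger literally named `vllm`):

```python
import logging
import logging.config

# Same shape as the JSON file created below, expressed as a Python dict and
# applied directly; this mirrors the dictConfig call vLLM performs at startup
# when VLLM_LOGGING_CONFIG_PATH points at a custom config.
LOGGING_CONFIG = {
    "version": 1,
    "formatters": {
        "json": {"class": "pythonjsonlogger.jsonlogger.JsonFormatter"}
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "json",
            "level": "INFO",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "INFO", "propagate": False}
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
logging.getLogger("vllm").info("logging configuration applied")  # one JSON record on STDOUT
```

The `vllm` entry under `loggers` is what targets the root vLLM logger (the logger named `vllm`) rather than Python's global root logger.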
+ +To begin, create an appropriate JSON logging configuration file: + +**/path/to/logging_config.json:** + +```json +{ + "formatters": { + "json": { + "class": "pythonjsonlogger.jsonlogger.JsonFormatter" + } + }, + "handlers": { + "console": { + "class": "logging.StreamHandler", + "formatter": "json", + "level": "INFO", + "stream": "ext://sys.stdout" + } + }, + "loggers": { + "vllm": { + "handlers": ["console"], + "level": "INFO", + "propagate": false + } + }, + "version": 1 +} +``` + +Next, install the `python-json-logger` package if it's not already installed: + +```bash +pip install python-json-logger +``` + +Finally, run vLLM with the `VLLM_LOGGING_CONFIG_PATH` environment variable set +to the path of the custom logging configuration JSON file: + +```bash +VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \ + vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048 +``` + + +### Example 2: Silence a particular vLLM logger + +To silence a particular vLLM logger, it is necessary to provide custom logging +configuration for the target logger that prevents it from propagating its log +messages to the root vLLM logger. + +When custom configuration is provided for any logger, it is also necessary to +provide configuration for the root vLLM logger since any custom logger +configuration overrides the built-in default logging configuration used by vLLM. + +First, create an appropriate JSON logging configuration file that includes +configuration for the root vLLM logger and for the logger you wish to silence: + +**/path/to/logging_config.json:** + +```json +{ + "formatters": { + "vllm": { + "class": "vllm.logging.NewLineFormatter", + "datefmt": "%m-%d %H:%M:%S", + "format": "%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s" + } + }, + "handlers": { + "vllm": { + "class": "logging.StreamHandler", + "formatter": "vllm", + "level": "INFO", + "stream": "ext://sys.stdout" + } + }, + "loggers": { + "vllm": { + "handlers": ["vllm"], + "level": "DEBUG", + "propagate": false + }, + "vllm.example_noisy_logger": { + "propagate": false + } + }, + "version": 1 +} +``` + +Finally, run vLLM with the `VLLM_LOGGING_CONFIG_PATH` environment variable set +to the path of the custom logging configuration JSON file: + +```bash +VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \ + vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048 +``` + + +### Example 3: Disable vLLM default logging configuration + +To disable vLLM's default logging configuration and silence all vLLM loggers, +simply set `VLLM_CONFIGURE_LOGGING=0` when running vLLM. This will prevent vLLM +from configuring the root vLLM logger, which in turn silences all other vLLM +loggers.
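The same switch works when vLLM is embedded in another Python process rather than launched with `vllm serve`. A hedged sketch (the model name and the host application's own logging setup are purely illustrative):

```python
import os

# Disable vLLM's logging configuration for an embedded run; the variable is
# read when vllm is imported, so it is set before the import below.
os.environ["VLLM_CONFIGURE_LOGGING"] = "0"

import logging

logging.basicConfig(level=logging.WARNING)  # the host application's own choice

from vllm import LLM, SamplingParams  # vLLM leaves the logging tree alone

llm = LLM(model="facebook/opt-125m")  # illustrative model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

When launching the server instead, exporting the variable on the command line, as shown below, achieves the same thing.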
+ +```bash +VLLM_CONFIGURE_LOGGING=0 \ + vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048 +``` + + +## Additional resources + +- [`logging.config` Dictionary Schema Details](https://docs.python.org/3/library/logging.config.html#dictionary-schema-details) diff --git a/LLMs/vllm/play_with_vllm.ipynb b/LLMs/vllm/play_with_vllm.ipynb new file mode 100644 index 0000000..5cead32 --- /dev/null +++ b/LLMs/vllm/play_with_vllm.ipynb @@ -0,0 +1 @@ +{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"nvidiaTeslaT4","dataSources":[{"sourceId":166368,"sourceType":"modelInstanceVersion","isSourceIdPinned":true,"modelInstanceId":141565,"modelId":164048}],"dockerImageVersionId":30823,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"!pip install torch triton","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:53:23.258035Z","iopub.execute_input":"2024-12-29T09:53:23.258318Z","iopub.status.idle":"2024-12-29T09:53:34.035084Z","shell.execute_reply.started":"2024-12-29T09:53:23.258290Z","shell.execute_reply":"2024-12-29T09:53:34.034248Z"}},"outputs":[{"name":"stdout","text":"Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.4.1+cu121)\nCollecting triton\n Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)\nRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\nRequirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\nRequirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.3)\nRequirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.3)\nRequirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\nRequirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.6.1)\nRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.5)\nRequirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)\nDownloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.5/209.5 MB\u001b[0m \u001b[31m8.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hInstalling collected packages: triton\nSuccessfully installed triton-3.1.0\n","output_type":"stream"}],"execution_count":1},{"cell_type":"code","source":"!pip install vllm","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:53:34.036093Z","iopub.execute_input":"2024-12-29T09:53:34.036387Z","iopub.status.idle":"2024-12-29T09:56:46.803684Z","shell.execute_reply.started":"2024-12-29T09:53:34.036359Z","shell.execute_reply":"2024-12-29T09:56:46.802650Z"}},"outputs":[{"name":"stdout","text":"Collecting vllm\n Downloading 
vllm-0.6.6.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (11 kB)\nRequirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from vllm) (5.9.5)\nRequirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from vllm) (0.2.0)\nRequirement already satisfied: numpy<2.0.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (1.26.4)\nRequirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.32.3)\nRequirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from vllm) (4.66.5)\nCollecting blake3 (from vllm)\n Downloading blake3-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)\nRequirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from vllm) (9.0.0)\nCollecting transformers>=4.45.2 (from vllm)\n Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.1/44.1 kB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.19.1)\nRequirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from vllm) (3.20.3)\nCollecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)\n Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)\nRequirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from vllm) (3.10.5)\nCollecting openai>=1.52.0 (from vllm)\n Downloading openai-1.58.1-py3-none-any.whl.metadata (27 kB)\nCollecting uvicorn[standard] (from vllm)\n Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)\nRequirement already satisfied: pydantic>=2.9 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.9.2)\nRequirement already satisfied: prometheus_client>=0.18.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.20.0)\nRequirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from vllm) (10.4.0)\nCollecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)\n Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)\nRequirement already satisfied: tiktoken>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.8.0)\nCollecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)\n Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)\nCollecting outlines==0.1.11 (from vllm)\n Downloading outlines-0.1.11-py3-none-any.whl.metadata (17 kB)\nCollecting lark==1.2.2 (from vllm)\n Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)\nCollecting xgrammar>=0.1.6 (from vllm)\n Downloading xgrammar-0.1.8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (2.0 kB)\nRequirement already satisfied: typing_extensions>=4.10 in /usr/local/lib/python3.10/dist-packages (from vllm) (4.12.2)\nRequirement already satisfied: filelock>=3.16.1 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.16.1)\nCollecting partial-json-parser (from vllm)\n Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl.metadata (6.2 kB)\nRequirement already satisfied: pyzmq in /usr/local/lib/python3.10/dist-packages (from vllm) (24.0.1)\nCollecting msgspec (from vllm)\n Downloading msgspec-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)\nCollecting gguf==0.10.0 (from vllm)\n Downloading gguf-0.10.0-py3-none-any.whl.metadata (3.5 
kB)\nRequirement already satisfied: importlib_metadata in /usr/local/lib/python3.10/dist-packages (from vllm) (8.5.0)\nCollecting mistral_common>=1.5.0 (from mistral_common[opencv]>=1.5.0->vllm)\n Downloading mistral_common-1.5.1-py3-none-any.whl.metadata (4.6 kB)\nRequirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from vllm) (6.0.2)\nRequirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (from vllm) (0.8.0)\nCollecting compressed-tensors==0.8.1 (from vllm)\n Downloading compressed_tensors-0.8.1-py3-none-any.whl.metadata (6.8 kB)\nCollecting depyf==0.18.0 (from vllm)\n Downloading depyf-0.18.0-py3-none-any.whl.metadata (7.1 kB)\nRequirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from vllm) (3.1.0)\nRequirement already satisfied: ray>=2.9 in /usr/local/lib/python3.10/dist-packages (from ray[default]>=2.9->vllm) (2.40.0)\nCollecting nvidia-ml-py>=12.560.30 (from vllm)\n Downloading nvidia_ml_py-12.560.30-py3-none-any.whl.metadata (8.6 kB)\nCollecting torch==2.5.1 (from vllm)\n Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)\nCollecting torchvision==0.20.1 (from vllm)\n Downloading torchvision-0.20.1-cp310-cp310-manylinux1_x86_64.whl.metadata (6.1 kB)\nCollecting xformers==0.0.28.post3 (from vllm)\n Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)\nCollecting astor (from depyf==0.18.0->vllm)\n Downloading astor-0.8.1-py2.py3-none-any.whl.metadata (4.2 kB)\nRequirement already satisfied: dill in /usr/local/lib/python3.10/dist-packages (from depyf==0.18.0->vllm) (0.3.8)\nCollecting interegular (from outlines==0.1.11->vllm)\n Downloading interegular-0.3.3-py37-none-any.whl.metadata (3.0 kB)\nRequirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from outlines==0.1.11->vllm) (3.1.4)\nRequirement already satisfied: nest_asyncio in /usr/local/lib/python3.10/dist-packages (from outlines==0.1.11->vllm) (1.6.0)\nCollecting diskcache (from outlines==0.1.11->vllm)\n Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)\nRequirement already satisfied: referencing in /usr/local/lib/python3.10/dist-packages (from outlines==0.1.11->vllm) (0.35.1)\nRequirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from outlines==0.1.11->vllm) (4.23.0)\nCollecting pycountry (from outlines==0.1.11->vllm)\n Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)\nCollecting airportsdata (from outlines==0.1.11->vllm)\n Downloading airportsdata-20241001-py3-none-any.whl.metadata (8.9 kB)\nCollecting outlines_core==0.1.26 (from outlines==0.1.11->vllm)\n Downloading outlines_core-0.1.26-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)\nRequirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.5.1->vllm) (3.3)\nRequirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.5.1->vllm) (2024.6.1)\nCollecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.5.1->vllm)\n Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nCollecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.5.1->vllm)\n Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nCollecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.5.1->vllm)\n Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 
kB)\nCollecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.5.1->vllm)\n Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\nCollecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.5.1->vllm)\n Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nCollecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.5.1->vllm)\n Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nCollecting nvidia-curand-cu12==10.3.5.147 (from torch==2.5.1->vllm)\n Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nCollecting nvidia-cusolver-cu12==11.6.1.9 (from torch==2.5.1->vllm)\n Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\nCollecting nvidia-cusparse-cu12==12.3.1.170 (from torch==2.5.1->vllm)\n Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\nCollecting nvidia-nccl-cu12==2.21.5 (from torch==2.5.1->vllm)\n Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)\nCollecting nvidia-nvtx-cu12==12.4.127 (from torch==2.5.1->vllm)\n Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.7 kB)\nCollecting nvidia-nvjitlink-cu12==12.4.127 (from torch==2.5.1->vllm)\n Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\nRequirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.5.1->vllm) (3.1.0)\nCollecting sympy==1.13.1 (from torch==2.5.1->vllm)\n Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)\nRequirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch==2.5.1->vllm) (1.3.0)\nCollecting starlette<0.42.0,>=0.40.0 (from fastapi!=0.113.*,!=0.114.0,>=0.107.0->vllm)\n Downloading starlette-0.41.3-py3-none-any.whl.metadata (6.0 kB)\nRequirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer<0.11,>=0.10.9->vllm) (24.1)\nCollecting tiktoken>=0.6.0 (from vllm)\n Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\nRequirement already satisfied: opencv-python-headless<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from mistral_common[opencv]>=1.5.0->vllm) (4.10.0.84)\nRequirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai>=1.52.0->vllm) (3.7.1)\nRequirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai>=1.52.0->vllm) (1.7.0)\nCollecting httpx<1,>=0.23.0 (from openai>=1.52.0->vllm)\n Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)\nCollecting jiter<1,>=0.4.0 (from openai>=1.52.0->vllm)\n Downloading jiter-0.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)\nRequirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai>=1.52.0->vllm) (1.3.1)\nRequirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.9->vllm) (0.7.0)\nRequirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.9->vllm) (2.23.4)\nRequirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->ray[default]>=2.9->vllm) (8.1.7)\nRequirement already satisfied: msgpack<2.0.0,>=1.0.0 in 
/usr/local/lib/python3.10/dist-packages (from ray>=2.9->ray[default]>=2.9->vllm) (1.0.8)\nRequirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->ray[default]>=2.9->vllm) (1.3.1)\nRequirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->ray[default]>=2.9->vllm) (1.4.1)\nCollecting aiohttp-cors (from ray[default]>=2.9->vllm)\n Downloading aiohttp_cors-0.7.0-py3-none-any.whl.metadata (20 kB)\nCollecting colorful (from ray[default]>=2.9->vllm)\n Downloading colorful-0.5.6-py2.py3-none-any.whl.metadata (16 kB)\nCollecting py-spy>=0.2.0 (from ray[default]>=2.9->vllm)\n Downloading py_spy-0.4.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (16 kB)\nCollecting opencensus (from ray[default]>=2.9->vllm)\n Downloading opencensus-0.11.4-py2.py3-none-any.whl.metadata (12 kB)\nRequirement already satisfied: smart-open in /usr/local/lib/python3.10/dist-packages (from ray[default]>=2.9->vllm) (7.0.4)\nCollecting virtualenv!=20.21.1,>=20.0.24 (from ray[default]>=2.9->vllm)\n Downloading virtualenv-20.28.0-py3-none-any.whl.metadata (4.4 kB)\nRequirement already satisfied: grpcio>=1.42.0 in /usr/local/lib/python3.10/dist-packages (from ray[default]>=2.9->vllm) (1.64.1)\nCollecting memray (from ray[default]>=2.9->vllm)\n Downloading memray-1.15.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (19 kB)\nRequirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm) (2.4.0)\nRequirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm) (24.2.0)\nRequirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm) (6.1.0)\nRequirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm) (1.11.1)\nRequirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->vllm) (4.0.3)\nRequirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->vllm) (3.3.2)\nRequirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->vllm) (3.10)\nRequirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->vllm) (2.2.3)\nRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->vllm) (2024.8.30)\nRequirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken>=0.6.0->vllm) (2024.9.11)\nRequirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.19.1->vllm) (0.24.7)\nCollecting tokenizers>=0.19.1 (from vllm)\n Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\nRequirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.45.2->vllm) (0.4.5)\nRequirement already satisfied: pybind11 in /usr/local/lib/python3.10/dist-packages (from xgrammar>=0.1.6->vllm) (2.13.6)\nRequirement already satisfied: pytest in /usr/local/lib/python3.10/dist-packages (from xgrammar>=0.1.6->vllm) (7.4.4)\nRequirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib_metadata->vllm) (3.20.2)\nCollecting h11>=0.8 (from 
uvicorn[standard]->vllm)\n Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)\nCollecting httptools>=0.6.3 (from uvicorn[standard]->vllm)\n Downloading httptools-0.6.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)\nCollecting python-dotenv>=0.13 (from uvicorn[standard]->vllm)\n Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)\nCollecting uvloop!=0.15.0,!=0.15.1,>=0.14.0 (from uvicorn[standard]->vllm)\n Downloading uvloop-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\nCollecting watchfiles>=0.13 (from uvicorn[standard]->vllm)\n Downloading watchfiles-1.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\nRequirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (14.1)\nRequirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai>=1.52.0->vllm) (1.2.2)\nCollecting httpcore==1.* (from httpx<1,>=0.23.0->openai>=1.52.0->vllm)\n Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)\nRequirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.1.11->vllm) (2023.12.1)\nRequirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.1.11->vllm) (0.20.0)\nCollecting distlib<1,>=0.3.7 (from virtualenv!=20.21.1,>=20.0.24->ray[default]>=2.9->vllm)\n Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)\nRequirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv!=20.21.1,>=20.0.24->ray[default]>=2.9->vllm) (4.3.6)\nRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->outlines==0.1.11->vllm) (2.1.5)\nRequirement already satisfied: rich>=11.2.0 in /usr/local/lib/python3.10/dist-packages (from memray->ray[default]>=2.9->vllm) (13.8.1)\nCollecting textual>=0.41.0 (from memray->ray[default]>=2.9->vllm)\n Downloading textual-1.0.0-py3-none-any.whl.metadata (9.0 kB)\nCollecting opencensus-context>=0.1.3 (from opencensus->ray[default]>=2.9->vllm)\n Downloading opencensus_context-0.1.3-py2.py3-none-any.whl.metadata (3.3 kB)\nRequirement already satisfied: six~=1.16 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[default]>=2.9->vllm) (1.16.0)\nRequirement already satisfied: google-api-core<3.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from opencensus->ray[default]>=2.9->vllm) (1.34.1)\nRequirement already satisfied: iniconfig in /usr/local/lib/python3.10/dist-packages (from pytest->xgrammar>=0.1.6->vllm) (2.0.0)\nRequirement already satisfied: pluggy<2.0,>=0.12 in /usr/local/lib/python3.10/dist-packages (from pytest->xgrammar>=0.1.6->vllm) (1.5.0)\nRequirement already satisfied: tomli>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pytest->xgrammar>=0.1.6->vllm) (2.0.1)\nRequirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from smart-open->ray[default]>=2.9->vllm) (1.16.0)\nRequirement already satisfied: googleapis-common-protos<2.0dev,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (1.65.0)\nRequirement already satisfied: google-auth<3.0dev,>=1.25.0 in /usr/local/lib/python3.10/dist-packages (from 
google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (2.27.0)\nRequirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[default]>=2.9->vllm) (3.0.0)\nRequirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=11.2.0->memray->ray[default]>=2.9->vllm) (2.18.0)\nRequirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0dev,>=1.25.0->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (5.5.0)\nRequirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0dev,>=1.25.0->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (0.4.1)\nRequirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0dev,>=1.25.0->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (4.9)\nRequirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=11.2.0->memray->ray[default]>=2.9->vllm) (0.1.2)\nRequirement already satisfied: linkify-it-py<3,>=1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[default]>=2.9->vllm) (2.0.3)\nRequirement already satisfied: mdit-py-plugins in /usr/local/lib/python3.10/dist-packages (from markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[default]>=2.9->vllm) (0.4.2)\nRequirement already satisfied: uc-micro-py in /usr/local/lib/python3.10/dist-packages (from linkify-it-py<3,>=1->markdown-it-py[linkify,plugins]>=2.1.0->textual>=0.41.0->memray->ray[default]>=2.9->vllm) (1.0.3)\nRequirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3.0dev,>=1.25.0->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=2.9->vllm) (0.6.1)\nDownloading vllm-0.6.6.post1-cp38-abi3-manylinux1_x86_64.whl (201.1 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m201.1/201.1 MB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading compressed_tensors-0.8.1-py3-none-any.whl (87 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m87.5/87.5 kB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading depyf-0.18.0-py3-none-any.whl (38 kB)\nDownloading gguf-0.10.0-py3-none-any.whl (71 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m71.6/71.6 kB\u001b[0m \u001b[31m4.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading lark-1.2.2-py3-none-any.whl (111 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.0/111.0 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading outlines-0.1.11-py3-none-any.whl (87 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m87.6/87.6 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl (906.4 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m906.4/906.4 MB\u001b[0m \u001b[31m1.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading 
torchvision-0.20.1-cp310-cp310-manylinux1_x86_64.whl (7.2 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.2/7.2 MB\u001b[0m \u001b[31m107.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl (16.7 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m16.7/16.7 MB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m1.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.8/13.8 MB\u001b[0m \u001b[31m100.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m0:01\u001b[0m\n\u001b[?25hDownloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.6/24.6 MB\u001b[0m \u001b[31m77.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m883.7/883.7 kB\u001b[0m \u001b[31m40.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m31.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m13.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m8.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m188.7/188.7 MB\u001b[0m \u001b[31m9.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m0:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)\n\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m84.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m99.1/99.1 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading outlines_core-0.1.26-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (343 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m343.6/343.6 kB\u001b[0m \u001b[31m22.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading sympy-1.13.1-py3-none-any.whl (6.2 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m102.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading fastapi-0.115.6-py3-none-any.whl (94 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m94.8/94.8 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading lm_format_enforcer-0.10.9-py3-none-any.whl (43 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m43.9/43.9 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading mistral_common-1.5.1-py3-none-any.whl (6.5 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m97.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[?25hDownloading nvidia_ml_py-12.560.30-py3-none-any.whl (40 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.5/40.5 kB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading openai-1.58.1-py3-none-any.whl (454 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.3/454.3 kB\u001b[0m \u001b[31m23.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl (19 kB)\nDownloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m53.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading transformers-4.47.1-py3-none-any.whl (10.1 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.1/10.1 MB\u001b[0m \u001b[31m115.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m0:01\u001b[0m\n\u001b[?25hDownloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m82.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[?25hDownloading xgrammar-0.1.8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (339 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m339.8/339.8 kB\u001b[0m \u001b[31m21.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading blake3-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (367 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m367.3/367.3 kB\u001b[0m \u001b[31m21.0 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading msgspec-0.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (211 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.6/211.6 kB\u001b[0m \u001b[31m14.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl (9.9 kB)\nDownloading h11-0.14.0-py3-none-any.whl (58 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading httptools-0.6.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (442 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m442.1/442.1 kB\u001b[0m \u001b[31m30.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading httpx-0.28.1-py3-none-any.whl (73 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.5/73.5 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading httpcore-1.0.7-py3-none-any.whl (78 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.6/78.6 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading interegular-0.3.3-py37-none-any.whl (23 kB)\nDownloading jiter-0.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (345 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m345.0/345.0 kB\u001b[0m \u001b[31m23.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading py_spy-0.4.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.7 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.7/2.7 MB\u001b[0m \u001b[31m85.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)\nDownloading starlette-0.41.3-py3-none-any.whl (73 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.2/73.2 kB\u001b[0m \u001b[31m5.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading uvloop-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m84.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[?25hDownloading virtualenv-20.28.0-py3-none-any.whl (4.3 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/4.3 MB\u001b[0m \u001b[31m94.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[?25hDownloading watchfiles-1.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (443 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m443.8/443.8 kB\u001b[0m \u001b[31m28.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading aiohttp_cors-0.7.0-py3-none-any.whl (27 kB)\nDownloading airportsdata-20241001-py3-none-any.whl (912 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m912.7/912.7 kB\u001b[0m \u001b[31m46.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading astor-0.8.1-py2.py3-none-any.whl (27 kB)\nDownloading colorful-0.5.6-py2.py3-none-any.whl (201 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m201.4/201.4 kB\u001b[0m 
\u001b[31m15.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading diskcache-5.6.3-py3-none-any.whl (45 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.5/45.5 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading memray-1.15.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (8.3 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.3/8.3 MB\u001b[0m \u001b[31m103.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading opencensus-0.11.4-py2.py3-none-any.whl (128 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m128.2/128.2 kB\u001b[0m \u001b[31m9.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m101.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading uvicorn-0.34.0-py3-none-any.whl (62 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m62.3/62.3 kB\u001b[0m \u001b[31m4.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m469.0/469.0 kB\u001b[0m \u001b[31m29.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading opencensus_context-0.1.3-py2.py3-none-any.whl (5.1 kB)\nDownloading textual-1.0.0-py3-none-any.whl (660 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m660.5/660.5 kB\u001b[0m \u001b[31m37.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hInstalling collected packages: py-spy, opencensus-context, nvidia-ml-py, distlib, colorful, blake3, virtualenv, uvloop, sympy, python-dotenv, pycountry, partial-json-parser, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, msgspec, lark, jiter, interegular, httptools, h11, gguf, diskcache, astor, airportsdata, watchfiles, uvicorn, tiktoken, starlette, nvidia-cusparse-cu12, nvidia-cudnn-cu12, httpcore, depyf, tokenizers, prometheus-fastapi-instrumentator, nvidia-cusolver-cu12, lm-format-enforcer, httpx, fastapi, transformers, torch, textual, outlines_core, opencensus, openai, mistral_common, aiohttp-cors, xgrammar, xformers, torchvision, outlines, memray, compressed-tensors, vllm\n Attempting uninstall: sympy\n Found existing installation: sympy 1.13.3\n Uninstalling sympy-1.13.3:\n Successfully uninstalled sympy-1.13.3\n Attempting uninstall: nvidia-nccl-cu12\n Found existing installation: nvidia-nccl-cu12 2.23.4\n Uninstalling nvidia-nccl-cu12-2.23.4:\n Successfully uninstalled nvidia-nccl-cu12-2.23.4\n Attempting uninstall: tiktoken\n Found existing installation: tiktoken 0.8.0\n Uninstalling tiktoken-0.8.0:\n Successfully uninstalled tiktoken-0.8.0\n Attempting uninstall: tokenizers\n Found existing installation: tokenizers 0.19.1\n Uninstalling tokenizers-0.19.1:\n Successfully uninstalled tokenizers-0.19.1\n Attempting uninstall: transformers\n Found existing installation: transformers 4.44.2\n Uninstalling transformers-4.44.2:\n Successfully uninstalled transformers-4.44.2\n Attempting uninstall: torch\n Found existing installation: torch 
2.4.1+cu121\n Uninstalling torch-2.4.1+cu121:\n Successfully uninstalled torch-2.4.1+cu121\n Attempting uninstall: torchvision\n Found existing installation: torchvision 0.19.1+cu121\n Uninstalling torchvision-0.19.1+cu121:\n Successfully uninstalled torchvision-0.19.1+cu121\n\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\nfastai 2.7.17 requires torch<2.5,>=1.10, but you have torch 2.5.1 which is incompatible.\ntorchaudio 2.4.1+cu121 requires torch==2.4.1, but you have torch 2.5.1 which is incompatible.\u001b[0m\u001b[31m\n\u001b[0mSuccessfully installed aiohttp-cors-0.7.0 airportsdata-20241001 astor-0.8.1 blake3-1.0.0 colorful-0.5.6 compressed-tensors-0.8.1 depyf-0.18.0 diskcache-5.6.3 distlib-0.3.9 fastapi-0.115.6 gguf-0.10.0 h11-0.14.0 httpcore-1.0.7 httptools-0.6.4 httpx-0.28.1 interegular-0.3.3 jiter-0.8.2 lark-1.2.2 lm-format-enforcer-0.10.9 memray-1.15.0 mistral_common-1.5.1 msgspec-0.19.0 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-ml-py-12.560.30 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 openai-1.58.1 opencensus-0.11.4 opencensus-context-0.1.3 outlines-0.1.11 outlines_core-0.1.26 partial-json-parser-0.2.1.1.post4 prometheus-fastapi-instrumentator-7.0.0 py-spy-0.4.0 pycountry-24.6.1 python-dotenv-1.0.1 starlette-0.41.3 sympy-1.13.1 textual-1.0.0 tiktoken-0.7.0 tokenizers-0.21.0 torch-2.5.1 torchvision-0.20.1 transformers-4.47.1 uvicorn-0.34.0 uvloop-0.21.0 virtualenv-20.28.0 vllm-0.6.6.post1 watchfiles-1.0.3 xformers-0.0.28.post3 xgrammar-0.1.8\n","output_type":"stream"}],"execution_count":2},{"cell_type":"code","source":"!pip install transformers peft accelerate","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:56:46.804728Z","iopub.execute_input":"2024-12-29T09:56:46.805048Z","iopub.status.idle":"2024-12-29T09:56:56.390170Z","shell.execute_reply.started":"2024-12-29T09:56:46.805017Z","shell.execute_reply":"2024-12-29T09:56:56.389257Z"}},"outputs":[{"name":"stdout","text":"Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.47.1)\nCollecting peft\n Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)\nRequirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (0.34.2)\nRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)\nRequirement already satisfied: huggingface-hub<1.0,>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.24.7)\nRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)\nRequirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)\nRequirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)\nRequirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.9.11)\nRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)\nRequirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.10/dist-packages 
(from transformers) (0.21.0)\nRequirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)\nRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.5)\nRequirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from peft) (5.9.5)\nRequirement already satisfied: torch>=1.13.0 in /usr/local/lib/python3.10/dist-packages (from peft) (2.5.1)\nCollecting huggingface-hub<1.0,>=0.24.0 (from transformers)\n Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)\nRequirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (2024.6.1)\nRequirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.24.0->transformers) (4.12.2)\nRequirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (3.3)\nRequirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (3.1.4)\nRequirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.127)\nRequirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.127)\nRequirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.127)\nRequirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (9.1.0.70)\nRequirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.5.8)\nRequirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (11.2.1.3)\nRequirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (10.3.5.147)\nRequirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (11.6.1.9)\nRequirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.3.1.170)\nRequirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (2.21.5)\nRequirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.127)\nRequirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (12.4.127)\nRequirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (3.1.0)\nRequirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->peft) (1.13.1)\nRequirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch>=1.13.0->peft) (1.3.0)\nRequirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\nRequirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)\nRequirement 
already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)\nRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.8.30)\nRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.13.0->peft) (2.1.5)\nDownloading peft-0.14.0-py3-none-any.whl (374 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m374.8/374.8 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n\u001b[?25hDownloading huggingface_hub-0.27.0-py3-none-any.whl (450 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m450.5/450.5 kB\u001b[0m \u001b[31m19.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hInstalling collected packages: huggingface-hub, peft\n Attempting uninstall: huggingface-hub\n Found existing installation: huggingface-hub 0.24.7\n Uninstalling huggingface-hub-0.24.7:\n Successfully uninstalled huggingface-hub-0.24.7\nSuccessfully installed huggingface-hub-0.27.0 peft-0.14.0\n","output_type":"stream"}],"execution_count":3},{"cell_type":"code","source":"!pip install bitsandbytes optimum auto-gptq","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:56:56.394074Z","iopub.execute_input":"2024-12-29T09:56:56.394332Z","iopub.status.idle":"2024-12-29T09:57:04.830606Z","shell.execute_reply.started":"2024-12-29T09:56:56.394310Z","shell.execute_reply":"2024-12-29T09:57:04.829753Z"}},"outputs":[{"name":"stdout","text":"Collecting bitsandbytes\n Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)\nCollecting optimum\n Downloading optimum-1.23.3-py3-none-any.whl.metadata (20 kB)\nCollecting auto-gptq\n Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)\nRequirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from bitsandbytes) (2.5.1)\nRequirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from bitsandbytes) (1.26.4)\nRequirement already satisfied: typing_extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from bitsandbytes) (4.12.2)\nCollecting coloredlogs (from optimum)\n Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)\nRequirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from optimum) (1.13.1)\nRequirement already satisfied: transformers>=4.29 in /usr/local/lib/python3.10/dist-packages (from optimum) (4.47.1)\nRequirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from optimum) (24.1)\nRequirement already satisfied: huggingface-hub>=0.8.0 in /usr/local/lib/python3.10/dist-packages (from optimum) (0.27.0)\nRequirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (from optimum) (3.2.0)\nRequirement already satisfied: accelerate>=0.26.0 in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.34.2)\nRequirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.2.0)\nCollecting rouge (from auto-gptq)\n Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)\nCollecting gekko (from auto-gptq)\n Downloading gekko-1.2.1-py3-none-any.whl.metadata (3.0 kB)\nRequirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.4.5)\nRequirement already satisfied: peft>=0.5.0 in 
/usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.14.0)\nRequirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (4.66.5)\nRequirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.0->auto-gptq) (5.9.5)\nRequirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.0->auto-gptq) (6.0.2)\nRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.8.0->optimum) (3.16.1)\nRequirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.8.0->optimum) (2024.6.1)\nRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.8.0->optimum) (2.32.3)\nRequirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (3.3)\nRequirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (3.1.4)\nRequirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.127)\nRequirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.127)\nRequirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.127)\nRequirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (9.1.0.70)\nRequirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.5.8)\nRequirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (11.2.1.3)\nRequirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (10.3.5.147)\nRequirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (11.6.1.9)\nRequirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.3.1.170)\nRequirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (2.21.5)\nRequirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.127)\nRequirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (12.4.127)\nRequirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->bitsandbytes) (3.1.0)\nRequirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->optimum) (1.3.0)\nRequirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.29->optimum) (2024.9.11)\nRequirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.29->optimum) (0.21.0)\nCollecting humanfriendly>=9.1 (from coloredlogs->optimum)\n Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)\nRequirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from 
datasets->optimum) (18.1.0)\nRequirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets->optimum) (0.3.8)\nRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->optimum) (2.1.4)\nRequirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets->optimum) (3.5.0)\nRequirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from datasets->optimum) (0.70.16)\nRequirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets->optimum) (3.10.5)\nRequirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from rouge->auto-gptq) (1.16.0)\nRequirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (2.4.0)\nRequirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (1.3.1)\nRequirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (24.2.0)\nRequirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (1.4.1)\nRequirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (6.1.0)\nRequirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (1.11.1)\nRequirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->optimum) (4.0.3)\nRequirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.8.0->optimum) (3.3.2)\nRequirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.8.0->optimum) (3.10)\nRequirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.8.0->optimum) (2.2.3)\nRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.8.0->optimum) (2024.8.30)\nRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->bitsandbytes) (2.1.5)\nRequirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->optimum) (2.8.2)\nRequirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->optimum) (2024.2)\nRequirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->optimum) (2024.1)\nDownloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m69.1/69.1 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading optimum-1.23.3-py3-none-any.whl (424 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m424.1/424.1 kB\u001b[0m \u001b[31m24.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m23.5/23.5 MB\u001b[0m 
\u001b[31m78.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m2.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hDownloading gekko-1.2.1-py3-none-any.whl (13.2 MB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.2/13.2 MB\u001b[0m \u001b[31m85.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n\u001b[?25hDownloading rouge-1.0.1-py3-none-any.whl (13 kB)\nDownloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)\n\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hInstalling collected packages: rouge, humanfriendly, gekko, coloredlogs, bitsandbytes, optimum, auto-gptq\nSuccessfully installed auto-gptq-0.7.1 bitsandbytes-0.45.0 coloredlogs-15.0.1 gekko-1.2.1 humanfriendly-10.0 optimum-1.23.3 rouge-1.0.1\n","output_type":"stream"}],"execution_count":4},{"cell_type":"code","source":"!pip install logits-processor-zoo","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:05:05.453464Z","iopub.execute_input":"2024-12-29T09:05:05.453781Z","iopub.status.idle":"2024-12-29T09:05:09.065360Z","shell.execute_reply.started":"2024-12-29T09:05:05.453754Z","shell.execute_reply":"2024-12-29T09:05:09.064541Z"}},"outputs":[{"name":"stdout","text":"Collecting logits-processor-zoo\n Downloading logits_processor_zoo-0.1.2-py3-none-any.whl.metadata (3.4 kB)\nRequirement already satisfied: accelerate>=0.26.1 in /usr/local/lib/python3.10/dist-packages (from logits-processor-zoo) (0.34.2)\nRequirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from logits-processor-zoo) (2.5.1)\nRequirement already satisfied: transformers>=4.41.2 in /usr/local/lib/python3.10/dist-packages (from logits-processor-zoo) (4.47.1)\nRequirement already satisfied: numpy<3.0.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (1.26.4)\nRequirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (24.1)\nRequirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (5.9.5)\nRequirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (6.0.2)\nRequirement already satisfied: huggingface-hub>=0.21.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (0.27.0)\nRequirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.26.1->logits-processor-zoo) (0.4.5)\nRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (3.16.1)\nRequirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (4.12.2)\nRequirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (3.3)\nRequirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (3.1.4)\nRequirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from 
torch->logits-processor-zoo) (2024.6.1)\nRequirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.127)\nRequirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.127)\nRequirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.127)\nRequirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (9.1.0.70)\nRequirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.5.8)\nRequirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (11.2.1.3)\nRequirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (10.3.5.147)\nRequirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (11.6.1.9)\nRequirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.3.1.170)\nRequirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (2.21.5)\nRequirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.127)\nRequirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (12.4.127)\nRequirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (3.1.0)\nRequirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch->logits-processor-zoo) (1.13.1)\nRequirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch->logits-processor-zoo) (1.3.0)\nRequirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.41.2->logits-processor-zoo) (2024.9.11)\nRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.41.2->logits-processor-zoo) (2.32.3)\nRequirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.41.2->logits-processor-zoo) (0.21.0)\nRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.41.2->logits-processor-zoo) (4.66.5)\nRequirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->logits-processor-zoo) (2.1.5)\nRequirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.41.2->logits-processor-zoo) (3.3.2)\nRequirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.41.2->logits-processor-zoo) (3.10)\nRequirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.41.2->logits-processor-zoo) (2.2.3)\nRequirement already satisfied: certifi>=2017.4.17 
in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.41.2->logits-processor-zoo) (2024.8.30)\nDownloading logits_processor_zoo-0.1.2-py3-none-any.whl (26 kB)\nInstalling collected packages: logits-processor-zoo\nSuccessfully installed logits-processor-zoo-0.1.2\n","output_type":"stream"}],"execution_count":5},{"cell_type":"code","source":"import vllm\nimport numpy as np\nimport pandas as pd\nfrom transformers import PreTrainedTokenizer, AutoTokenizer\nfrom typing import List\nimport torch\n# from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor\nimport re","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:58:03.287184Z","iopub.execute_input":"2024-12-29T09:58:03.287551Z","iopub.status.idle":"2024-12-29T09:58:03.292651Z","shell.execute_reply.started":"2024-12-29T09:58:03.287518Z","shell.execute_reply":"2024-12-29T09:58:03.291729Z"}},"outputs":[],"execution_count":7},{"cell_type":"code","source":"def apply_template(query, tokenizer):\n messages = [\n {\n \"role\": \"system\", \n \"content\": \"As a helpful assistant, please provide suitable answer to the user query.\"\n },\n {\n \"role\": \"user\", \n \"content\": query,\n }\n ]\n text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n return text","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:58:07.269350Z","iopub.execute_input":"2024-12-29T09:58:07.269635Z","iopub.status.idle":"2024-12-29T09:58:07.273995Z","shell.execute_reply.started":"2024-12-29T09:58:07.269612Z","shell.execute_reply":"2024-12-29T09:58:07.272959Z"}},"outputs":[],"execution_count":8},{"cell_type":"code","source":"model_path = \"/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1\"\n\nllm = vllm.LLM(\n model_path,\n quantization=\"awq\",\n tensor_parallel_size=2,\n gpu_memory_utilization=0.90,\n trust_remote_code=True,\n dtype=\"half\", \n enforce_eager=True,\n max_model_len=5120,\n disable_log_stats=True\n)\ntokenizer = llm.get_tokenizer()","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T09:58:09.057794Z","iopub.execute_input":"2024-12-29T09:58:09.058062Z","iopub.status.idle":"2024-12-29T10:02:37.731300Z","shell.execute_reply.started":"2024-12-29T09:58:09.058042Z","shell.execute_reply":"2024-12-29T10:02:37.730260Z"}},"outputs":[{"name":"stdout","text":"INFO 12-29 09:58:19 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'score', 'classify'}. Defaulting to 'generate'.\nWARNING 12-29 09:58:20 config.py:588] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.\nINFO 12-29 09:58:20 config.py:1310] Defaulting to use mp for distributed inference\nWARNING 12-29 09:58:20 cuda.py:98] To see benefits of async output processing, enable CUDA graph. 
Since, enforce-eager is enabled, async output processor cannot be used\nWARNING 12-29 09:58:20 config.py:642] Async output processing is not supported on the current platform type cuda.\nINFO 12-29 09:58:20 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1', speculative_config=None, tokenizer='/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=5120, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={\"splitting_ops\":[\"vllm.unified_attention\",\"vllm.unified_attention_with_output\"],\"candidate_compile_sizes\":[],\"compile_sizes\":[],\"capture_sizes\":[],\"max_capture_size\":0}, use_cached_outputs=False, \nWARNING 12-29 09:58:21 multiproc_worker_utils.py:312] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.\nINFO 12-29 09:58:21 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager\nINFO 12-29 09:58:21 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.\nINFO 12-29 09:58:21 selector.py:129] Using XFormers backend.\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:21 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:21 selector.py:129] Using XFormers backend.\n","output_type":"stream"},{"name":"stderr","text":"/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. 
os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n self.pid = os.fork()\n","output_type":"stream"},{"name":"stdout","text":"\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:21 multiproc_worker_utils.py:222] Worker ready; awaiting tasks\nINFO 12-29 09:58:23 utils.py:918] Found nccl from library libnccl.so.2\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:23 pynccl.py:69] vLLM is using nccl==2.21.5\nINFO 12-29 09:58:23 utils.py:918] Found nccl from library libnccl.so.2\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:23 pynccl.py:69] vLLM is using nccl==2.21.5\nINFO 12-29 09:58:23 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\nINFO 12-29 09:58:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\nINFO 12-29 09:58:43 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_ec68f226'), local_subscribe_port=41249, remote_subscribe_port=None)\nINFO 12-29 09:58:43 model_runner.py:1094] Starting to load model /kaggle/input/qwen2.5/transformers/32b-instruct-awq/1...\n\u001b[1;36m(VllmWorkerProcess pid=198)\u001b[0;0m INFO 12-29 09:58:43 model_runner.py:1094] Starting to load model /kaggle/input/qwen2.5/transformers/32b-instruct-awq/1...\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00system\\nAs a helpful assistant, please provide suitable answer to the user query.<|im_end|>\\n<|im_start|>user\\nWhat is the best way to implement the RAG based chatting engine?<|im_end|>\\n<|im_start|>assistant\\n', prompt_token_ids=[151644, 8948, 198, 2121, 264, 10950, 17847, 11, 4486, 3410, 14452, 4226, 311, 279, 1196, 3239, 13, 151645, 198, 151644, 872, 198, 3838, 374, 279, 1850, 1616, 311, 4211, 279, 431, 1890, 3118, 50967, 4712, 30, 151645, 198, 151644, 77091, 198], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='Implementing a RAG (Retrieval-Augmented Generation) based chatting engine involves combining retrieval methods with generative models to create a powerful conversational AI system. Here’s a step-by-step guide to help you get started:\\n\\n### 1. Define Your Use Case\\n- **Purpose**: Determine the specific purpose of your chatbot (e.g., customer support, information retrieval, personal assistant).\\n- **Data**: Identify the type of data you will use (e.g., documents, FAQs, product descriptions).\\n\\n### 2. Choose Your Retrieval Method\\n- **Document Retrieval**: Use techniques like TF-IDF, BM25, or dense retrieval methods (e.g., DPR, ColBERT).\\n- **Knowledge Base**: If you have a structured knowledge base, consider using graph-based retrieval methods.\\n\\n### 3. Select a Generative Model\\n- **Pre-trained Models**: Use pre-trained models like T5, BART, or GPT-3.\\n- **Fine-tuning**: Fine-tune the model on your specific dataset to improve performance.\\n\\n### 4. 
Integrate Retrieval and Generation\\n- **Retrieval-Augmented Generation (RAG)**: Use a model like RAG, which combines retrieval and generation. RAG uses a retriever to fetch relevant documents and a generator to produce the final response.\\n- **Hybrid Approach**: You can also implement a hybrid approach where the retrieval system fetches relevant documents, and the generative model uses these documents to generate the response.\\n\\n### 5. Data Preparation\\n- **Document Corpus**: Prepare a corpus of documents that the retrieval system can use.\\n- **Training Data**: Collect or generate training data for fine-tuning the generative model.\\n\\n### 6. Implementation\\n- **Framework**: Use a deep learning framework like Hugging Face Transformers, which provides pre-trained models and tools for fine-tuning.\\n- **Code Example**:\\n ```python\\n from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration\\n\\n # Initialize tokenizer and retriever\\n tokenizer = RagTokenizer.from_pretrained(\"facebook/rag-sequence-nq\")\\n retriever = RagRetriever.from_pretrained(\"facebook/rag-sequence-nq\", index_name=\"exact\", use_dummy_dataset=True)\\n\\n # Initialize model\\n model = RagSequenceForGeneration.from_pretrained(\"facebook/rag-sequence-nq\")\\n\\n # Example query\\n query = \"What is the', token_ids=(62980, 287, 264, 431, 1890, 320, 12020, 7231, 831, 61635, 26980, 23470, 8, 3118, 50967, 4712, 17601, 34171, 56370, 5413, 448, 1766, 1388, 4119, 311, 1855, 264, 7988, 7517, 1663, 15235, 1849, 13, 5692, 748, 264, 3019, 14319, 29208, 8474, 311, 1492, 498, 633, 3855, 1447, 14374, 220, 16, 13, 18614, 4615, 5443, 11538, 198, 12, 3070, 74033, 95518, 29901, 279, 3151, 7428, 315, 697, 6236, 6331, 320, 68, 1302, 2572, 6002, 1824, 11, 1995, 56370, 11, 4345, 17847, 4292, 12, 3070, 1043, 95518, 64547, 279, 943, 315, 821, 498, 686, 990, 320, 68, 1302, 2572, 9293, 11, 86584, 11, 1985, 27787, 3593, 14374, 220, 17, 13, 22201, 4615, 19470, 831, 6730, 198, 12, 3070, 7524, 19470, 831, 95518, 5443, 12538, 1075, 29145, 53365, 37, 11, 19800, 17, 20, 11, 476, 27950, 56370, 5413, 320, 68, 1302, 2572, 87901, 11, 4254, 61437, 4292, 12, 3070, 80334, 5351, 95518, 1416, 498, 614, 264, 32930, 6540, 2331, 11, 2908, 1667, 4771, 5980, 56370, 5413, 382, 14374, 220, 18, 13, 8427, 264, 2607, 1388, 4903, 198, 12, 3070, 4703, 68924, 26874, 95518, 5443, 855, 68924, 4119, 1075, 350, 20, 11, 425, 2992, 11, 476, 479, 2828, 12, 18, 624, 12, 3070, 63716, 2385, 37202, 95518, 30153, 2385, 2886, 279, 1614, 389, 697, 3151, 10337, 311, 7269, 5068, 382, 14374, 220, 19, 13, 1333, 57017, 19470, 831, 323, 23470, 198, 12, 3070, 12020, 7231, 831, 61635, 26980, 23470, 320, 49, 1890, 32295, 25, 5443, 264, 1614, 1075, 431, 1890, 11, 892, 32411, 56370, 323, 9471, 13, 431, 1890, 5711, 264, 10759, 423, 311, 7807, 9760, 9293, 323, 264, 13823, 311, 8193, 279, 1590, 2033, 624, 12, 3070, 30816, 16223, 53084, 95518, 1446, 646, 1083, 4211, 264, 24989, 5486, 1380, 279, 56370, 1849, 7807, 288, 9760, 9293, 11, 323, 279, 1766, 1388, 1614, 5711, 1493, 9293, 311, 6923, 279, 2033, 382, 14374, 220, 20, 13, 2885, 73335, 198, 12, 3070, 7524, 75734, 95518, 31166, 264, 42094, 315, 9293, 429, 279, 56370, 1849, 646, 990, 624, 12, 3070, 36930, 2885, 95518, 20513, 476, 6923, 4862, 821, 369, 6915, 2385, 37202, 279, 1766, 1388, 1614, 382, 14374, 220, 21, 13, 30813, 198, 12, 3070, 14837, 95518, 5443, 264, 5538, 6832, 12626, 1075, 472, 35268, 18596, 80532, 11, 892, 5707, 855, 68924, 4119, 323, 7375, 369, 6915, 2385, 37202, 624, 12, 3070, 2078, 13383, 334, 510, 220, 
54275, 12669, 198, 220, 504, 86870, 1159, 50259, 37434, 11, 50259, 12020, 461, 2054, 11, 50259, 14076, 2461, 37138, 271, 220, 671, 9008, 45958, 323, 10759, 423, 198, 220, 45958, 284, 50259, 37434, 6387, 10442, 35722, 445, 20944, 14, 4101, 7806, 4375, 5279, 80, 1138, 220, 10759, 423, 284, 50259, 12020, 461, 2054, 6387, 10442, 35722, 445, 20944, 14, 4101, 7806, 4375, 5279, 80, 497, 1922, 1269, 428, 46385, 497, 990, 60321, 18999, 3618, 692, 220, 671, 9008, 1614, 198, 220, 1614, 284, 50259, 14076, 2461, 37138, 6387, 10442, 35722, 445, 20944, 14, 4101, 7806, 4375, 5279, 80, 5130, 220, 671, 13383, 3239, 198, 220, 3239, 284, 330, 3838, 374, 279), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1735466557.788089, last_token_time=1735466557.788089, first_scheduled_time=1735466557.815613, first_token_time=1735466559.064989, time_in_queue=0.02752399444580078, finished_time=1735466610.3815641, scheduler_time=0.08793257400009225, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0, multi_modal_placeholders={}), RequestOutput(request_id=1, prompt='<|im_start|>system\\nAs a helpful assistant, please provide suitable answer to the user query.<|im_end|>\\n<|im_start|>user\\nExplain me who you are?<|im_end|>\\n<|im_start|>assistant\\n', prompt_token_ids=[151644, 8948, 198, 2121, 264, 10950, 17847, 11, 4486, 3410, 14452, 4226, 311, 279, 1196, 3239, 13, 151645, 198, 151644, 872, 198, 840, 20772, 752, 879, 498, 525, 30, 151645, 198, 151644, 77091, 198], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=\"I am an artificial intelligence designed to assist and communicate with users like you. My purpose is to provide information, answer questions, and help with tasks to the best of my ability using the knowledge and capabilities I've been given. How can I assist you today?\", token_ids=(40, 1079, 458, 20443, 11229, 6188, 311, 7789, 323, 19032, 448, 3847, 1075, 498, 13, 3017, 7428, 374, 311, 3410, 1995, 11, 4226, 4755, 11, 323, 1492, 448, 9079, 311, 279, 1850, 315, 847, 5726, 1667, 279, 6540, 323, 16928, 358, 3003, 1012, 2661, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151645), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1735466557.812375, last_token_time=1735466557.812375, first_scheduled_time=1735466557.815613, first_token_time=1735466559.064989, time_in_queue=0.0032379627227783203, finished_time=1735466564.475595, scheduler_time=0.01299900199910553, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0, multi_modal_placeholders={}), RequestOutput(request_id=2, prompt='<|im_start|>system\\nAs a helpful assistant, please provide suitable answer to the user query.<|im_end|>\\n<|im_start|>user\\nJava Spring vs FastAPI, which is better in 2024?<|im_end|>\\n<|im_start|>assistant\\n', prompt_token_ids=[151644, 8948, 198, 2121, 264, 10950, 17847, 11, 4486, 3410, 14452, 4226, 311, 279, 1196, 3239, 13, 151645, 198, 151644, 872, 198, 15041, 12252, 6165, 17288, 7082, 11, 892, 374, 2664, 304, 220, 17, 15, 17, 19, 30, 151645, 198, 151644, 77091, 198], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='Choosing between Java Spring and FastAPI depends on your specific project requirements, team expertise, and the ecosystem you prefer. 
Here’s a comparison to help you decide:\\n\\n### Java Spring\\n**Pros:**\\n1. **Maturity and Stability:** Spring is a well-established framework with a large community and extensive documentation.\\n2. **Enterprise Support:** It is widely used in enterprise environments, offering robust features for large-scale applications.\\n3. **Ecosystem:** Spring has a rich ecosystem with various modules (Spring Boot, Spring Data, Spring Security) that can be easily integrated.\\n4. **Community and Support:** Large community and extensive support, making it easier to find solutions to common problems.\\n5. **Java Ecosystem:** If you are already working in a Java environment, Spring integrates seamlessly with existing Java tools and libraries.\\n\\n**Cons:**\\n1. **Verbosity:** Java and Spring can be more verbose compared to Python frameworks.\\n2. **Performance:** While Spring Boot is optimized for performance, it might not be as fast as some Python frameworks for certain use cases.\\n3. **Learning Curve:** If your team is not familiar with Java, there might be a steeper learning curve.\\n\\n### FastAPI\\n**Pros:**\\n1. **Performance:** FastAPI is built on top of Starlette and Pydantic, making it highly performant and capable of handling high traffic.\\n2. **Developer Productivity:** FastAPI is designed to be developer-friendly with auto-generated API documentation (Swagger UI) and type hints.\\n3. **Simplicity:** It is simpler and more concise compared to Java frameworks, making it easier to write and maintain code.\\n4. **Python Ecosystem:** If you are already working in a Python environment, FastAPI integrates well with Python libraries and tools.\\n5. **Modern Features:** FastAPI supports modern web features like asynchronous programming out of the box.\\n\\n**Cons:**\\n1. **Maturity:** While FastAPI is gaining popularity, it is relatively newer compared to Spring.\\n2. **Enterprise Adoption:** It may not be as widely adopted in large enterprise environments as Spring.\\n3. 
**Community:** Although growing, the community and support for FastAPI are not as extensive as those for Spring.\\n\\n### Conclusion\\n- **Use Java Spring if:**\\n - You are working in an enterprise environment.\\n - You need a mature and stable framework with extensive documentation.\\n - You are already working in a Java ecosystem.\\n - You require a rich set', token_ids=(95044, 1948, 7943, 12252, 323, 17288, 7082, 13798, 389, 697, 3151, 2390, 8502, 11, 2083, 18726, 11, 323, 279, 24982, 498, 10702, 13, 5692, 748, 264, 12313, 311, 1492, 498, 10279, 1447, 14374, 7943, 12252, 198, 334, 44915, 25, 1019, 16, 13, 3070, 44, 37854, 323, 80138, 66963, 12252, 374, 264, 1632, 63768, 12626, 448, 264, 3460, 3942, 323, 16376, 9705, 624, 17, 13, 3070, 85647, 9186, 66963, 1084, 374, 13570, 1483, 304, 20179, 21737, 11, 10004, 21765, 4419, 369, 3460, 12934, 8357, 624, 18, 13, 3070, 36, 23287, 66963, 12252, 702, 264, 9080, 24982, 448, 5257, 13454, 320, 25150, 15004, 11, 12252, 2885, 11, 12252, 8234, 8, 429, 646, 387, 6707, 18250, 624, 19, 13, 3070, 33768, 323, 9186, 66963, 20286, 3942, 323, 16376, 1824, 11, 3259, 432, 8661, 311, 1477, 9904, 311, 4185, 5322, 624, 20, 13, 3070, 15041, 468, 23287, 66963, 1416, 498, 525, 2669, 3238, 304, 264, 7943, 4573, 11, 12252, 74662, 60340, 448, 6350, 7943, 7375, 323, 20186, 382, 334, 15220, 25, 1019, 16, 13, 3070, 66946, 22053, 66963, 7943, 323, 12252, 646, 387, 803, 13694, 7707, 311, 13027, 48025, 624, 17, 13, 3070, 34791, 66963, 5976, 12252, 15004, 374, 33340, 369, 5068, 11, 432, 2578, 537, 387, 438, 4937, 438, 1045, 13027, 48025, 369, 3654, 990, 5048, 624, 18, 13, 3070, 47467, 53677, 66963, 1416, 697, 2083, 374, 537, 11285, 448, 7943, 11, 1052, 2578, 387, 264, 357, 43031, 6832, 15655, 382, 14374, 17288, 7082, 198, 334, 44915, 25, 1019, 16, 13, 3070, 34791, 66963, 17288, 7082, 374, 5798, 389, 1909, 315, 7679, 9809, 323, 5355, 67, 8159, 11, 3259, 432, 7548, 2736, 517, 323, 12875, 315, 11589, 1550, 9442, 624, 17, 13, 3070, 44911, 5643, 1927, 66963, 17288, 7082, 374, 6188, 311, 387, 15754, 21896, 448, 3233, 16185, 5333, 9705, 320, 67714, 3689, 8, 323, 943, 30643, 624, 18, 13, 3070, 50, 6664, 24779, 66963, 1084, 374, 34288, 323, 803, 63594, 7707, 311, 7943, 48025, 11, 3259, 432, 8661, 311, 3270, 323, 10306, 2038, 624, 19, 13, 3070, 30280, 468, 23287, 66963, 1416, 498, 525, 2669, 3238, 304, 264, 13027, 4573, 11, 17288, 7082, 74662, 1632, 448, 13027, 20186, 323, 7375, 624, 20, 13, 3070, 48452, 19710, 66963, 17288, 7082, 11554, 6481, 3482, 4419, 1075, 39007, 15473, 700, 315, 279, 3745, 382, 334, 15220, 25, 1019, 16, 13, 3070, 44, 37854, 66963, 5976, 17288, 7082, 374, 29140, 22538, 11, 432, 374, 12040, 25546, 7707, 311, 12252, 624, 17, 13, 3070, 85647, 92017, 66963, 1084, 1231, 537, 387, 438, 13570, 17827, 304, 3460, 20179, 21737, 438, 12252, 624, 18, 13, 3070, 33768, 66963, 10328, 7826, 11, 279, 3942, 323, 1824, 369, 17288, 7082, 525, 537, 438, 16376, 438, 1846, 369, 12252, 382, 14374, 73877, 198, 12, 3070, 10253, 7943, 12252, 421, 25, 1019, 220, 481, 1446, 525, 3238, 304, 458, 20179, 4573, 624, 220, 481, 1446, 1184, 264, 14851, 323, 15175, 12626, 448, 16376, 9705, 624, 220, 481, 1446, 525, 2669, 3238, 304, 264, 7943, 24982, 624, 220, 481, 1446, 1373, 264, 9080, 738), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1735466557.8131218, last_token_time=1735466557.8131218, first_scheduled_time=1735466557.815613, first_token_time=1735466559.064989, time_in_queue=0.0024912357330322266, 
finished_time=1735466610.3816202, scheduler_time=0.08793257400009225, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})]\n","output_type":"stream"}],"execution_count":13},{"cell_type":"code","source":"responses = [x.outputs[0].text for x in responses]","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T10:03:30.394497Z","iopub.execute_input":"2024-12-29T10:03:30.394761Z","iopub.status.idle":"2024-12-29T10:03:30.408788Z","shell.execute_reply.started":"2024-12-29T10:03:30.394739Z","shell.execute_reply":"2024-12-29T10:03:30.408029Z"}},"outputs":[],"execution_count":14},{"cell_type":"code","source":"print(responses)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2024-12-29T10:06:24.086336Z","iopub.execute_input":"2024-12-29T10:06:24.086713Z","iopub.status.idle":"2024-12-29T10:06:24.091307Z","shell.execute_reply.started":"2024-12-29T10:06:24.086675Z","shell.execute_reply":"2024-12-29T10:06:24.090466Z"}},"outputs":[{"name":"stdout","text":"['Implementing a RAG (Retrieval-Augmented Generation) based chatting engine involves combining retrieval methods with generative models to create a powerful conversational AI system. Here’s a step-by-step guide to help you get started:\\n\\n### 1. Define Your Use Case\\n- **Purpose**: Determine the specific purpose of your chatbot (e.g., customer support, information retrieval, personal assistant).\\n- **Data**: Identify the type of data you will use (e.g., documents, FAQs, product descriptions).\\n\\n### 2. Choose Your Retrieval Method\\n- **Document Retrieval**: Use techniques like TF-IDF, BM25, or dense retrieval methods (e.g., DPR, ColBERT).\\n- **Knowledge Base**: If you have a structured knowledge base, consider using graph-based retrieval methods.\\n\\n### 3. Select a Generative Model\\n- **Pre-trained Models**: Use pre-trained models like T5, BART, or GPT-3.\\n- **Fine-tuning**: Fine-tune the model on your specific dataset to improve performance.\\n\\n### 4. Integrate Retrieval and Generation\\n- **Retrieval-Augmented Generation (RAG)**: Use a model like RAG, which combines retrieval and generation. RAG uses a retriever to fetch relevant documents and a generator to produce the final response.\\n- **Hybrid Approach**: You can also implement a hybrid approach where the retrieval system fetches relevant documents, and the generative model uses these documents to generate the response.\\n\\n### 5. Data Preparation\\n- **Document Corpus**: Prepare a corpus of documents that the retrieval system can use.\\n- **Training Data**: Collect or generate training data for fine-tuning the generative model.\\n\\n### 6. Implementation\\n- **Framework**: Use a deep learning framework like Hugging Face Transformers, which provides pre-trained models and tools for fine-tuning.\\n- **Code Example**:\\n ```python\\n from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration\\n\\n # Initialize tokenizer and retriever\\n tokenizer = RagTokenizer.from_pretrained(\"facebook/rag-sequence-nq\")\\n retriever = RagRetriever.from_pretrained(\"facebook/rag-sequence-nq\", index_name=\"exact\", use_dummy_dataset=True)\\n\\n # Initialize model\\n model = RagSequenceForGeneration.from_pretrained(\"facebook/rag-sequence-nq\")\\n\\n # Example query\\n query = \"What is the', \"I am an artificial intelligence designed to assist and communicate with users like you. 
My purpose is to provide information, answer questions, and help with tasks to the best of my ability using the knowledge and capabilities I've been given. How can I assist you today?\", 'Choosing between Java Spring and FastAPI depends on your specific project requirements, team expertise, and the ecosystem you prefer. Here’s a comparison to help you decide:\\n\\n### Java Spring\\n**Pros:**\\n1. **Maturity and Stability:** Spring is a well-established framework with a large community and extensive documentation.\\n2. **Enterprise Support:** It is widely used in enterprise environments, offering robust features for large-scale applications.\\n3. **Ecosystem:** Spring has a rich ecosystem with various modules (Spring Boot, Spring Data, Spring Security) that can be easily integrated.\\n4. **Community and Support:** Large community and extensive support, making it easier to find solutions to common problems.\\n5. **Java Ecosystem:** If you are already working in a Java environment, Spring integrates seamlessly with existing Java tools and libraries.\\n\\n**Cons:**\\n1. **Verbosity:** Java and Spring can be more verbose compared to Python frameworks.\\n2. **Performance:** While Spring Boot is optimized for performance, it might not be as fast as some Python frameworks for certain use cases.\\n3. **Learning Curve:** If your team is not familiar with Java, there might be a steeper learning curve.\\n\\n### FastAPI\\n**Pros:**\\n1. **Performance:** FastAPI is built on top of Starlette and Pydantic, making it highly performant and capable of handling high traffic.\\n2. **Developer Productivity:** FastAPI is designed to be developer-friendly with auto-generated API documentation (Swagger UI) and type hints.\\n3. **Simplicity:** It is simpler and more concise compared to Java frameworks, making it easier to write and maintain code.\\n4. **Python Ecosystem:** If you are already working in a Python environment, FastAPI integrates well with Python libraries and tools.\\n5. **Modern Features:** FastAPI supports modern web features like asynchronous programming out of the box.\\n\\n**Cons:**\\n1. **Maturity:** While FastAPI is gaining popularity, it is relatively newer compared to Spring.\\n2. **Enterprise Adoption:** It may not be as widely adopted in large enterprise environments as Spring.\\n3. **Community:** Although growing, the community and support for FastAPI are not as extensive as those for Spring.\\n\\n### Conclusion\\n- **Use Java Spring if:**\\n - You are working in an enterprise environment.\\n - You need a mature and stable framework with extensive documentation.\\n - You are already working in a Java ecosystem.\\n - You require a rich set']\n","output_type":"stream"}],"execution_count":15},{"cell_type":"code","source":"","metadata":{"trusted":true},"outputs":[],"execution_count":null}]} \ No newline at end of file diff --git a/LLMs/vllm/samples/cpu_offload.py b/LLMs/vllm/samples/cpu_offload.py new file mode 100644 index 0000000..7467a00 --- /dev/null +++ b/LLMs/vllm/samples/cpu_offload.py @@ -0,0 +1,23 @@ +from vllm import LLM, SamplingParams + + +# Sample prompts. +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] +# Create a sampling params object. +sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + +# Create an LLM. +llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10) +# Generate texts from the prompts. 
The output is a list of RequestOutput objects +# that contain the prompt, generated text, and other information. +outputs = llm.generate(prompts, sampling_params) +# Print the outputs. +for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") diff --git a/LLMs/vllm/samples/lora_with_quantization.py b/LLMs/vllm/samples/lora_with_quantization.py new file mode 100644 index 0000000..1526f2c --- /dev/null +++ b/LLMs/vllm/samples/lora_with_quantization.py @@ -0,0 +1,127 @@ +import gc +from typing import List, Optional, Tuple + +import torch +from huggingface_hub import snapshot_download + +from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams +from vllm.lora.request import LoRARequest + + +def create_test_prompts( + lora_path: str +) -> List[Tuple[str, SamplingParams, Optional[LoRARequest]]]: + return [ + # this is an example of using quantization without LoRA + ("My name is", + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128), None), + # the next three examples use quantization with LoRA + ("my name is", + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128), + LoRARequest("lora-test-1", 1, lora_path)), + ("The capital of USA is", + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128), + LoRARequest("lora-test-2", 1, lora_path)), + ("The capital of France is", + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128), + LoRARequest("lora-test-3", 1, lora_path)), + ] + + +def process_requests(engine: LLMEngine, + test_prompts: List[Tuple[str, SamplingParams, + Optional[LoRARequest]]]): + """Continuously process a list of prompts and handle the outputs.""" + request_id = 0 + + while test_prompts or engine.has_unfinished_requests(): + if test_prompts: + prompt, sampling_params, lora_request = test_prompts.pop(0) + engine.add_request(str(request_id), + prompt, + sampling_params, + lora_request=lora_request) + request_id += 1 + + request_outputs: List[RequestOutput] = engine.step() + for request_output in request_outputs: + if request_output.finished: + print("----------------------------------------------------") + print(f"Prompt: {request_output.prompt}") + print(f"Output: {request_output.outputs[0].text}") + + +def initialize_engine(model: str, quantization: str, + lora_repo: Optional[str]) -> LLMEngine: + """Initialize the LLMEngine.""" + + if quantization == "bitsandbytes": + # QLoRA (https://arxiv.org/abs/2305.14314) is a quantization technique. + # It quantizes the model when loading, with some config info from the + # LoRA adapter repo. So need to set the parameter of load_format and + # qlora_adapter_name_or_path as below. 
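+ # (For reference: QLoRA keeps the base weights in 4-bit NF4 while the LoRA
+ # adapter stays in higher precision; the adapter repo carries the
+ # quantization config that vLLM reads when loading in this mode.)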
+ engine_args = EngineArgs(model=model, + quantization=quantization, + qlora_adapter_name_or_path=lora_repo, + load_format="bitsandbytes", + enable_lora=True, + max_lora_rank=64) + else: + engine_args = EngineArgs(model=model, + quantization=quantization, + enable_lora=True, + max_loras=4) + return LLMEngine.from_engine_args(engine_args) + + +def main(): + """Main function that sets up and runs the prompt processing.""" + + test_configs = [{ + "name": "qlora_inference_example", + 'model': "huggyllama/llama-7b", + 'quantization': "bitsandbytes", + 'lora_repo': 'timdettmers/qlora-flan-7b' + }, { + "name": "AWQ_inference_with_lora_example", + 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', + 'quantization': "awq", + 'lora_repo': 'jashing/tinyllama-colorist-lora' + }, { + "name": "GPTQ_inference_with_lora_example", + 'model': 'TheBloke/TinyLlama-1.1B-Chat-v0.3-GPTQ', + 'quantization': "gptq", + 'lora_repo': 'jashing/tinyllama-colorist-lora' + }] + + for test_config in test_configs: + print( + f"~~~~~~~~~~~~~~~~ Running: {test_config['name']} ~~~~~~~~~~~~~~~~" + ) + engine = initialize_engine(test_config['model'], + test_config['quantization'], + test_config['lora_repo']) + lora_path = snapshot_download(repo_id=test_config['lora_repo']) + test_prompts = create_test_prompts(lora_path) + process_requests(engine, test_prompts) + + # Clean up the GPU memory for the next test + del engine + gc.collect() + torch.cuda.empty_cache() + + +if __name__ == '__main__': + main() diff --git a/LLMs/vllm/samples/multilora_inference.py b/LLMs/vllm/samples/multilora_inference.py new file mode 100644 index 0000000..3c09106 --- /dev/null +++ b/LLMs/vllm/samples/multilora_inference.py @@ -0,0 +1,100 @@ +from typing import List, Optional, Tuple + +from huggingface_hub import snapshot_download + +from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams +from vllm.lora.request import LoRARequest + + +def create_test_prompts( + lora_path: str +) -> List[Tuple[str, SamplingParams, Optional[LoRARequest]]]: + """Create a list of test prompts with their sampling parameters. + + 2 requests for base model, 4 requests for the LoRA. We define 2 + different LoRA adapters (using the same model for demo purposes). + Since we also set `max_loras=1`, the expectation is that the requests + with the second LoRA adapter will be ran after all requests with the + first adapter have finished. 
+ """ + return [ + ("A robot may not injure a human being", + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128), None), + ("To be or not to be,", + SamplingParams(temperature=0.8, + top_k=5, + presence_penalty=0.2, + max_tokens=128), None), + ( + "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]", # noqa: E501 + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128, + stop_token_ids=[32003]), + LoRARequest("sql-lora", 1, lora_path)), + ( + "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]", # noqa: E501 + SamplingParams(temperature=0.0, + logprobs=1, + prompt_logprobs=1, + max_tokens=128, + stop_token_ids=[32003]), + LoRARequest("sql-lora2", 2, lora_path)), + ] + + +def process_requests( + engine: LLMEngine, + test_prompts: List[Tuple[str, SamplingParams, Optional[LoRARequest]]] +): + """Continuously process a list of prompts and handle the outputs.""" + request_id = 0 + + while test_prompts or engine.has_unfinished_requests(): + if test_prompts: + prompt, sampling_params, lora_request = test_prompts.pop(0) + engine.add_request(str(request_id), + prompt, + sampling_params, + lora_request=lora_request) + request_id += 1 + + request_outputs: List[RequestOutput] = engine.step() + + for request_output in request_outputs: + if request_output.finished: + print(request_output) + + +def initialize_engine() -> LLMEngine: + """Initialize the LLMEngine.""" + # max_loras: controls the number of LoRAs that can be used in the same + # batch. Larger numbers will cause higher memory usage, as each LoRA + # slot requires its own preallocated tensor. + # max_lora_rank: controls the maximum supported rank of all LoRAs. Larger + # numbers will cause higher memory usage. If you know that all LoRAs will + # use the same rank, it is recommended to set this as low as possible. + # max_cpu_loras: controls the size of the CPU LoRA cache. + engine_args = EngineArgs(model="meta-llama/Llama-2-7b-hf", + enable_lora=True, + max_loras=1, + max_lora_rank=8, + max_cpu_loras=2, + max_num_seqs=256) + return LLMEngine.from_engine_args(engine_args) + + +def main(): + """Main function that sets up and runs the prompt processing.""" + engine = initialize_engine() + lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test") + test_prompts = create_test_prompts(lora_path) + process_requests(engine, test_prompts) + + +if __name__ == '__main__': + main() diff --git a/LLMs/vllm/samples/offline_inference_audio_language.py b/LLMs/vllm/samples/offline_inference_audio_language.py new file mode 100644 index 0000000..e1d072d --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_audio_language.py @@ -0,0 +1,101 @@ +from transformers import AutoTokenizer + +from vllm import LLM, SamplingParams +from vllm.assets.audio import AudioAsset +from vllm.utils import FlexibleArgumentParser + + +audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")] +question_per_audio_count = [ + "What is recited in the audio?", + "What sport and what nursery rhyme are referenced?" 
+] + + +# Ultravox 0.3 +def run_ultravox(question, audio_count): + model_name = "fixie-ai/ultravox-v0_3" + + tokenizer = AutoTokenizer.from_pretrained(model_name) + messages = [{ + 'role': + 'user', + 'content': + "<|reserved_special_token_0|>\n" * audio_count + question + }] + prompt = tokenizer.apply_chat_template(messages, + tokenize=False, + add_generation_prompt=True) + + llm = LLM(model=model_name, + enforce_eager=True, + enable_chunked_prefill=False, + max_model_len=8192, + limit_mm_per_prompt={"audio": audio_count}) + stop_token_ids = None + return llm, prompt, stop_token_ids + + +model_example_map = { + "ultravox": run_ultravox, +} + + +def main(args): + model = args.model_type + if model not in model_example_map: + raise ValueError(f"Model type {model} is not supported.") + + audio_count = args.num_audios + llm, prompt, stop_token_ids = model_example_map[model]( + question_per_audio_count[audio_count - 1], audio_count) + + # We set temperature to 0.2 so that outputs can be different + # even when all prompts are identical when running batch inference. + sampling_params = SamplingParams(temperature=0.2, + max_tokens=64, + stop_token_ids=stop_token_ids) + + assert args.num_prompts > 0 + inputs = { + "prompt": prompt, + "multi_modal_data": { + "audio": [ + asset.audio_and_sample_rate + for asset in audio_assets[:audio_count] + ] + }, + } + if args.num_prompts > 1: + # Batch inference + inputs = [inputs] * args.num_prompts + + outputs = llm.generate(inputs, sampling_params=sampling_params) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + + +if __name__ == "__main__": + parser = FlexibleArgumentParser( + description='Demo on using vLLM for offline inference with ' + 'audio language models') + parser.add_argument('--model-type', + '-m', + type=str, + default="ultravox", + choices=model_example_map.keys(), + help='Huggingface "model_type".') + parser.add_argument('--num-prompts', + type=int, + default=1, + help='Number of prompts to run.') + parser.add_argument("--num-audios", + type=int, + default=1, + choices=[1, 2], + help="Number of audio items per prompt.") + + args = parser.parse_args() + main(args) diff --git a/LLMs/vllm/samples/offline_inference_distributed.py b/LLMs/vllm/samples/offline_inference_distributed.py new file mode 100644 index 0000000..69ce501 --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_distributed.py @@ -0,0 +1,103 @@ +from typing import Any, Dict, List + +import numpy as np +import ray +from packaging.version import Version +from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy + +from vllm import LLM, SamplingParams + + +assert Version(ray.__version__) >= Version("2.22.0"), "Ray version must be at least 2.22.0" + +# Create a sampling params object. +sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + +# Set tensor parallelism per instance. +tensor_parallel_size = 1 + +# Set number of instances. Each instance will use tensor_parallel_size GPUs. +num_instances = 1 + + +# Create a class to do batch inference. +class LLMPredictor: + + def __init__(self): + # Create an LLM. + self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", + tensor_parallel_size=tensor_parallel_size) + + def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: + # Generate texts from the prompts. + # The output is a list of RequestOutput objects that contain the prompt, + # generated text, and other information. 
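+ # `batch["text"]` is the prompt column produced by ray.data.read_text()
+ # further below; each Ray actor holds its own vLLM engine, so every batch
+ # is generated independently by that actor.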
+ outputs = self.llm.generate(batch["text"], sampling_params) + prompt: List[str] = [] + generated_text: List[str] = [] + for output in outputs: + prompt.append(output.prompt) + generated_text.append(' '.join([o.text for o in output.outputs])) + return { + "prompt": prompt, + "generated_text": generated_text, + } + + +# Read one text file from S3. Ray Data supports reading multiple files +# from cloud storage (such as JSONL, Parquet, CSV, binary format). +ds = ray.data.read_text("s3://anonymous@air-example-data/prompts.txt") + + +# For tensor_parallel_size > 1, we need to create placement groups for vLLM +# to use. Every actor has to have its own placement group. +def scheduling_strategy_fn(): + # One bundle per tensor parallel worker + pg = ray.util.placement_group( + [{ + "GPU": 1, + "CPU": 1 + }] * tensor_parallel_size, + strategy="STRICT_PACK", + ) + return dict(scheduling_strategy=PlacementGroupSchedulingStrategy( + pg, placement_group_capture_child_tasks=True)) + + +resources_kwarg: Dict[str, Any] = {} +if tensor_parallel_size == 1: + # For tensor_parallel_size == 1, we simply set num_gpus=1. + resources_kwarg["num_gpus"] = 1 +else: + # Otherwise, we have to set num_gpus=0 and provide + # a function that will create a placement group for + # each instance. + resources_kwarg["num_gpus"] = 0 + resources_kwarg["ray_remote_args_fn"] = scheduling_strategy_fn + +# Apply batch inference for all input data. +ds = ds.map_batches( + LLMPredictor, + # Set the concurrency to the number of LLM instances. + concurrency=num_instances, + # Specify the batch size for inference. + batch_size=32, + **resources_kwarg, +) + + +# Peek first 10 results. +# NOTE: This is for local testing and debugging. For production use case, +# one should write full result out as shown below. +outputs = ds.take(limit=10) +for output in outputs: + prompt = output["prompt"] + generated_text = output["generated_text"] + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + + +# Write inference output data out as Parquet files to S3. +# Multiple files would be written to the output destination, +# and each task would write one or more files separately. +# +# ds.write_parquet("s3://") diff --git a/LLMs/vllm/samples/offline_inference_embedding.py b/LLMs/vllm/samples/offline_inference_embedding.py new file mode 100644 index 0000000..7d5ef12 --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_embedding.py @@ -0,0 +1,17 @@ +from vllm import LLM + +# Sample prompts. +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] + +# Create an LLM. +model = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True) +# Generate embedding. The output is a list of EmbeddingRequestOutputs. +outputs = model.encode(prompts) +# Print the outputs. +for output in outputs: + print(output.outputs.embedding) # list of 4096 floats diff --git a/LLMs/vllm/samples/offline_inference_pixtral.py b/LLMs/vllm/samples/offline_inference_pixtral.py new file mode 100644 index 0000000..c12ff70 --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_pixtral.py @@ -0,0 +1,165 @@ +# ruff: noqa +import argparse + +from vllm import LLM +from vllm.sampling_params import SamplingParams + +# This script is an offline demo for running Pixtral. 
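+# (Pixtral-12B-2409 ships a Mistral-format tokenizer, hence both demos below
+# construct the LLM with tokenizer_mode="mistral".)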
+# +# If you want to run a server/client setup, please follow this code: +# +# - Server: +# +# ```bash +# vllm serve mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --limit-mm-per-prompt 'image=4' --max-model-len 16384 +# ``` +# +# - Client: +# +# ```bash +# curl --location 'http://:8000/v1/chat/completions' \ +# --header 'Content-Type: application/json' \ +# --header 'Authorization: Bearer token' \ +# --data '{ +# "model": "mistralai/Pixtral-12B-2409", +# "messages": [ +# { +# "role": "user", +# "content": [ +# {"type" : "text", "text": "Describe this image in detail please."}, +# {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}}, +# {"type" : "text", "text": "and this one as well. Answer in French."}, +# {"type": "image_url", "image_url": {"url": "https://www.wolframcloud.com/obj/resourcesystem/images/a0e/a0ee3983-46c6-4c92-b85d-059044639928/6af8cfb971db031b.png"}} +# ] +# } +# ] +# }' +# ``` +# +# Usage: +# python demo.py simple +# python demo.py advanced + + +def run_simple_demo(): + model_name = "mistralai/Pixtral-12B-2409" + sampling_params = SamplingParams(max_tokens=8192) + + # Lower max_num_seqs or max_model_len on low-VRAM GPUs. + llm = LLM(model=model_name, tokenizer_mode="mistral") + + prompt = "Describe this image in one sentence." + image_url = "https://picsum.photos/id/237/200/300" + + messages = [ + { + "role": + "user", + "content": [ + { + "type": "text", + "text": prompt + }, + { + "type": "image_url", + "image_url": { + "url": image_url + } + }, + ], + }, + ] + outputs = llm.chat(messages, sampling_params=sampling_params) + + print(outputs[0].outputs[0].text) + + +def run_advanced_demo(): + model_name = "mistralai/Pixtral-12B-2409" + max_img_per_msg = 5 + max_tokens_per_img = 4096 + + sampling_params = SamplingParams(max_tokens=8192, temperature=0.7) + llm = LLM( + model=model_name, + tokenizer_mode="mistral", + limit_mm_per_prompt={"image": max_img_per_msg}, + max_model_len=max_img_per_msg * max_tokens_per_img, + ) + + prompt = "Describe the following image." 
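+ # Three example images are spread across the user turns below; with
+ # limit_mm_per_prompt={"image": max_img_per_msg} set above, the engine
+ # accepts up to five images per prompt.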
+ + url_1 = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png" + url_2 = "https://picsum.photos/seed/picsum/200/300" + url_3 = "https://picsum.photos/id/32/512/512" + + messages = [ + { + "role": + "user", + "content": [ + { + "type": "text", + "text": prompt + }, + { + "type": "image_url", + "image_url": { + "url": url_1 + } + }, + { + "type": "image_url", + "image_url": { + "url": url_2 + } + }, + ], + }, + { + "role": "assistant", + "content": "The images show nature.", + }, + { + "role": "user", + "content": "More details please and answer only in French!.", + }, + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": url_3 + } + }, + ], + }, + ] + + outputs = llm.chat(messages=messages, sampling_params=sampling_params) + print(outputs[0].outputs[0].text) + + +def main(): + parser = argparse.ArgumentParser( + description="Run a demo in simple or advanced mode.") + + parser.add_argument( + "mode", + choices=["simple", "advanced"], + help="Specify the demo mode: 'simple' or 'advanced'", + ) + + args = parser.parse_args() + + if args.mode == "simple": + print("Running simple demo...") + run_simple_demo() + elif args.mode == "advanced": + print("Running advanced demo...") + run_advanced_demo() + + +if __name__ == "__main__": + main() diff --git a/LLMs/vllm/samples/offline_inference_speculator.py b/LLMs/vllm/samples/offline_inference_speculator.py new file mode 100644 index 0000000..5dec4a7 --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_speculator.py @@ -0,0 +1,58 @@ +import gc +import time +from typing import List + +from vllm import LLM, SamplingParams + + +def time_generation(llm: LLM, prompts: List[str], + sampling_params: SamplingParams): + # Generate texts from the prompts. The output is a list of RequestOutput + # objects that contain the prompt, generated text, and other information. + # Warmup first + llm.generate(prompts, sampling_params) + llm.generate(prompts, sampling_params) + start = time.time() + outputs = llm.generate(prompts, sampling_params) + end = time.time() + print((end - start) / sum([len(o.outputs[0].token_ids) for o in outputs])) + # Print the outputs. + for output in outputs: + generated_text = output.outputs[0].text + print(f"text: {generated_text!r}") + + +if __name__ == "__main__": + + template = ( + "Below is an instruction that describes a task. Write a response " + "that appropriately completes the request.\n\n### Instruction:\n{}" + "\n\n### Response:\n") + + # Sample prompts. + prompts = [ + "Write about the president of the United States.", + ] + prompts = [template.format(prompt) for prompt in prompts] + # Create a sampling params object. 
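+ # temperature=0.0 selects greedy decoding, so both the plain and the
+ # speculative runs decode deterministically and the per-token timings are
+ # directly comparable.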
+    sampling_params = SamplingParams(temperature=0.0, max_tokens=200)
+
+    # Create an LLM without spec decoding
+    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
+
+    print("Without speculation")
+    time_generation(llm, prompts, sampling_params)
+
+    del llm
+    gc.collect()
+
+    # Create an LLM with spec decoding
+    llm = LLM(
+        model="meta-llama/Llama-2-13b-chat-hf",
+        speculative_model="ibm-fms/llama-13b-accelerator",
+        # These are currently required for MLPSpeculator decoding
+        use_v2_block_manager=True,
+    )
+
+    print("With speculation")
+    time_generation(llm, prompts, sampling_params)
diff --git a/LLMs/vllm/samples/offline_inference_vision_language_multi_image.py b/LLMs/vllm/samples/offline_inference_vision_language_multi_image.py
new file mode 100644
index 0000000..3b509cc
--- /dev/null
+++ b/LLMs/vllm/samples/offline_inference_vision_language_multi_image.py
@@ -0,0 +1,333 @@
+from argparse import Namespace
+from typing import List, NamedTuple, Optional
+
+from PIL.Image import Image
+from transformers import AutoProcessor, AutoTokenizer
+
+from vllm import LLM, SamplingParams
+from vllm.multimodal.utils import fetch_image
+from vllm.utils import FlexibleArgumentParser
+
+
+QUESTION = "What is the content of each image?"
+IMAGE_URLS = [
+    "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg",
+    "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg",
+]
+
+
+class ModelRequestData(NamedTuple):
+    llm: LLM
+    prompt: str
+    stop_token_ids: Optional[List[int]]
+    image_data: List[Image]
+    chat_template: Optional[str]
+
+
+# NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
+# lower-end GPUs.
+# Unless specified, these settings have been tested to work on a single L4.
+
+
+def load_qwenvl_chat(question: str, image_urls: List[str]) -> ModelRequestData:
+    model_name = "Qwen/Qwen-VL-Chat"
+    llm = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        max_model_len=1024,
+        max_num_seqs=2,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+    placeholders = "".join(f"Picture {i}: <img></img>\n"
+                           for i, _ in enumerate(image_urls, start=1))
+
+    # This model does not have a chat_template attribute on its tokenizer,
+    # so we need to explicitly pass it. We use ChatML since it's used in the
+    # generation utils of the model:
+    # https://huggingface.co/Qwen/Qwen-VL-Chat/blob/main/qwen_generation_utils.py#L265
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_name,
+        trust_remote_code=True
+    )
+
+    # Copied from: https://huggingface.co/docs/transformers/main/en/chat_templating
+    chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"  # noqa: E501
+
+    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]
+    prompt = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+        chat_template=chat_template
+    )
+
+    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]
+    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
+    return ModelRequestData(
+        llm=llm,
+        prompt=prompt,
+        stop_token_ids=stop_token_ids,
+        image_data=[fetch_image(url) for url in image_urls],
+        chat_template=chat_template,
+    )
+
+
+def load_phi3v(question: str, image_urls: List[str]) -> ModelRequestData:
+    # num_crops is an override kwarg to the multimodal image processor;
+    # for some models, e.g., Phi-3.5-vision-instruct, it is recommended
+    # to use 16 for single-frame scenarios, and 4 for multi-frame.
+    #
+    # Generally speaking, a larger value for num_crops results in more
+    # tokens per image instance, because it may scale the image more in
+    # the image preprocessing. Some references in the model docs and the
+    # formula for image tokens after the preprocessing
+    # transform can be found below.
+    #
+    # https://huggingface.co/microsoft/Phi-3.5-vision-instruct#loading-the-model-locally
+    # https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/main/processing_phi3_v.py#L194
+    llm = LLM(
+        model="microsoft/Phi-3.5-vision-instruct",
+        trust_remote_code=True,
+        max_model_len=4096,
+        max_num_seqs=2,
+        limit_mm_per_prompt={"image": len(image_urls)},
+        mm_processor_kwargs={"num_crops": 4},
+    )
+    placeholders = "\n".join(f"<|image_{i}|>"
+                             for i, _ in enumerate(image_urls, start=1))
+    prompt = f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"
+    stop_token_ids = None
+
+    return ModelRequestData(
+        llm=llm,
+        prompt=prompt,
+        stop_token_ids=stop_token_ids,
+        image_data=[fetch_image(url) for url in image_urls],
+        chat_template=None,
+    )
+
+
+def load_internvl(question: str, image_urls: List[str]) -> ModelRequestData:
+    model_name = "OpenGVLab/InternVL2-2B"
+
+    llm = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        max_model_len=4096,
+        limit_mm_per_prompt={"image": len(image_urls)},
+        mm_processor_kwargs={"max_dynamic_patch": 4},
+    )
+
+    placeholders = "\n".join(f"Image-{i}: <image>\n"
+                             for i, _ in enumerate(image_urls, start=1))
+    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              trust_remote_code=True)
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+
+    # Stop tokens for InternVL: model variants may have different stop tokens,
+    # so please refer to the model card for the correct "stop words":
+    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
+    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
+    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
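+    # The ids above are token ids (ints); run_generate/run_chat pass them to
+    # SamplingParams(stop_token_ids=...) so generation halts on those tokens.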
+
+    return ModelRequestData(
+        llm=llm,
+        prompt=prompt,
+        stop_token_ids=stop_token_ids,
+        image_data=[fetch_image(url) for url in image_urls],
+        chat_template=None,
+    )
+
+
+def load_nvlm_d(question: str, image_urls: List[str]) -> ModelRequestData:
+    model_name = "nvidia/NVLM-D-72B"
+
+    # Adjust this as necessary to fit in GPU memory
+    llm = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        max_model_len=8192,
+        tensor_parallel_size=4,
+        limit_mm_per_prompt={"image": len(image_urls)},
+        mm_processor_kwargs={"max_dynamic_patch": 4},
+    )
+
+    placeholders = "\n".join(f"Image-{i}: <image>\n"
+                             for i, _ in enumerate(image_urls, start=1))
+    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              trust_remote_code=True)
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+    stop_token_ids = None
+
+    return ModelRequestData(
+        llm=llm,
+        prompt=prompt,
+        stop_token_ids=stop_token_ids,
+        image_data=[fetch_image(url) for url in image_urls],
+        chat_template=None,
+    )
+
+
+def load_qwen2_vl(question: str, image_urls: List[str]) -> ModelRequestData:
+    try:
+        from qwen_vl_utils import process_vision_info
+    except ModuleNotFoundError:
+        print('WARNING: `qwen-vl-utils` not installed, input images will not '
+              'be automatically resized. You can enable this functionality by '
+              '`pip install qwen-vl-utils`.')
+        process_vision_info = None
+
+    model_name = "Qwen/Qwen2-VL-7B-Instruct"
+
+    # Tested on L40
+    llm = LLM(
+        model=model_name,
+        max_model_len=32768 if process_vision_info is None else 4096,
+        max_num_seqs=5,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+
+    placeholders = [{"type": "image", "image": url} for url in image_urls]
+    messages = [{
+        "role": "system",
+        "content": "You are a helpful assistant."
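+        # The system turn above is plain text; the user turn that follows
+        # combines the image placeholder entries with the text question.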
+ }, { + "role": + "user", + "content": [ + *placeholders, + { + "type": "text", + "text": question + }, + ], + }] + + processor = AutoProcessor.from_pretrained(model_name) + + prompt = processor.apply_chat_template(messages, + tokenize=False, + add_generation_prompt=True) + + stop_token_ids = None + + if process_vision_info is None: + image_data = [fetch_image(url) for url in image_urls] + else: + image_data, _ = process_vision_info(messages) + + return ModelRequestData( + llm=llm, + prompt=prompt, + stop_token_ids=stop_token_ids, + image_data=image_data, + chat_template=None, + ) + + +model_example_map = { + "phi3_v": load_phi3v, + "internvl_chat": load_internvl, + "NVLM_D": load_nvlm_d, + "qwen2_vl": load_qwen2_vl, + "qwen_vl_chat": load_qwenvl_chat, +} + + +def run_generate(model, question: str, image_urls: List[str]): + req_data = model_example_map[model](question, image_urls) + + sampling_params = SamplingParams( + temperature=0.0, + max_tokens=128, + stop_token_ids=req_data.stop_token_ids + ) + + outputs = req_data.llm.generate( + { + "prompt": req_data.prompt, + "multi_modal_data": { + "image": req_data.image_data + }, + }, + sampling_params=sampling_params) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + + +def run_chat(model: str, question: str, image_urls: List[str]): + req_data = model_example_map[model](question, image_urls) + + sampling_params = SamplingParams( + temperature=0.0, + max_tokens=128, + stop_token_ids=req_data.stop_token_ids + ) + outputs = req_data.llm.chat( + [{ + "role": + "user", + "content": [ + { + "type": "text", + "text": question, + }, + *({ + "type": "image_url", + "image_url": { + "url": image_url + }, + } for image_url in image_urls), + ], + }], + sampling_params=sampling_params, + chat_template=req_data.chat_template, + ) + + for o in outputs: + generated_text = o.outputs[0].text + print(generated_text) + + +def main(args: Namespace): + model = args.model_type + method = args.method + + if method == "generate": + run_generate(model, QUESTION, IMAGE_URLS) + elif method == "chat": + run_chat(model, QUESTION, IMAGE_URLS) + else: + raise ValueError(f"Invalid method: {method}") + + +if __name__ == "__main__": + parser = FlexibleArgumentParser( + description='Demo on using vLLM for offline inference with ' + 'vision language models that support multi-image input') + parser.add_argument('--model-type', + '-m', + type=str, + default="phi3_v", + choices=model_example_map.keys(), + help='Huggingface "model_type".') + parser.add_argument("--method", + type=str, + default="generate", + choices=["generate", "chat"], + help="The method to run in `vllm.LLM`.") + + args = parser.parse_args() + main(args) diff --git a/LLMs/vllm/samples/offline_inference_with_profiler.py b/LLMs/vllm/samples/offline_inference_with_profiler.py new file mode 100644 index 0000000..1f00d26 --- /dev/null +++ b/LLMs/vllm/samples/offline_inference_with_profiler.py @@ -0,0 +1,33 @@ +import os + +from vllm import LLM, SamplingParams + +# enable torch profiler, can also be set on cmd line +os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile" + +# Sample prompts. +prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", +] +# Create a sampling params object. +sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + +# Create an LLM. +llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1) + +llm.start_profile() + +# Generate texts from the prompts. 
The output is a list of RequestOutput objects +# that contain the prompt, generated text, and other information. +outputs = llm.generate(prompts, sampling_params) + +llm.stop_profile() + +# Print the outputs. +for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") diff --git a/LLMs/vllm/samples/run_cluster.sh b/LLMs/vllm/samples/run_cluster.sh new file mode 100755 index 0000000..8e4aa59 --- /dev/null +++ b/LLMs/vllm/samples/run_cluster.sh @@ -0,0 +1,49 @@ +#!/bin/bash + +# Check for minimum number of required arguments +if [ $# -lt 4 ]; then + echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]" + exit 1 +fi + +# Assign the first three arguments and shift them away +DOCKER_IMAGE="$1" +HEAD_NODE_ADDRESS="$2" +NODE_TYPE="$3" # Should be --head or --worker +PATH_TO_HF_HOME="$4" +shift 4 + +# Additional arguments are passed directly to the Docker command +ADDITIONAL_ARGS="$@" + +# Validate node type +if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then + echo "Error: Node type must be --head or --worker" + exit 1 +fi + +# Define a function to cleanup on EXIT signal +cleanup() { + docker stop node + docker rm node +} +trap cleanup EXIT + +# Command setup for head or worker node +RAY_START_CMD="ray start --block" +if [ "${NODE_TYPE}" == "--head" ]; then + RAY_START_CMD+=" --head --port=6379" +else + RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379" +fi + +# Run the docker command with the user specified parameters and additional arguments +docker run \ + --entrypoint /bin/bash \ + --network host \ + --name node \ + --shm-size 10.24g \ + --gpus all \ + -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \ + ${ADDITIONAL_ARGS} \ + "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
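+
+# Example usage (illustrative values; substitute your own image tag, head-node
+# IP address, and Hugging Face cache path):
+#
+#   On the head node:
+#     ./run_cluster.sh vllm/vllm-openai:latest 192.168.1.10 --head ~/.cache/huggingface
+#
+#   On each worker node:
+#     ./run_cluster.sh vllm/vllm-openai:latest 192.168.1.10 --worker ~/.cache/huggingface
+#
+# Anything after the fourth argument is forwarded verbatim to `docker run`,
+# e.g. extra -e VAR=value environment flags.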