Changes from all commits

137 commits
42fd8c7
feat(rerank): add Pydantic models for /v1/rerank endpoint
Thump604 Apr 14, 2026
8d32f4c
feat(rerank): add adapter contract and SigmoidAdapter
Thump604 Apr 14, 2026
5679868
feat(ssd-cache): add SSDCacheConfig and SSDCacheStats dataclasses
Thump604 Apr 14, 2026
4a7bb8e
feat(ssd-cache): add SQLite-backed SSDIndex for atomic metadata
Thump604 Apr 14, 2026
9a69478
feat(rerank): add RerankEngine with token-budget batching
Thump604 Apr 14, 2026
90b30c4
feat(ssd-cache): add per-layer serializer interface
Thump604 Apr 14, 2026
c03870f
feat(rerank): add MLX BERT classifier forward pass
Thump604 Apr 14, 2026
50b4325
feat(ssd-cache): add SSDCacheTier core with directory setup
Thump604 Apr 14, 2026
907cf11
feat(rerank): add /v1/rerank endpoint with server wiring
Thump604 Apr 14, 2026
3cbd213
feat(ssd-cache): add async writer thread with atomic spill
Thump604 Apr 14, 2026
4149bc8
feat(rerank): add --rerank-model CLI flag to serve command
Thump604 Apr 14, 2026
6a5b7d4
feat(ssd-cache): wire eviction spill path into MemoryAwarePrefixCache
Thump604 Apr 14, 2026
872235a
test(rerank): add integration tests for /v1/rerank endpoint
Thump604 Apr 14, 2026
1087b13
feat(ssd-cache): add async promote with RAM budget reservation
Thump604 Apr 14, 2026
d060f39
feat(ssd-cache): add disk LRU eviction and startup reconciliation
Thump604 Apr 14, 2026
1313062
feat(ssd-cache): add --ssd-cache-dir and --ssd-cache-max-gb CLI flags
Thump604 Apr 14, 2026
1646aa7
fix(rerank): address code review findings
Thump604 Apr 14, 2026
28824c3
feat(ssd-cache): add scheduler async fetch handoff for cold-tier
Thump604 Apr 14, 2026
d5d087c
test(ssd-cache): add integration tests for spill + fetch round-trip
Thump604 Apr 14, 2026
a98d03f
fix(ssd-cache): address code review findings
Thump604 Apr 14, 2026
6e099a6
fix(ssd-cache): remove unused imports (ruff F401)
Thump604 Apr 14, 2026
a5cc21a
fix(rerank): remove unused imports (ruff F401)
Thump604 Apr 14, 2026
1f28f6d
fix(ssd-cache): remove second unused tempfile import (ruff F401)
Thump604 Apr 14, 2026
55f2417
fix(rerank): unify usage accounting with scoring tokenization
Thump604 Apr 14, 2026
4479900
fix(ssd-cache): correct prefix-hit promotion and remaining tokens
Thump604 Apr 14, 2026
47ea6e7
Add TurboQuant KV cache compression for prefix cache
arozanov Mar 29, 2026
bedab3f
Fix is_trimmable regression: add duck-typing fallback for KVCache/Qua…
arozanov Apr 1, 2026
88bec57
Address review: fix is_trimmable fallback, use packed arrays in memor…
arozanov Apr 1, 2026
6656901
test(memory-cache): align with quantized wrapper
Thump604 Apr 11, 2026
b5dd59d
Address review: fail-fast, trim warning, mutual exclusion
arozanov Apr 16, 2026
06fd71e
Fix failure for test_cancellation_does_not_release_lock_before_worker…
perry2of5 Apr 13, 2026
2f5ce5e
Fix Gemma 4 streaming reasoning parser: buffer partial markers (#355)
janhilgard Apr 16, 2026
d2489c9
fix(api): make response_format actually work on chatty tool-calling m…
janhilgard Apr 16, 2026
d0a4641
Strip markdown code fences from streaming response_format content (#357)
janhilgard Apr 16, 2026
c3d0812
fix: defer MLX imports in SimpleEngine for test isolation (#297)
Thump604 Apr 17, 2026
5e5294f
fix(simple): cooperatively cancel specprefill workers
Thump604 Apr 11, 2026
a7d4d36
test(simple): import mlx for cancellation coverage
Thump604 Apr 11, 2026
b713c35
fix: add missing threading.Event import after rebase
Thump604 Apr 17, 2026
8835494
Merge pull request #280 from Thump604/codex/simpleengine-specprefill-…
Thump604 Apr 17, 2026
d1260dc
Fix garbled output from stale disk-persisted prefix cache
janhilgard Apr 17, 2026
b8dbc35
Add JSON Schema constrained decoding via lm-format-enforcer
janhilgard Apr 17, 2026
e66dfc4
Merge pull request #362 from janhilgard/feat/json-schema-constrained-…
Thump604 Apr 17, 2026
ca71c17
Merge pull request #365 from janhilgard/fix/prefix-cache-version-vali…
Thump604 Apr 17, 2026
be24d95
feat(llm): extend stream_generate to accept prompt_cache and non-str …
Vigilans Apr 2, 2026
6a45698
fix: replace manual model() decode loop with pipelined generation in …
Vigilans Apr 2, 2026
06d8f9d
test: add tests for stream_generate with prompt_cache and non-str prompt
Vigilans Apr 2, 2026
ccf1ec3
fix: add missing mlx_stream_generate import in MLLM _run_specprefill
Vigilans Apr 17, 2026
457ec2d
Merge pull request #248 from Vigilans/fix/specprefill-phase4-decode
Thump604 Apr 17, 2026
97e9e58
chat: forward chat_template_kwargs on simple-engine paths (#218)
krystophny Apr 17, 2026
cbdbcae
server: add OpenAI-compatible /v1/responses endpoint (#214)
krystophny Apr 17, 2026
489418a
bench-serve task 1: add prompt set JSON files
Thump604 Apr 17, 2026
6749aa2
bench-serve task 2: data structures and combinatorial sweep expansion
Thump604 Apr 17, 2026
bbb7136
bench-serve task 3: server auto-detection and hardware fingerprint
Thump604 Apr 17, 2026
31ac28d
bench-serve task 4: SSE streaming core, token counting, and request t…
Thump604 Apr 17, 2026
f6fcfa3
feat(bench-serve): Task 5 — concurrent execution, validation, summary…
Thump604 Apr 17, 2026
985ea29
feat(bench-serve): Task 6 — output formatters (table, JSON, CSV, SQL)
Thump604 Apr 17, 2026
7b71f2a
feat(bench-serve): CLI integration and main orchestrator
Thump604 Apr 17, 2026
31babf7
test(bench-serve): Task 8 — integration smoke test class
Thump604 Apr 17, 2026
dc6192c
fix(bench-serve): review fixes — NaN SQL, package data, import order
Thump604 Apr 17, 2026
e0c167c
fix(bench-serve): remove unused imports (ruff F401)
Thump604 Apr 17, 2026
d4ca8d9
Fix 3 async tests. (#368)
perry2of5 Apr 18, 2026
2ae06f3
Avoid concurrent metal ops to fix 360 (#363)
perry2of5 Apr 18, 2026
3e6bcd4
Change QuantizedKVCache to _QuantizedCacheWrapper to fix tests (#369)
perry2of5 Apr 18, 2026
baedcd3
Ensure the stream generator is closed on the correct thread to fix #3…
perry2of5 Apr 18, 2026
b73e9b3
Fix constrained decoding enforcer stuck + MLLM gpu-memory-utilization…
janhilgard Apr 18, 2026
6d34650
fix(streaming): guarantee data: [DONE] SSE event on all paths (#302)
Thump604 Apr 18, 2026
1a19b66
fix(scheduler): respect --chunked-prefill-tokens 0 when memory cache …
Thump604 Apr 18, 2026
8043dc4
fix(server): stream structured tool calls without parser flags (#304)
Thump604 Apr 18, 2026
ce50521
fix(server): protect Anthropic and utility endpoints (#324)
Thump604 Apr 18, 2026
0ce235c
fix(mllm): block SSRF in remote media fetches (#325)
Thump604 Apr 18, 2026
05b991d
fix(mllm): block local path traversal in multimodal input (#327)
Thump604 Apr 18, 2026
78584f9
fix(server): require explicit trust_remote_code opt-in (#328)
Thump604 Apr 18, 2026
6ebe186
fix(server): reject arbitrary endpoint model loads (#330)
Thump604 Apr 18, 2026
49295c5
fix(mcp): block interpreter inline execution flags (#331)
Thump604 Apr 18, 2026
0aa9f5e
fix(mcp): block high-risk tools by default (#343)
Thump604 Apr 18, 2026
8302c21
fix(server): enforce MCP sandbox on execute endpoint
Thump604 Apr 14, 2026
0b5e3f2
fix(mcp): harden newline and traversal validation
Thump604 Apr 14, 2026
6aa028d
fix(audio): enforce endpoint resource limits
Thump604 Apr 14, 2026
bc8c06f
fix(server): bind localhost by default
Thump604 Apr 14, 2026
e2cd498
fix(auto-parser): support bare bracket tool calls
Thump604 Apr 16, 2026
546dd0c
Merge pull request #329 from Thump604/codex/issue322-mcp-execute-sandbox
Thump604 Apr 18, 2026
173e5bc
Merge pull request #333 from Thump604/codex/issue68-mcp-regex-bypass
Thump604 Apr 18, 2026
250fa2b
Merge pull request #335 from Thump604/codex/issue68-audio-resource-li…
Thump604 Apr 18, 2026
b154ff9
Merge pull request #337 from Thump604/codex/issue68-default-bind-loca…
Thump604 Apr 18, 2026
af9b7c8
fix(server): sanitize logs and error details
Thump604 Apr 14, 2026
0eca7af
fix(server): cap request max_tokens
Thump604 Apr 14, 2026
fa58fbb
fix(test): remove duplicate get_engine() before max_tokens validation
Thump604 Apr 16, 2026
188016b
Merge pull request #305 from Thump604/codex/issue146-bare-bracket-tools
Thump604 Apr 18, 2026
f4df457
Merge pull request #339 from Thump604/codex/issue68-max-tokens
Thump604 Apr 18, 2026
67f3eee
Merge pull request #341 from Thump604/codex/issue68-log-sanitize
Thump604 Apr 18, 2026
023f2c8
Merge pull request #308 from Thump604/feat/reranker-endpoint
Thump604 Apr 18, 2026
1c3ae11
Merge pull request #309 from Thump604/feat/ssd-kv-cache-tiering
Thump604 Apr 18, 2026
827fe13
Merge pull request #366 from waybarrios/feat/bench-serve
Thump604 Apr 18, 2026
c249967
fix: preserve reasoning with non-streaming tool calls
Thump604 Apr 14, 2026
e9fdb11
chore: tighten reasoning helper typing
Thump604 Apr 14, 2026
2645578
style: format reasoning tool call fix
Thump604 Apr 14, 2026
34b0d60
style: format rebased reasoning tool call fix
Thump604 Apr 18, 2026
7fabd73
Merge pull request #315 from Thump604/codex/issue161-reasoning-tool-c…
Thump604 Apr 18, 2026
746be7c
Fix async audio tests @pytest.mark.asyncio -> @pytest.mark.anyio
perry2of5 Apr 18, 2026
5940893
Fix event loop errors in tests.
perry2of5 Apr 18, 2026
a34592f
[NEW FEATURE] --warm-prompts for agent cold-start TTFT + bench-serve …
waybarrios Apr 19, 2026
86e1725
Fix event loop errors in async SSD cache tests
Thump604 Apr 20, 2026
be7f4e8
Fix async audio tests: asyncio -> anyio markers
Thump604 Apr 20, 2026
9224e55
feat(mllm): add audio_url support in chat()
Thump604 Apr 20, 2026
c5e6054
[BUG] gemma4: tool-call stop token, nullable arg parsing, streaming r…
waybarrios Apr 21, 2026
fa5edcc
[BUG] gemma4 CB leaks content across requests via prefix cache LCP (#…
waybarrios Apr 21, 2026
e5ae7de
fix: pass max_tokens to create_anthropic_message to prevent truncatio…
nakedcity Apr 21, 2026
ec1bb7c
stop swallowing CancelledError in the engine and scheduler loops and …
ArdaTX Apr 22, 2026
cddfe04
release 0.2.9
waybarrios Apr 22, 2026
72dfa19
sync FastAPI app version to 0.2.9
waybarrios Apr 22, 2026
0d249de
[DOCS] refresh README and translate docs to ES/FR/ZH (#395)
waybarrios Apr 23, 2026
e3eea88
feat: add lifecycle-managed residency for default server model
lyonsno Apr 23, 2026
6763f82
fix: recompute stale Qwen3.5 MLLM position ids
perry2of5 Apr 23, 2026
d99e7e2
fix: fall back to tokenizer chat template for MLLM processors
perry2of5 Apr 23, 2026
28b4710
fix: strip reasoning markers from tool-call responses
janhilgard Apr 23, 2026
84724d9
fix: stabilize Qwen serving feature stack
Thump604 Apr 23, 2026
bba6ace
Fix MLLM stream ownership and thinking handoff (#399)
Thump604 Apr 23, 2026
9593980
Fix MLLM batch extension for opaque caches (#401)
Thump604 Apr 23, 2026
e2a81a0
fix: keep repeated think blocks out of final content (#402)
Thump604 Apr 23, 2026
879defb
Fix stream cancellation and finished output cleanup (#403)
Thump604 Apr 23, 2026
8156440
Fix MLLM request-local sampling controls
Thump604 Apr 24, 2026
091406d
fix: streaming tool calls drop for Qwen3.6 bracket format (#374)
mikepixelmagic-dev Apr 24, 2026
32006a9
Fix MLX stream thread affinity
waybarrios Apr 24, 2026
4b8b6e0
Fix Qwen MTP RMSNorm weight convention
janhilgard Apr 24, 2026
f529f4d
Support server default chat template kwargs
perry2of5 Apr 24, 2026
86201e0
Stabilize batching performance test warmup
sjswerdloff Apr 24, 2026
02c7d7a
Align pre-commit hooks with CI
perry2of5 Apr 24, 2026
1a77dab
Fix hybrid cache eval in MLLM chunked prefill
sjswerdloff Apr 24, 2026
492ea51
Prevent MLLM preprocessing from blocking event loop (#416)
janhilgard Apr 24, 2026
fa3dd78
Add forced tool_choice, fix response schema nulls, TCP keepalive, and…
janhilgard Apr 24, 2026
db7b86a
Fix O(n^2) performance in JSONSchemaLogitsProcessor
janhilgard Apr 24, 2026
34eb863
Fix tests/test_engine_core_stream_safety.py::test_engine_core_no_cros…
perry2of5 Apr 24, 2026
0dd6419
Fix TimeoutError: tests/test_engine_core_thread_streams.py::test_mllm…
perry2of5 Apr 24, 2026
66ff1cc
Merge pull request #421 from perry2of5/fix/scheduler-logged-cross-thr…
Thump604 Apr 24, 2026
3f2ceb7
Merge pull request #415 from janhilgard/fix/json-logits-processor-o-n…
Thump604 Apr 24, 2026
dcdce95
Merge upstream: tool call fixes, stream safety, SSD cache, warm prompts
arozanov Apr 25, 2026
51ed523
Fix kimi tool parser: func_name extraction + remove dead code
arozanov Apr 25, 2026
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
@@ -131,8 +131,21 @@ jobs:
          tests/test_mllm_cache.py \
          tests/test_optimizations.py \
          tests/test_simple_engine.py \
          tests/test_chat_template_kwargs.py \
          tests/test_batching.py \
          tests/test_continuous_batching.py \
          tests/test_memory_cache_mlx.py \
          -v --tb=short \
          -m "not slow" \
          -k "not Integration"

    - name: Run EngineCore stream-affinity regression tests
      run: |
        # Fresh process so globals like mlx_lm.generate.generation_stream
        # are not rebound by earlier tests (see issue #407).
        pytest \
          tests/test_batching_deterministic.py \
          tests/test_engine_core_stream_safety.py \
          -v --tb=short \
          -m "not slow" \
          -k "not Integration"
10 changes: 8 additions & 2 deletions .pre-commit-config.yaml
@@ -13,8 +13,14 @@ repos:
    rev: v0.1.9
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
      - id: ruff-format
        args: [--select, E,F,W, --ignore, E402,E501,E731,F811,F841]

  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black
        args: [--check]
        files: ^(vllm_mlx|tests)/

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
302 changes: 302 additions & 0 deletions README.es.md
@@ -0,0 +1,302 @@
# vllm-mlx

**Read this in other languages:** [English](README.md) · [Español](README.es.md) · [Français](README.fr.md) · [中文](README.zh.md)

**Continuous batching + OpenAI and Anthropic APIs in a single server. Native inference on Apple Silicon.**

[![PyPI version](https://img.shields.io/pypi/v/vllm-mlx.svg)](https://pypi.org/project/vllm-mlx/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/vllm-mlx.svg)](https://pypi.org/project/vllm-mlx/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Apple Silicon](https://img.shields.io/badge/Apple-Silicon-black.svg)](https://support.apple.com/en-us/HT211814)
[![GitHub stars](https://img.shields.io/github/stars/waybarrios/vllm-mlx.svg?style=social)](https://github.com/waybarrios/vllm-mlx)

---

## What is vllm-mlx?

A vLLM-style inference server for Apple Silicon Macs. Unlike using `Ollama` or `mlx-lm` directly, it ships **continuous batching, a paged KV cache, prefix caching, and SSD-tiered KV cache**, and exposes **both the OpenAI `/v1/*` and Anthropic `/v1/messages` APIs** from a single process. It runs LLMs, vision, audio, and embedding models on Metal with unified memory, with no conversion step.

## Quick start (30 seconds)

```bash
pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
```

**OpenAI SDK:**

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hello!"}])
print(r.choices[0].message.content)
```

**Anthropic SDK / Claude Code:**

```bash
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```

## Features

### APIs
- **OpenAI-compatible**: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/rerank`, `/v1/responses`
- **Anthropic-compatible**: `/v1/messages` (streaming, tool use, system prompts)
- **MCP tool calling**: 12 parsers (OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, and more)
- **Structured output**: JSON Schema via `response_format` (lm-format-enforcer)

### Throughput and memory
- **Continuous batching**: high throughput for concurrent requests
- **Paged KV cache**: memory-efficient, with prefix sharing
- **SSD KV cache**: spill the prefix cache to disk for long-context agents (`--ssd-cache-dir`)
- **Warm prompts**: preload popular prefixes at startup (`--warm-prompts`) for 1.3-2.25x faster TTFT (see the sketch after this list)
- **Prefix cache**: trie-based, shared across requests
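
A minimal sketch combining the disk tier with warm prompts. The flag names match the feature list above; the cache path, size cap, and prompt-file argument are illustrative assumptions:

```bash
# Spill evicted prefix-cache entries to a 50 GB disk tier and pre-warm
# popular prefixes at startup. The path, size cap, and prompt file
# shown here are placeholder values.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
  --continuous-batching \
  --ssd-cache-dir ~/.cache/vllm-mlx/kv \
  --ssd-cache-max-gb 50 \
  --warm-prompts warm_prompts.json
```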

### Multimodal
- **Text + image + video + audio** from a single server
- Vision models: Gemma 3, Gemma 4, Qwen3-VL, Pixtral, Llama vision
- **Audio input** in chat (`audio_url` blocks; see the sketch after this list)
- **Native TTS**: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
- **STT**: Whisper family, with RTF up to 197x on an M4 Max
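
A hedged sketch of sending audio in a chat request; the `audio_url` block shape is assumed to mirror the `image_url` convention used in the multimodal example below:

```python
# Hypothetical audio_url content block; the exact field layout is an
# assumption modeled on the image_url convention.
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Transcribe and summarize this clip."},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}},
    ]}],
)
print(r.choices[0].message.content)
```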

### Reasoning and advanced
- **Reasoning extraction**: Qwen3, DeepSeek-R1 (`--reasoning-parser`)
- **MoE expert reduction**: `--moe-top-k` for a +7-16% speedup on Qwen3-30B-A3B
- **Speculative decoding**: `--mtp` for Qwen3-Next
- **Sparse prefill**: attention-based `--spec-prefill` to cut TTFT (see the sketch after this list)
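
A couple of hedged serve invocations for these flags; the top-k value is illustrative and the Qwen3-Next model id is a placeholder:

```bash
# MoE expert reduction; the top-k value here is illustrative:
vllm-mlx serve mlx-community/Qwen3-30B-A3B-4bit --moe-top-k 6

# Speculative decoding plus attention-based sparse prefill; whether these
# flags take extra arguments is not shown here:
vllm-mlx serve <qwen3-next-model> --mtp --spec-prefill
```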

### Observability
- **Prometheus metrics**: `/metrics` endpoint via `--metrics`
- **Built-in benchmarker**: `vllm-mlx bench-serve` for prompt sweeps with CSV/JSON output

### Native GPU acceleration
- Apple Silicon only (M1, M2, M3, M4), with Metal kernels via MLX
- Unified memory, no model conversion

## Performance

**LLM decode (M4 Max, 128 GB, greedy, single stream):**

| Model | Tok/s | Memory |
|-------|------:|-------:|
| Qwen3-0.6B-8bit | 417.9 | 0.7 GB |
| Llama-3.2-3B-Instruct-4bit | 205.6 | 1.8 GB |
| Qwen3-30B-A3B-4bit | 127.7 | ~18 GB |

**Audio speech-to-text (M4 Max, RTF = real-time factor):**

| Model | RTF | Use case |
|-------|----:|----------|
| whisper-tiny | 197x | Real-time / low latency |
| whisper-large-v3-turbo | 55x | Quality + speed |
| whisper-large-v3 | 24x | Maximum accuracy |

See [docs/benchmarks/](docs/benchmarks/) for continuous batching results, KV cache quantization (4-bit / 8-bit / fp16), and MoE top-k sweeps.

## Examples

### Anthropic API (Claude Code, OpenCode)

```bash
vllm-mlx serve mlx-community/Qwen3-8B-4bit --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```

### Reasoning models (Qwen3, DeepSeek-R1)

```bash
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
```

```python
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print("Thinking:", r.choices[0].message.reasoning)
print("Answer:", r.choices[0].message.content)
```

### Multimodal (image + text)

```bash
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
```

```python
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ]}],
)
```

### Structured output (JSON Schema)

```python
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "schema": {"type": "object", "properties": {"colors": {"type": "array", "items": {"type": "string"}}}}
        },
    },
)
```
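
Because decoding is constrained to the schema, the returned message content should be directly parseable; a small usage sketch:

```python
import json

# Constrained decoding keeps the output valid against the schema,
# so the message content parses as JSON.
data = json.loads(r.choices[0].message.content)
print(data["colors"])  # e.g. ["red", "green", "blue"]
```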

### Reranking (`/v1/rerank`)

```bash
curl http://localhost:8000/v1/rerank -H 'Content-Type: application/json' -d '{
  "model": "default",
  "query": "inference on apple silicon",
  "documents": ["MLX is the Apple ML framework", "Metal kernels on M-series", "CUDA on NVIDIA"]
}'
```
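
Serving `/v1/rerank` requires a reranker to be loaded; a minimal sketch assuming the `--rerank-model` serve flag, with both model ids as placeholders:

```bash
# Load a reranker alongside the LLM so /v1/rerank is served; both model
# ids below are placeholders.
vllm-mlx serve <llm-model> --rerank-model <reranker-model>
```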

### Embeddings

```bash
vllm-mlx serve <llm-model> --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
```

```python
emb = client.embeddings.create(model="mlx-community/all-MiniLM-L6-v2-4bit", input=["Hello", "World"])
```

### Audio (TTS / STT)

```bash
pip install vllm-mlx[audio]
brew install espeak-ng  # macOS, required for non-English TTS

python examples/tts_example.py "Hello, how are you?" --play
python examples/tts_multilingual.py "Hola mundo" --lang es --play
```

### Built-in benchmarking

```bash
vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv
```

### Prometheus metrics

```bash
vllm-mlx serve <model> --metrics
curl http://localhost:8000/metrics
```

## Installation

**Using uv (recommended):**

```bash
uv tool install vllm-mlx   # system-wide CLI
# or inside a project
uv pip install vllm-mlx
```

**Using pip:**

```bash
pip install vllm-mlx

# Audio extras
pip install vllm-mlx[audio]
brew install espeak-ng
python -m spacy download en_core_web_sm
```

**From source:**

```bash
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
```

See the [Installation guide](docs/getting-started/installation.md) for all options.

## Documentation

- **Getting started**: [Installation](docs/getting-started/installation.md) · [Quickstart](docs/getting-started/quickstart.md)
- **Servers and APIs**: [OpenAI server](docs/guides/server.md) · [Anthropic Messages API](docs/guides/server.md#anthropic-messages-api) · [Python API](docs/guides/python-api.md)
- **Features**: [Multimodal](docs/guides/multimodal.md) · [Audio](docs/guides/audio.md) · [Embeddings](docs/guides/embeddings.md) · [Reasoning](docs/guides/reasoning.md) · [MCP and tool calling](docs/guides/mcp-tools.md) · [Tool parsers](docs/guides/tool-calling.md)
- **Performance**: [Continuous batching](docs/guides/continuous-batching.md) · [Warm prompts](docs/guides/warm-prompts.md) · [MoE top-k](docs/guides/moe-top-k.md)
- **Reference**: [CLI](docs/reference/cli.md) · [Models](docs/reference/models.md) · [Configuration](docs/reference/configuration.md)
- **Benchmarks**: [LLM](docs/benchmarks/llm.md) · [Image](docs/benchmarks/image.md) · [Video](docs/benchmarks/video.md) · [Audio](docs/benchmarks/audio.md)

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                           vllm-mlx server                           │
│    OpenAI /v1/* · Anthropic /v1/messages · /v1/rerank · /metrics    │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│  Continuous batching · Paged KV cache · Prefix cache · SSD tiering  │
└─────────────────────────────────────────────────────────────────────┘
        ┌─────────────────┬────────┴────────┬─────────────────┐
        ▼                 ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │    mlx-vlm    │ │   mlx-audio   │ │mlx-embeddings │
│    (LLMs)     │ │   (Vision)    │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                MLX · Metal kernels · unified memory                 │
└─────────────────────────────────────────────────────────────────────┘
```

## Contributing

Bug fixes, performance work, docs, and benchmarks across different Apple Silicon chips are welcome. See the [Contributing guide](docs/development/contributing.md).

## License

Apache 2.0. See [LICENSE](LICENSE).

## Citation

```bibtex
@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vllm-mlx: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
```

## Acknowledgments

- [MLX](https://github.com/ml-explore/mlx). Apple's ML framework.
- [mlx-lm](https://github.com/ml-explore/mlx-lm). LLM inference library.
- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm). Vision-language models.
- [mlx-audio](https://github.com/Blaizzy/mlx-audio). Text-to-Speech and Speech-to-Text.
- [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings). Text embeddings.
- [Rapid-MLX](https://github.com/raullenchai/Rapid-MLX). Community fork of vllm-mlx.
- [vLLM](https://github.com/vllm-project/vllm). High-throughput LLM serving. vllm-mlx is inspired by vLLM and adopts its continuous-batching and paged KV-cache design for Apple Silicon via MLX.

## Star history

[![Star History Chart](https://api.star-history.com/svg?repos=waybarrios/vllm-mlx&type=Date)](https://star-history.com/#waybarrios/vllm-mlx&Date)

---

**If vllm-mlx has been useful to you, please star the repo. It helps more Apple Silicon devs find it.**