vllm-mlx

Read this in other languages: English · Español · Français · 中文

Continuous batching + OpenAI + Anthropic APIs in one server. Native Apple Silicon inference.

What is vllm-mlx?

A vLLM-style inference server for Apple Silicon Macs. Unlike Ollama or mlx-lm used directly, it ships continuous batching, paged KV cache, prefix caching, and SSD-tiered cache, and exposes both OpenAI /v1/* and Anthropic /v1/messages from a single process. Run LLMs, vision models, audio, and embeddings on Metal with unified memory, no conversion step.

Quick start (30 seconds)

pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hi!"}])
print(r.choices[0].message.content)

Anthropic SDK / Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

Features

APIs

OpenAI-compatible: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
Anthropic-compatible: /v1/messages (streaming, tool use, system prompts)
MCP Tool Calling: 12 parsers (OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, and more)
Structured output: JSON Schema via response_format (lm-format-enforcer)

Throughput & memory

Continuous batching: high throughput for concurrent requests
Paged KV cache: memory-efficient with prefix sharing
SSD-tiered KV cache: spill prefix cache to disk for long-context agents (--ssd-cache-dir)
Warm prompts: preload popular prefixes at startup (--warm-prompts) for 1.3-2.25x faster time to first token (TTFT)
Prefix cache: trie-based, shared across requests
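
To make the trie-based prefix cache idea concrete, here is a minimal sketch of longest-prefix lookup over token IDs. It is not vllm-mlx's implementation: `TrieNode`, `PrefixTrie`, and the string `kv_handle` are hypothetical names, and a real cache would point at KV blocks rather than strings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # token id -> TrieNode
    # Hypothetical handle to cached KV state for the prefix ending at this node.
    kv_handle: Optional[str] = None


class PrefixTrie:
    """Toy trie keyed by token IDs; marks where a cached prefix ends."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list, kv_handle: str) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens: list) -> tuple:
        """Return (matched_length, handle) for the longest cached prefix."""
        node = self.root
        best_len, best_handle = 0, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break  # no cached prefix continues with this token
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle
```

A request whose prompt starts with a cached system prompt only needs prefill for the tokens after `matched_length`, which is where the TTFT savings come from.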

Multimodal

Text + image + video + audio from one server
Vision models: Gemma 3, Gemma 4, Qwen3-VL, Pixtral, Llama vision
Audio input in chat (audio_url content blocks)
Native TTS: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
STT: Whisper family with RTF up to 197x on M4 Max

Reasoning & advanced

Reasoning extraction: Qwen3, DeepSeek-R1 (--reasoning-parser)
MoE expert reduction: --moe-top-k for +7-16% on Qwen3-30B-A3B
Speculative decoding: --mtp for Qwen3-Next
Sparse prefill: attention-based --spec-prefill for TTFT reduction

Observability

Prometheus metrics: /metrics endpoint with --metrics
Built-in benchmarker: vllm-mlx bench-serve for prompt sweeps with CSV/JSON output

Native GPU acceleration

Apple Silicon only (M1, M2, M3, M4, M5) with Metal kernels via MLX
Unified memory, no model conversion

Performance

LLM decode (M4 Max, 128 GB, greedy, single stream):

Model	Tok/s	Memory
Qwen3-0.6B-8bit	417.9	0.7 GB
Llama-3.2-3B-Instruct-4bit	205.6	1.8 GB
Qwen3-30B-A3B-4bit	127.7	~18 GB

Audio speech-to-text (M4 Max, RTF = real-time factor):

Model	RTF	Use case
whisper-tiny	197x	Real-time / low latency
whisper-large-v3-turbo	55x	Quality + speed
whisper-large-v3	24x	Highest accuracy
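
As a rough reading aid for the RTF column: assuming RTF here means audio duration divided by processing time (higher is faster, which matches the table's ordering), the implied transcription time can be estimated as below. The function name is illustrative only.

```python
def transcription_seconds(audio_seconds: float, rtf: float) -> float:
    """Processing time implied by a speed-style real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# RTF values copied from the table above; one hour of audio as the input.
for model, rtf in [
    ("whisper-tiny", 197),
    ("whisper-large-v3-turbo", 55),
    ("whisper-large-v3", 24),
]:
    seconds = transcription_seconds(3600, rtf)
    print(f"{model}: 1 h of audio in about {seconds:.1f} s")
```

Under that assumption, one hour of audio takes roughly 18 s, 65 s, and 150 s respectively on the benchmarked M4 Max.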

See docs/benchmarks/ for continuous-batching results, KV-cache quantization (4-bit / 8-bit / fp16), and MoE top-k sweeps.

Examples

Anthropic API (Claude Code, OpenCode)

vllm-mlx serve mlx-community/Qwen3-8B-4bit --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
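
For scripts that do not go through Claude Code, a plain HTTP request to /v1/messages might look like the sketch below. It uses only the standard library and the public Anthropic Messages request shape. The `"default"` model name mirrors the OpenAI examples and is an assumption here, as is whether the server inspects the `x-api-key` and `anthropic-version` headers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same value as ANTHROPIC_BASE_URL above


def build_messages_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to the Anthropic-style /v1/messages route."""
    body = {
        "model": "default",  # assumption: mirrors the OpenAI examples
        "max_tokens": max_tokens,  # required by the Anthropic Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": "not-needed",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply; needs a running server."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `send(build_messages_request("Hi!"))` returns the decoded reply. If the response follows the Anthropic shape, the text is under `content[0]["text"]`.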

Reasoning models (Qwen3, DeepSeek-R1)

vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print("Thinking:", r.choices[0].message.reasoning)
print("Answer:",   r.choices[0].message.content)
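
Conceptually, reasoning extraction separates the model's thinking from its final answer. Qwen3 and DeepSeek-R1 wrap their thinking in `<think>...</think>` tags, so a simplified client-side version could look like this. It is a sketch of the idea, not the server's parser, and `split_reasoning` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first <think> block is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) from raw model output with <think> tags."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()  # no thinking block: everything is the answer
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

This can be handy as a fallback when the server runs without --reasoning-parser and the tags arrive inline in `message.content`.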

Multimodal (image + text)

vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ]}],
)
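
For a local image, the OpenAI chat format also allows a base64 data URL in `image_url`. Whether vllm-mlx accepts data URLs is an assumption here, so treat this as a sketch to try rather than documented behavior. Both helper names are illustrative.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(question: str, image_bytes: bytes) -> dict:
    """Build a user message with one text block and one inline image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
        ],
    }
```

Usage would be `messages=[image_message("What is in this image?", open("cat.jpg", "rb").read())]`, where `cat.jpg` is a placeholder path.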

Structured output (JSON Schema)

r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "schema": {"type": "object", "properties": {"colors": {"type": "array", "items": {"type": "string"}}}}
        },
    },
)
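
The constrained output still arrives as a JSON string in `message.content`, so the client has to parse it. A small sketch that parses it and re-checks the shape from the schema above, using only the standard library (`parse_colors` is a hypothetical helper):

```python
import json


def parse_colors(content: str) -> list:
    """Parse the JSON text and check it against the schema used above."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # The schema does not mark "colors" as required, so default to empty.
    colors = data.get("colors", [])
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError("'colors' must be an array of strings")
    return colors
```

Call it as `parse_colors(r.choices[0].message.content)`. The explicit checks are redundant when enforcement works, but they are cheap insurance against truncated output.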

Reranking (/v1/rerank)

curl http://localhost:8000/v1/rerank -H 'Content-Type: application/json' -d '{
  "model": "default",
  "query": "apple silicon inference",
  "documents": ["MLX is Apples framework", "Metal kernels on M-series", "CUDA on NVIDIA"]
}'

The built-in MLX reranker forward path supports standard BERT/XLM-RoBERTa sequence-classification weights with gelu, gelu_new/gelu_fast, relu, or silu/swish hidden_act values. Other activations fail explicitly so custom reranker architectures can add a dedicated adapter instead of silently using the wrong activation.
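
The fail-explicitly behavior described above can be pictured as a lookup table with no default. The sketch below uses scalar math for readability, whereas the real forward path operates on MLX arrays. Mapping both gelu_new and gelu_fast to the tanh approximation is an assumption based on how the names are grouped above, and all function names are illustrative.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU, also known as swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


ACTIVATIONS = {
    "gelu": gelu,
    "gelu_new": gelu_tanh,
    "gelu_fast": gelu_tanh,
    "relu": relu,
    "silu": silu,
    "swish": silu,
}


def get_activation(hidden_act: str):
    """Look up an activation; unknown names raise instead of falling back."""
    try:
        return ACTIVATIONS[hidden_act]
    except KeyError:
        raise ValueError(f"unsupported hidden_act: {hidden_act!r}") from None
```

Raising on an unknown name means a model with, say, a `mish` activation fails at load time instead of producing subtly wrong scores.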

Embeddings

vllm-mlx serve <llm-model> --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

emb = client.embeddings.create(model="mlx-community/all-MiniLM-L6-v2-4bit", input=["Hello", "World"])
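
A typical next step is comparing two returned vectors. A dependency-free cosine similarity, shown as a sketch (for large batches a NumPy or MLX version would be the practical choice):

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the OpenAI SDK response shape, the call would be `cosine_similarity(emb.data[0].embedding, emb.data[1].embedding)`.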

Audio (TTS / STT)

pip install 'vllm-mlx[audio]'   # quoted so zsh does not glob the brackets
brew install espeak-ng        # macOS, needed for non-English TTS

python examples/tts_example.py "Hello, how are you?" --play
python examples/tts_multilingual.py "Hola mundo" --lang es --play

Built-in benchmarking

vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv

# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json

# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db

Model acquisition and conversion

# Inspect repo metadata, file sizes, config, and rough fit before downloading weights
vllm-mlx model inspect mlx-community/Llama-3.2-3B-Instruct-4bit

# Acquire with resumable Hugging Face transfer and write a local artifact manifest
vllm-mlx model acquire mlx-community/Llama-3.2-3B-Instruct-4bit --target-dir ./models/llama-3b-4bit

# Wrap mlx-lm conversion and record the exact recipe in the converted artifact
vllm-mlx model convert meta-llama/Llama-3.2-3B-Instruct --output ./models/llama-3b-mlx-q4 --quantize --q-bits 4 --q-group-size 64 --q-mode affine
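
To get a feel for what --q-bits and --q-group-size imply for size, here is a back-of-the-envelope estimate. It assumes affine quantization stores one 16-bit scale and one 16-bit bias per group, and that every weight is quantized. Real artifacts differ somewhat, so treat the result as a rough fit check only. The function names and the 3.2 billion parameter count are illustrative assumptions.

```python
def bits_per_weight(q_bits: int, group_size: int,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight, including per-group scale and bias overhead."""
    if q_bits <= 0 or group_size <= 0:
        raise ValueError("q_bits and group_size must be positive")
    return q_bits + (scale_bits + bias_bits) / group_size


def approx_weight_gb(n_params: float, q_bits: int, group_size: int) -> float:
    """Approximate weight size in GB (10**9 bytes), weights only."""
    return n_params * bits_per_weight(q_bits, group_size) / 8 / 1e9


# Assumed parameter count of about 3.2 billion for a 3B-class model.
print(bits_per_weight(4, 64))
print(round(approx_weight_gb(3.2e9, 4, 64), 2))
```

Under these assumptions, 4-bit weights with group size 64 cost 4.5 bits each, which puts a 3B-class model at roughly 1.8 GB of weights before KV cache and runtime overhead.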

Prometheus metrics

vllm-mlx serve <model> --metrics
curl http://localhost:8000/metrics
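
The /metrics body is expected to use the standard Prometheus text exposition format, so a quick script can read it without a Prometheus server. A minimal parser sketch follows. The metric names in the sample are made up for illustration and are not vllm-mlx's actual metric names.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name{labels} value [timestamp]' lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            samples[series] = float(rest[0])  # float() also accepts NaN and +Inf
    return samples


SAMPLE = """\
# HELP example_requests_total Hypothetical request counter.
# TYPE example_requests_total counter
example_requests_total{route="/v1/chat/completions"} 42
example_queue_depth 3
"""
print(parse_metrics(SAMPLE))
```

Feed it the text returned by the curl command above to diff counters before and after a benchmark run.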

Installation

Using uv (recommended):

uv tool install vllm-mlx                 # CLI, system-wide
# or in a project
uv pip install vllm-mlx

Using pip:

pip install vllm-mlx

# Audio extras
pip install 'vllm-mlx[audio]'   # quoted so zsh does not glob the brackets
brew install espeak-ng
python -m spacy download en_core_web_sm

From source:

git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

See the Installation Guide for full options.

Documentation

Getting started: Installation · Quick Start
Servers & APIs: OpenAI server · Anthropic Messages API · Python API
Features: Multimodal · Audio · Embeddings · Reasoning · MCP & Tool Calling · Tool Parsers
Performance: Continuous Batching · Multi-Model Serving · Warm Prompts · MoE Top-K
Reference: CLI · Models · Configuration
Benchmarks: LLM · Image · Video · Audio

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vllm-mlx Server                               │
│   OpenAI /v1/*  ·  Anthropic /v1/messages  ·  /v1/rerank  ·  /metrics   │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Continuous batching · Paged KV cache · Prefix cache · SSD tiering      │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│    (LLMs)     │ │  (Vision)     │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                   MLX · Metal kernels · Unified memory                  │
└─────────────────────────────────────────────────────────────────────────┘
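
To illustrate why the continuous-batching layer in the diagram matters, the toy simulation below counts decode steps for a mixed workload. It models only slot scheduling (one token per active request per step) and ignores prefill, memory limits, and everything else the real scheduler handles. All names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(lengths: list, max_batch: int) -> int:
    """Decode steps needed when freed slots are refilled immediately."""
    if max_batch < 1 or any(n < 1 for n in lengths):
        raise ValueError("max_batch and all lengths must be at least 1")
    waiting = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into free slots before every step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One step emits one token per active request; finished ones leave.
        active = [n - 1 for n in active if n - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths: list, max_batch: int) -> int:
    """Decode steps when each batch runs until its longest request ends."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))


lengths = [8, 2, 2, 2]  # tokens to generate per request
print(continuous_batching_steps(lengths, max_batch=2))
print(static_batching_steps(lengths, max_batch=2))
```

For this workload the continuous scheduler needs 8 steps against 10 for static batches, because short requests stop holding a slot as soon as they finish.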

Contributing

Bug fixes, performance work, docs, and benchmarks on different Apple Silicon chips are all welcome. See the Contributing Guide.

License

Apache 2.0. See LICENSE.

Citation

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title  = {vllm-mlx: Apple Silicon MLX Backend for vLLM},
  year   = {2025},
  url    = {https://github.com/waybarrios/vllm-mlx},
  note   = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments

MLX. Apple's ML framework.
mlx-lm. LLM inference library.
mlx-vlm. Vision-language models.
mlx-audio. Text-to-Speech and Speech-to-Text.
mlx-embeddings. Text embeddings.
Rapid-MLX. Community fork of vllm-mlx.
vLLM. High-throughput LLM serving. vllm-mlx is inspired by vLLM and adopts its continuous-batching and paged KV-cache design for Apple Silicon via MLX.

Star history

If vllm-mlx helped you, please star the repo. It helps more Apple Silicon devs find it.
