Read this in other languages: English · Español · Français · 中文
Continuous batching + OpenAI + Anthropic APIs in one server. Native Apple Silicon inference.
A vLLM-style inference server for Apple Silicon Macs. Unlike Ollama or mlx-lm used directly, it ships continuous batching, paged KV cache, prefix caching, and SSD-tiered cache, and exposes both OpenAI /v1/* and Anthropic /v1/messages from a single process. Run LLMs, vision models, audio, and embeddings on Metal with unified memory, no conversion step.
pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hi!"}])
print(r.choices[0].message.content)
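If the openai package is not installed, the same call can be sketched with the standard library alone. This is a minimal sketch, assuming the server accepts the usual OpenAI chat-completions JSON body at the address used above; the helper names `build_chat_request` and `chat` are illustrative and not part of vllm-mlx.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as the quick start above


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: first choice, message content.
    return body["choices"][0]["message"]["content"]
```

With the server from the quick start running, `print(chat("Hi!"))` should behave like the SDK example above.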
Anthropic SDK / Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
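To call the Anthropic-style endpoint without Claude Code, a request can be built by hand. The sketch below assumes the server follows the public Messages API shape (`max_tokens` is required there, and text comes back as `content` blocks of type `text`) and that the `default` model alias used in the OpenAI example also works here; both helper names are illustrative.

```python
import json
import urllib.request


def build_messages_request(prompt: str,
                           base_url: str = "http://localhost:8000",
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/messages."""
    payload = {
        "model": "default",        # assumption: same alias as the OpenAI example
        "max_tokens": max_tokens,  # required field in the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "content-type": "application/json",
        "x-api-key": "not-needed",
        "anthropic-version": "2023-06-01",
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Concatenate the text blocks of a Messages API response, skipping other block types."""
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )
```

Sending the request with `urllib.request.urlopen` and passing the decoded JSON to `extract_text` gives the reply text; tool-use blocks are ignored by this helper.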
- OpenAI endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
- Anthropic endpoint: /v1/messages (streaming, tool use, system prompts)
- Structured output: response_format (lm-format-enforcer)
- SSD-tiered cache (--ssd-cache-dir)
- Warm prompts (--warm-prompts) for 1.3-2.25x TTFT
- Audio (audio_url content blocks)
- Reasoning parsers (--reasoning-parser)
- --moe-top-k for +7-16% on Qwen3-30B-A3B
- --mtp for Qwen3-Next
- --spec-prefill for TTFT reduction
- /metrics endpoint with --metrics
- vllm-mlx bench-serve for prompt sweeps with CSV/JSON output

LLM decode (M4 Max, 128 GB, greedy, single stream):
| Model | Tok/s | Memory |
|---|---|---|
| Qwen3-0.6B-8bit | 417.9 | 0.7 GB |
| Llama-3.2-3B-Instruct-4bit | 205.6 | 1.8 GB |
| Qwen3-30B-A3B-4bit | 127.7 | ~18 GB |
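For a rough sense of where the Memory column comes from, weight size can be estimated from parameter count and bit width. This is a back-of-the-envelope sketch, not how the numbers above were measured. It assumes group-wise affine quantization with one fp16 scale and one fp16 bias per group of 64 weights, and the parameter counts are approximate published figures, not values from this README.

```python
def estimate_weight_gb(params_billions: float, bits: int, group_size: int = 64) -> float:
    """Approximate quantized weight size in GB (1 GB = 1e9 bytes)."""
    # Assumption: each group carries an fp16 scale and an fp16 bias, i.e. 32 extra bits.
    bits_per_weight = bits + 32 / group_size
    return params_billions * bits_per_weight / 8


# Approximate parameter counts (assumptions, in billions).
for name, params, bits in [
    ("Qwen3-0.6B-8bit", 0.6, 8),
    ("Llama-3.2-3B-Instruct-4bit", 3.21, 4),
    ("Qwen3-30B-A3B-4bit", 30.5, 4),
]:
    print(f"{name}: ~{estimate_weight_gb(params, bits):.2f} GB of weights")
```

The estimates land in the same ballpark as the table. They cover weights only; KV cache and runtime buffers come on top and grow with context length.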
Audio speech-to-text (M4 Max, RTF = real-time factor):
| Model | RTF | Use case |
|---|---|---|
| whisper-tiny | 197x | Real-time / low latency |
| whisper-large-v3-turbo | 55x | Quality + speed |
| whisper-large-v3 | 24x | Highest accuracy |
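To turn the RTF column into wall-clock time, divide the audio length by the factor. The snippet assumes "197x" means the model processes audio 197 times faster than real time, which is how the table appears to use the term.

```python
def transcription_seconds(audio_seconds: float, rtf: float) -> float:
    """Processing time for a clip, given a speed factor relative to real time."""
    return audio_seconds / rtf


one_hour = 3600.0
for name, rtf in [("whisper-tiny", 197), ("whisper-large-v3-turbo", 55), ("whisper-large-v3", 24)]:
    print(f"{name}: {transcription_seconds(one_hour, rtf):.1f} s per hour of audio")
# whisper-tiny: 18.3 s per hour of audio
# whisper-large-v3-turbo: 65.5 s per hour of audio
# whisper-large-v3: 150.0 s per hour of audio
```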
See docs/benchmarks/ for continuous-batching results, KV-cache quantization (4-bit / 8-bit / fp16), and MoE top-k sweeps.
vllm-mlx serve mlx-community/Qwen3-8B-4bit --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print("Thinking:", r.choices[0].message.reasoning)
print("Answer:", r.choices[0].message.content)
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
]}],
)
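For a local file instead of a public URL, OpenAI-style clients usually send the image as a base64 data URL in the same `image_url` field. Whether vllm-mlx accepts data URLs is an assumption here, and the helper names are illustrative.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(question: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a user message with one text part and one inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, mime_type)}},
        ],
    }
```

Usage would look like `messages=[image_message("What is in this image?", open("cat.jpg", "rb").read())]` in the call above.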
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "List 3 colors."}],
response_format={
"type": "json_schema",
"json_schema": {
"schema": {"type": "object", "properties": {"colors": {"type": "array", "items": {"type": "string"}}}}
},
},
)
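The schema above does not mark `colors` as required, so it is worth validating the reply before using it. A small sketch using only the standard library; `parse_colors` is an illustrative helper, not part of any SDK.

```python
import json


def parse_colors(content: str) -> list[str]:
    """Parse the model's JSON reply and return the list under "colors"."""
    data = json.loads(content)
    colors = data.get("colors") if isinstance(data, dict) else None
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError(f"reply does not contain a list of strings under 'colors': {content!r}")
    return colors
```

With the request above, `parse_colors(r.choices[0].message.content)` returns the list or raises if the reply does not match.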
Rerank (/v1/rerank):
curl http://localhost:8000/v1/rerank -H 'Content-Type: application/json' -d '{
"model": "default",
"query": "apple silicon inference",
"documents": ["MLX is Apples framework", "Metal kernels on M-series", "CUDA on NVIDIA"]
}'
The built-in MLX reranker forward path supports standard BERT/XLM-RoBERTa
sequence-classification weights with gelu, gelu_new/gelu_fast, relu, or
silu/swish hidden_act values. Other activations fail explicitly so custom
reranker architectures can add a dedicated adapter instead of silently using the
wrong activation.
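The fail-explicitly behaviour can be pictured as a lookup table that raises on unknown names. This is an illustrative scalar sketch in plain Python, not the project's MLX code; the names `ACTIVATIONS` and `resolve_activation` are made up for the example.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (the gelu_new / gelu_fast family)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    """x * sigmoid(x); also known as swish."""
    return x / (1.0 + math.exp(-x))


ACTIVATIONS = {
    "gelu": gelu,
    "gelu_new": gelu_tanh,
    "gelu_fast": gelu_tanh,
    "relu": relu,
    "silu": silu,
    "swish": silu,
}


def resolve_activation(hidden_act: str):
    """Return the activation for a config value, or raise instead of guessing."""
    try:
        return ACTIVATIONS[hidden_act]
    except KeyError:
        raise ValueError(
            f"unsupported hidden_act {hidden_act!r}; add a dedicated adapter"
        ) from None
```

Raising here means a checkpoint with an unlisted activation stops at load time rather than producing plausible but wrong scores.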
vllm-mlx serve <llm-model> --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
emb = client.embeddings.create(model="mlx-community/all-MiniLM-L6-v2-4bit", input=["Hello", "World"])
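Embeddings are usually compared with cosine similarity. A dependency-free sketch:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

With the response above, `cosine_similarity(emb.data[0].embedding, emb.data[1].embedding)` scores "Hello" against "World".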
pip install 'vllm-mlx[audio]' # quotes stop zsh from globbing the brackets
brew install espeak-ng # macOS, needed for non-English TTS
python examples/tts_example.py "Hello, how are you?" --play
python examples/tts_multilingual.py "Hola mundo" --lang es --play
vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv
# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json
# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db
# Inspect repo metadata, file sizes, config, and rough fit before downloading weights
vllm-mlx model inspect mlx-community/Llama-3.2-3B-Instruct-4bit
# Acquire with resumable Hugging Face transfer and write a local artifact manifest
vllm-mlx model acquire mlx-community/Llama-3.2-3B-Instruct-4bit --target-dir ./models/llama-3b-4bit
# Wrap mlx-lm conversion and record the exact recipe in the converted artifact
vllm-mlx model convert meta-llama/Llama-3.2-3B-Instruct --output ./models/llama-3b-mlx-q4 --quantize --q-bits 4 --q-group-size 64 --q-mode affine
vllm-mlx serve <model> --metrics
curl http://localhost:8000/metrics
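If /metrics serves the Prometheus text exposition format (an assumption; the README does not state the format), a few lines of Python are enough to read it without a Prometheus server. The metric names in the sample are hypothetical, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample line ("name{labels} value") to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        samples[name.strip()] = float(value)
    return samples


# Hypothetical sample; real metric names depend on the server.
sample = """\
# HELP example_requests_total Requests served.
# TYPE example_requests_total counter
example_requests_total 3
example_latency_seconds{quantile="0.5"} 0.25
"""
print(parse_metrics(sample))
```

In practice the text would come from `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`.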
Using uv (recommended):
uv tool install vllm-mlx # CLI, system-wide
# or in a project
uv pip install vllm-mlx
Using pip:
pip install vllm-mlx
# Audio extras
pip install 'vllm-mlx[audio]' # quotes stop zsh from globbing the brackets
brew install espeak-ng
python -m spacy download en_core_web_sm
From source:
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
See Installation Guide for full options.
┌─────────────────────────────────────────────────────────────────────────┐
│                             vllm-mlx Server                             │
│      OpenAI /v1/* · Anthropic /v1/messages · /v1/rerank · /metrics      │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│    Continuous batching · Paged KV cache · Prefix cache · SSD tiering    │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌─────────────┬────────────┴────────────┬─────────────┐
          ▼             ▼                         ▼             ▼
  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
  │    mlx-lm     │ │    mlx-vlm    │ │   mlx-audio   │ │mlx-embeddings │
  │    (LLMs)     │ │   (Vision)    │ │  (TTS + STT)  │ │ (Embeddings)  │
  └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                  MLX · Metal kernels · Unified memory                   │
└─────────────────────────────────────────────────────────────────────────┘
Bug fixes, performance work, docs, and benchmarks on different Apple Silicon chips are all welcome. See the Contributing Guide.
Apache 2.0. See LICENSE.
@software{vllm_mlx2025,
author = {Barrios, Wayner},
title = {vllm-mlx: Apple Silicon MLX Backend for vLLM},
year = {2025},
url = {https://github.com/waybarrios/vllm-mlx},
note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
If vllm-mlx helped you, please star the repo. It helps more Apple Silicon devs find it.