The native macOS desktop app for local AI on Apple Silicon

vMLX v2 — native Swift + Metal, 50–95 t/s on M-series.
Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.
The Python panel above remains available for legacy support.

Features • Screenshots • API Server • Image Generation • JANG Quantization • Requirements • Build • 한국어

MLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on mlx-community -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. Built on vMLX Engine and Apple's MLX framework.

JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:

Quantization MMLU (200q) Size
JANG_2L (2-bit) 74% 89 GB
MLX 4-bit 26.5% 120 GB
MLX 3-bit 24.5% 93 GB
MLX 2-bit 25% 68 GB

Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at jangq.ai. Models at JANGQ-AI.

Install

Option 1: Download the App (Recommended)

Download the latest DMG -- one file, ready to go.

Download vMLX-X.Y.Z-arm64.dmg
Open the DMG and drag to Applications
Launch -- that's it

All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.

Option 2: CLI via pip (Engine Only)

The vMLX inference engine is published on PyPI as vmlx -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.

hljs language-bash

# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

Note: On macOS 14+, pip install vmlx without a venv will fail with "externally-managed-environment". Use uv, pipx, or create a venv first.

Once running, your local OpenAI-compatible API server is live at http://localhost:8000. Point any OpenAI or Anthropic SDK client at it.

Quick Start

Launch MLX Studio from Applications
Pick a model -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., mlx-community/Qwen3-8B-4bit)
Start the session -- the model downloads automatically and the server starts
Chat -- switch to the Chat tab and start talking

That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.

Screenshots

Chat Interface Streaming conversations with thinking mode, code highlighting, and markdown	Agentic Coding Full tool calling with file I/O, shell execution, and web search
Image Generation & Editing Flux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit	Anthropic API Compatible Drop-in /v1/messages endpoint for Anthropic SDK clients
Developer Tools Convert, inspect, and diagnose models	Model Conversion GGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision
HuggingFace Browser Search and download models directly in-app	Menu Bar Running models, GPU memory, and quick controls

Features

Model Support (65+ Model Families)

Run any MLX model from HuggingFace -- thousands of models, zero configuration:

Text LLMs -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Mistral-Medium-3.5 (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, Laguna (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
Vision LLMs (VL) -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat
Multimodal Omni -- Nemotron-3-Nano-Omni (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across /v1/chat/completions, /v1/messages, /v1/responses, and /api/chat
Mixture-of-Experts -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)
Hybrid SSM Models -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)
Image Generation -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
Image Editing -- Qwen Image Edit (instruction-based editing, full precision)
Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
JANG Models -- Adaptive mixed-precision quantized models from JANGQ-AI, stay quantized in GPU memory via native QuantizedLinear
GGUF Import -- Convert GGUF models to MLX format directly in-app

OpenAI-Compatible API Server

Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:

POST /v1/chat/completions -- Chat Completions API with streaming, tool calling, vision, structured output
POST /v1/responses -- OpenAI Responses API (agentic format) with streaming
POST /v1/completions -- Text completions
POST /v1/images/generations -- Image generation (Flux/Z-Image models, OpenAI format with usage field)
POST /v1/images/edits -- Image editing (Qwen Image Edit, instruction-based)
POST /v1/embeddings -- Text embeddings with dimension control and batch processing
POST /v1/rerank -- Document reranking
POST /v1/audio/speech -- Text-to-speech (Kokoro TTS)
POST /v1/audio/transcriptions -- Speech-to-text (Whisper)
GET /v1/models -- List loaded models
GET /health -- Server health with VRAM usage, queue length, load times

Anthropic API Compatibility

Drop-in replacement for the Anthropic Claude API:

POST /v1/messages -- Anthropic Messages API format
Anthropic SDK tool calling format (auto-translated to internal format)
Vision/multimodal support via Anthropic content blocks
Use the Anthropic Python/TypeScript SDK -- just change the base_url to your local server
Copy-paste code snippets in the API tab for curl, Python, and JavaScript

Tool Calling & Agentic Workflows (14 Parsers)

Auto-detected tool call parsers for every major model family:

Qwen (qwen3, qwen2.5) -- <tool_call> XML format
Llama 3 -- <function=name> format
Mistral -- [TOOL_CALLS] format
Hermes -- <tool_call> JSON format
DeepSeek -- function call blocks
GLM-4.7 -- GLM tool format
MiniMax -- MiniMax function calling
Nemotron -- NVIDIA Nemotron tool format
Granite -- IBM Granite format
Functionary -- Functionary v3 format
XLAM -- Salesforce xLAM format
Kimi -- Moonshot Kimi format
Step-3.5 -- StepFun format
Auto-detection from model_type in config.json with regex name fallback

26+ Built-in Tools:

File I/O -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
Search -- ripgrep file search with regex and glob, glob file finder, unified diff
Execution -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
Web -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
Developer -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
Interactive -- ask_user tool for human-in-the-loop interrupts
Per-category toggles: enable/disable file, search, shell, web tools independently
Auto-continue agent loops (up to 10 tool iterations per request)
MCP (Model Context Protocol) -- connect external tool servers, merge tool definitions, execute MCP tools via API

Reasoning Model Support (4 Parsers)

Collapsible thinking blocks with dedicated parsing for reasoning models:

Qwen3 / Qwen3.5 -- <think>...</think> blocks
DeepSeek-R1 -- DeepSeek reasoning format
OpenAI GPT-OSS / GLM-4.7 -- GPT-OSS thinking format
Phi-4-reasoning -- reasoning content extraction
Enable/disable thinking per request
Reasoning effort control (low/medium/high)
Streaming reasoning content with proper tokenization

Vision & Multimodal (VLM)

Full multimodal input support for vision-language models:

Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
Audio -- Base64 or URL audio input (Qwen3-Audio)
Image detail levels: auto, low, high
Dedicated MLLM cache for image/video embeddings (separate from KV cache)
Send images directly in chat to any VL model

Continuous Batching & Concurrency

Production-grade multi-user serving:

Continuous batching -- handle 32+ concurrent requests with dynamic slot allocation
Prefill batching -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
Completion batching -- batch token generation across sequences
Stream interval control -- configure streaming frequency
Request pooling -- efficiently share GPU memory across concurrent sequences
Rate limiting -- optional per-client request limits
API key authentication -- optional --api-key flag for secured access

5-Layer Cache Stack

Multi-tier caching for maximum throughput and memory efficiency:

L1: Memory-Aware Prefix Cache -- token-level semantic caching with LRU eviction, configurable memory allocation
L1 alt: Paged KV Cache -- block-aware cache with reduced fragmentation for long contexts
L2: Disk Cache -- persistent spillover to disk for large context windows
L2 alt: Block Disk Store -- block-level disk persistence
KV Quantization -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)
Hybrid SSM Cache -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
Automatic cache type selection based on model architecture
Cache warming API (POST /v1/cache/warm) for pre-loading common prompts
Cache stats API (GET /v1/cache/stats) for monitoring hit rates and memory usage

Sampling & Generation Control

Full control over text generation:

Temperature (0.0 - 2.0) -- creativity control
Top-P (0.0 - 1.0) -- nucleus sampling
Top-K (integer) -- top-K token filtering
Min-P (0.0 - 1.0) -- minimum probability threshold
Repetition Penalty -- penalize repeated tokens
Stop Sequences -- custom stopping strings
Max Tokens -- output length limit (up to 131072)
Request Timeout -- per-request timeout override
Structured Output -- response_format with json_object or json_schema modes for guaranteed valid JSON
Streaming with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
Usage stats in streaming responses (stream_options.include_usage)

Model Conversion & Quantization

Convert models directly in-app via the Tools tab:

16-bit to MLX -- convert HuggingFace safetensors to MLX format
16-bit to quantized -- quantize to 2-bit, 4-bit, or 8-bit MLX
GGUF to MLX -- import GGUF models into MLX safetensors format
MLX to JANG -- adaptive mixed-precision quantization (different bits per layer type)
Model Inspector -- view config.json, architecture, layer structure
Model Doctor -- diagnostic checks (load test, token count, memory estimation)
Progress tracking with real-time status

Image Generation

Generate images locally with Flux and Z-Image models:

Flux Schnell -- 4-step fast generation
Flux Dev -- 20-step high-quality generation
Z-Image Turbo -- fast turbo generation (4-bit and 8-bit)
Flux Klein -- lightweight 4B parameter model
Flux Kontext -- subject-consistent editing
Flux Krea -- aesthetic fine-tuned generation
Configurable steps, guidance scale, height, width, seed, sampler
Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
Quantized model support (2-bit to 8-bit)
Image gallery with generation history, save, and settings persistence
OpenAI-compatible /v1/images/generations endpoint with usage field

Chat Interface

Full-featured conversation UI:

Persistent history -- SQLite (WAL mode) with full message, metrics, and tool call history
Markdown rendering -- GitHub-flavored markdown with syntax highlighting
Reasoning display -- collapsible thinking sections for reasoning models
Tool call display -- inline tool execution with status and results
Streaming metrics -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
System prompts -- per-chat custom system message
Chat settings -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
Chat folders -- hierarchical organization
Message search -- full-text search across chat history
Export/Import -- ShareGPT format
Voice chat -- STT + TTS integration

Model Management

HuggingFace browser -- search, filter by text/image, and download models directly in-app
Download queue -- multiple concurrent downloads with real-time progress bars and cancel support
Model size display -- file sizes from safetensors metadata before downloading
Local model discovery -- auto-scan ~/.mlxstudio/models, ~/.cache/huggingface/hub, ~/.exo/models, and custom directories
Deduplication -- strict format detection prevents false positive model matches
Zero-config detection -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
65+ model families in the auto-detection registry with two-tier detection (config.json model_type primary, name regex fallback)

Desktop Experience

5 app modes -- Chat, Server, Image, Tools, API
Menu bar tray -- live server status, GPU memory, running models, quick controls
Multi-session -- run multiple models simultaneously on different ports
Dock icon -- restore on click, close-to-tray support
Dark and light themes -- system-respecting
Keyboard shortcuts -- common actions
Toast notifications -- user feedback
Update banner -- new version detection

Advanced Quantization

MLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as JANG adaptive mixed-precision -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.

Convert in-app via the Tools tab, or via CLI: vmlx convert model --jang-profile JANG_3M
Pre-quantized models available at JANGQ-AI on HuggingFace
Stays quantized in GPU memory -- native MLX QuantizedLinear + quantized_matmul
Compatible with all caching layers (prefix, paged, disk, KV quant)

See the vMLX source repo for profiles and conversion details.

Smelt Mode (Partial Expert Loading)

For MoE models that don't fit in RAM, Smelt loads only a subset of experts per layer from SSD and keeps the backbone resident. Response quality stays coherent while RAM usage drops; throughput scales inversely with expert % loaded because expert swaps hit SSD on the hot path.

Benchmarks on Nemotron-Cascade-2-30B-A3B-JANG_4M (23 MoE layers × 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):

`--smelt-experts`	Active RAM	Decode tok/s	RAM saving	Coherent
off (baseline)	17,408 MB	89.9	—	✓
`50`	9,529 MB	66.5	−45%	✓
`25`	5,590 MB	*	−68%	✓

* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.

All three configurations produced coherent, non-looping output. No quality degradation observed.

Credit: Smelt mode is inspired by Anemll's flash-moe — a pure C / Objective‑C / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with pread() on demand. vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight — thanks to the flash-moe team for validating the approach.

Smelt is mutually exclusive with VLM mode. MLX Studio / vMLX v1.3.33+ automatically disables --is-mllm when smelt is active (with a warning) because the vision tower is not wired through the partial-expert loader — image input on a smelt-loaded VLM would produce garbage logits. Use a text-only model when running smelt, or disable smelt when running a VLM.

Requires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).

System Requirements

Requirement	Minimum
macOS	14.0 Sonoma or later
Chip	Apple Silicon (M1 / M2 / M3 / M4)
RAM	8 GB (16 GB+ recommended for larger models)
Disk	~500 MB for app; models range from 1-50 GB each

Build from Source

hljs language-bash

git clone https://github.com/jjang-ai/vmlx.git
cd vmlx

# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir   # .app bundle
npx electron-builder --mac dmg     # DMG installer

Links

Resource	Link
Source Code	github.com/jjang-ai/vmlx
PyPI	pypi.org/project/vmlx
MLX Models	huggingface.co/mlx-community
JANG Models	huggingface.co/JANGQ-AI
Website	vmlx.net

License

Apache License 2.0

Built by Jinho Jang • eric@jangq.ai • JANGQ AI • Support on Ko-fi

한국어 (Korean)

MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

Mac에서 LLM, VLM, 이미지 생성 및 편집 모델을 완전히 로컬로 실행하세요.

JANG 2비트가 MLX 4/3/2비트보다 높은 성능 — 적응형 혼합 정밀도 양자화(JANG_2S, JANG_2.6)가 MiniMax M2.5, Qwen3 등에서 표준 MLX 양자화를 능가합니다. jangq.ai에서 벤치마크 확인. JANGQ-AI에서 사전 양자화 모델 다운로드.

설치: 최신 DMG 다운로드 — 드래그 앤 드롭으로 설치.

주요 기능

기능	설명
채팅	대화 인터페이스, 도구 호출, 에이전트 코딩
이미지 생성	Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein
이미지 편집	Qwen Image Edit (텍스트 지시 기반 편집)
5단계 캐싱	프리픽스, 페이지드, KV 양자화, 디스크 캐시
API 서버	OpenAI + Anthropic 호환 API
30개 도구	파일, 웹 검색, Git, 터미널 내장 도구

개발자: 장진호 (eric@jangq.ai)
JANGQ AI • Ko-fi로 후원하기

The native macOS desktop app for local AI on Apple Silicon

vMLX v2 — native Swift + Metal, 50–95 t/s on M-series.
Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.
The Python panel above remains available for legacy support.

Features • Screenshots • API Server • Image Generation • JANG Quantization • Requirements • Build • 한국어

JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:

Quantization MMLU (200q) Size
JANG_2L (2-bit) 74% 89 GB
MLX 4-bit 26.5% 120 GB
MLX 3-bit 24.5% 93 GB
MLX 2-bit 25% 68 GB

Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at jangq.ai. Models at JANGQ-AI.

Install

Option 1: Download the App (Recommended)

Download the latest DMG -- one file, ready to go.

Download vMLX-X.Y.Z-arm64.dmg
Open the DMG and drag to Applications
Launch -- that's it

All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.

Option 2: CLI via pip (Engine Only)

The vMLX inference engine is published on PyPI as vmlx -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.

hljs language-bash

# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

Note: On macOS 14+, pip install vmlx without a venv will fail with "externally-managed-environment". Use uv, pipx, or create a venv first.

Once running, your local OpenAI-compatible API server is live at http://localhost:8000. Point any OpenAI or Anthropic SDK client at it.

Quick Start

Launch MLX Studio from Applications
Pick a model -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., mlx-community/Qwen3-8B-4bit)
Start the session -- the model downloads automatically and the server starts
Chat -- switch to the Chat tab and start talking

That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.

Screenshots

Chat Interface Streaming conversations with thinking mode, code highlighting, and markdown	Agentic Coding Full tool calling with file I/O, shell execution, and web search
Image Generation & Editing Flux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit	Anthropic API Compatible Drop-in /v1/messages endpoint for Anthropic SDK clients
Developer Tools Convert, inspect, and diagnose models	Model Conversion GGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision
HuggingFace Browser Search and download models directly in-app	Menu Bar Running models, GPU memory, and quick controls

Features

Model Support (65+ Model Families)

Run any MLX model from HuggingFace -- thousands of models, zero configuration:

Text LLMs -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Mistral-Medium-3.5 (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, Laguna (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
Vision LLMs (VL) -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat
Multimodal Omni -- Nemotron-3-Nano-Omni (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across /v1/chat/completions, /v1/messages, /v1/responses, and /api/chat
Mixture-of-Experts -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)
Hybrid SSM Models -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)
Image Generation -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
Image Editing -- Qwen Image Edit (instruction-based editing, full precision)
Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
JANG Models -- Adaptive mixed-precision quantized models from JANGQ-AI, stay quantized in GPU memory via native QuantizedLinear
GGUF Import -- Convert GGUF models to MLX format directly in-app

OpenAI-Compatible API Server

Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:

POST /v1/chat/completions -- Chat Completions API with streaming, tool calling, vision, structured output
POST /v1/responses -- OpenAI Responses API (agentic format) with streaming
POST /v1/completions -- Text completions
POST /v1/images/generations -- Image generation (Flux/Z-Image models, OpenAI format with usage field)
POST /v1/images/edits -- Image editing (Qwen Image Edit, instruction-based)
POST /v1/embeddings -- Text embeddings with dimension control and batch processing
POST /v1/rerank -- Document reranking
POST /v1/audio/speech -- Text-to-speech (Kokoro TTS)
POST /v1/audio/transcriptions -- Speech-to-text (Whisper)
GET /v1/models -- List loaded models
GET /health -- Server health with VRAM usage, queue length, load times

Anthropic API Compatibility

Drop-in replacement for the Anthropic Claude API:

POST /v1/messages -- Anthropic Messages API format
Anthropic SDK tool calling format (auto-translated to internal format)
Vision/multimodal support via Anthropic content blocks
Use the Anthropic Python/TypeScript SDK -- just change the base_url to your local server
Copy-paste code snippets in the API tab for curl, Python, and JavaScript

Tool Calling & Agentic Workflows (14 Parsers)

Auto-detected tool call parsers for every major model family:

Qwen (qwen3, qwen2.5) -- <tool_call> XML format
Llama 3 -- <function=name> format
Mistral -- [TOOL_CALLS] format
Hermes -- <tool_call> JSON format
DeepSeek -- function call blocks
GLM-4.7 -- GLM tool format
MiniMax -- MiniMax function calling
Nemotron -- NVIDIA Nemotron tool format
Granite -- IBM Granite format
Functionary -- Functionary v3 format
XLAM -- Salesforce xLAM format
Kimi -- Moonshot Kimi format
Step-3.5 -- StepFun format
Auto-detection from model_type in config.json with regex name fallback

26+ Built-in Tools:

File I/O -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
Search -- ripgrep file search with regex and glob, glob file finder, unified diff
Execution -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
Web -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
Developer -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
Interactive -- ask_user tool for human-in-the-loop interrupts
Per-category toggles: enable/disable file, search, shell, web tools independently
Auto-continue agent loops (up to 10 tool iterations per request)
MCP (Model Context Protocol) -- connect external tool servers, merge tool definitions, execute MCP tools via API

Reasoning Model Support (4 Parsers)

Collapsible thinking blocks with dedicated parsing for reasoning models:

Qwen3 / Qwen3.5 -- <think>...</think> blocks
DeepSeek-R1 -- DeepSeek reasoning format
OpenAI GPT-OSS / GLM-4.7 -- GPT-OSS thinking format
Phi-4-reasoning -- reasoning content extraction
Enable/disable thinking per request
Reasoning effort control (low/medium/high)
Streaming reasoning content with proper tokenization

Vision & Multimodal (VLM)

Full multimodal input support for vision-language models:

Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
Audio -- Base64 or URL audio input (Qwen3-Audio)
Image detail levels: auto, low, high
Dedicated MLLM cache for image/video embeddings (separate from KV cache)
Send images directly in chat to any VL model

Continuous Batching & Concurrency

Production-grade multi-user serving:

Continuous batching -- handle 32+ concurrent requests with dynamic slot allocation
Prefill batching -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
Completion batching -- batch token generation across sequences
Stream interval control -- configure streaming frequency
Request pooling -- efficiently share GPU memory across concurrent sequences
Rate limiting -- optional per-client request limits
API key authentication -- optional --api-key flag for secured access

5-Layer Cache Stack

Multi-tier caching for maximum throughput and memory efficiency:

L1: Memory-Aware Prefix Cache -- token-level semantic caching with LRU eviction, configurable memory allocation
L1 alt: Paged KV Cache -- block-aware cache with reduced fragmentation for long contexts
L2: Disk Cache -- persistent spillover to disk for large context windows
L2 alt: Block Disk Store -- block-level disk persistence
KV Quantization -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)
Hybrid SSM Cache -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
Automatic cache type selection based on model architecture
Cache warming API (POST /v1/cache/warm) for pre-loading common prompts
Cache stats API (GET /v1/cache/stats) for monitoring hit rates and memory usage

Sampling & Generation Control

Full control over text generation:

Temperature (0.0 - 2.0) -- creativity control
Top-P (0.0 - 1.0) -- nucleus sampling
Top-K (integer) -- top-K token filtering
Min-P (0.0 - 1.0) -- minimum probability threshold
Repetition Penalty -- penalize repeated tokens
Stop Sequences -- custom stopping strings
Max Tokens -- output length limit (up to 131072)
Request Timeout -- per-request timeout override
Structured Output -- response_format with json_object or json_schema modes for guaranteed valid JSON
Streaming with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
Usage stats in streaming responses (stream_options.include_usage)

Model Conversion & Quantization

Convert models directly in-app via the Tools tab:

16-bit to MLX -- convert HuggingFace safetensors to MLX format
16-bit to quantized -- quantize to 2-bit, 4-bit, or 8-bit MLX
GGUF to MLX -- import GGUF models into MLX safetensors format
MLX to JANG -- adaptive mixed-precision quantization (different bits per layer type)
Model Inspector -- view config.json, architecture, layer structure
Model Doctor -- diagnostic checks (load test, token count, memory estimation)
Progress tracking with real-time status

Image Generation

Generate images locally with Flux and Z-Image models:

Flux Schnell -- 4-step fast generation
Flux Dev -- 20-step high-quality generation
Z-Image Turbo -- fast turbo generation (4-bit and 8-bit)
Flux Klein -- lightweight 4B parameter model
Flux Kontext -- subject-consistent editing
Flux Krea -- aesthetic fine-tuned generation
Configurable steps, guidance scale, height, width, seed, sampler
Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
Quantized model support (2-bit to 8-bit)
Image gallery with generation history, save, and settings persistence
OpenAI-compatible /v1/images/generations endpoint with usage field

Chat Interface

Full-featured conversation UI:

Persistent history -- SQLite (WAL mode) with full message, metrics, and tool call history
Markdown rendering -- GitHub-flavored markdown with syntax highlighting
Reasoning display -- collapsible thinking sections for reasoning models
Tool call display -- inline tool execution with status and results
Streaming metrics -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
System prompts -- per-chat custom system message
Chat settings -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
Chat folders -- hierarchical organization
Message search -- full-text search across chat history
Export/Import -- ShareGPT format
Voice chat -- STT + TTS integration

Model Management

HuggingFace browser -- search, filter by text/image, and download models directly in-app
Download queue -- multiple concurrent downloads with real-time progress bars and cancel support
Model size display -- file sizes from safetensors metadata before downloading
Local model discovery -- auto-scan ~/.mlxstudio/models, ~/.cache/huggingface/hub, ~/.exo/models, and custom directories
Deduplication -- strict format detection prevents false positive model matches
Zero-config detection -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
65+ model families in the auto-detection registry with two-tier detection (config.json model_type primary, name regex fallback)

Desktop Experience

5 app modes -- Chat, Server, Image, Tools, API
Menu bar tray -- live server status, GPU memory, running models, quick controls
Multi-session -- run multiple models simultaneously on different ports
Dock icon -- restore on click, close-to-tray support
Dark and light themes -- system-respecting
Keyboard shortcuts -- common actions
Toast notifications -- user feedback
Update banner -- new version detection

Advanced Quantization

Convert in-app via the Tools tab, or via CLI: vmlx convert model --jang-profile JANG_3M
Pre-quantized models available at JANGQ-AI on HuggingFace
Stays quantized in GPU memory -- native MLX QuantizedLinear + quantized_matmul
Compatible with all caching layers (prefix, paged, disk, KV quant)

See the vMLX source repo for profiles and conversion details.

Smelt Mode (Partial Expert Loading)

Benchmarks on Nemotron-Cascade-2-30B-A3B-JANG_4M (23 MoE layers × 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):

`--smelt-experts`	Active RAM	Decode tok/s	RAM saving	Coherent
off (baseline)	17,408 MB	89.9	—	✓
`50`	9,529 MB	66.5	−45%	✓
`25`	5,590 MB	*	−68%	✓

* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.

All three configurations produced coherent, non-looping output. No quality degradation observed.

Credit: Smelt mode is inspired by Anemll's flash-moe — a pure C / Objective‑C / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with pread() on demand. vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight — thanks to the flash-moe team for validating the approach.

Requires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).

System Requirements

Requirement	Minimum
macOS	14.0 Sonoma or later
Chip	Apple Silicon (M1 / M2 / M3 / M4)
RAM	8 GB (16 GB+ recommended for larger models)
Disk	~500 MB for app; models range from 1-50 GB each

Build from Source

hljs language-bash

git clone https://github.com/jjang-ai/vmlx.git
cd vmlx

# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir   # .app bundle
npx electron-builder --mac dmg     # DMG installer

Links

Resource	Link
Source Code	github.com/jjang-ai/vmlx
PyPI	pypi.org/project/vmlx
MLX Models	huggingface.co/mlx-community
JANG Models	huggingface.co/JANGQ-AI
Website	vmlx.net

License

Apache License 2.0

Built by Jinho Jang • eric@jangq.ai • JANGQ AI • Support on Ko-fi

한국어 (Korean)

MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

Mac에서 LLM, VLM, 이미지 생성 및 편집 모델을 완전히 로컬로 실행하세요.

JANG 2비트가 MLX 4/3/2비트보다 높은 성능 — 적응형 혼합 정밀도 양자화(JANG_2S, JANG_2.6)가 MiniMax M2.5, Qwen3 등에서 표준 MLX 양자화를 능가합니다. jangq.ai에서 벤치마크 확인. JANGQ-AI에서 사전 양자화 모델 다운로드.

설치: 최신 DMG 다운로드 — 드래그 앤 드롭으로 설치.

주요 기능

기능	설명
채팅	대화 인터페이스, 도구 호출, 에이전트 코딩
이미지 생성	Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein
이미지 편집	Qwen Image Edit (텍스트 지시 기반 편집)
5단계 캐싱	프리픽스, 페이지드, KV 양자화, 디스크 캐시
API 서버	OpenAI + Anthropic 호환 API
30개 도구	파일, 웹 검색, Git, 터미널 내장 도구

개발자: 장진호 (eric@jangq.ai)
JANGQ AI • Ko-fi로 후원하기

Quantization	MMLU (200q)	Size
JANG_2L (2-bit)	74%	89 GB
MLX 4-bit	26.5%	120 GB
MLX 3-bit	24.5%	93 GB
MLX 2-bit	25%	68 GB

mlxstudio

The native macOS desktop app for local AI on Apple Silicon

Install

Option 1: Download the App (Recommended)

Option 2: CLI via pip (Engine Only)

Quick Start

Screenshots

Features

Model Support (65+ Model Families)

OpenAI-Compatible API Server

Anthropic API Compatibility

Tool Calling & Agentic Workflows (14 Parsers)

Reasoning Model Support (4 Parsers)

Vision & Multimodal (VLM)

Continuous Batching & Concurrency

5-Layer Cache Stack

Sampling & Generation Control

Model Conversion & Quantization

Image Generation

Chat Interface

Model Management

Desktop Experience

Advanced Quantization

Smelt Mode (Partial Expert Loading)

System Requirements

Build from Source

Links

License

한국어 (Korean)

MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

주요 기능

Similar Packages

mlxstudio

The native macOS desktop app for local AI on Apple Silicon

Install

Option 1: Download the App (Recommended)

Option 2: CLI via pip (Engine Only)

Quick Start

Screenshots

Features

Model Support (65+ Model Families)

OpenAI-Compatible API Server

Anthropic API Compatibility

Tool Calling & Agentic Workflows (14 Parsers)

Reasoning Model Support (4 Parsers)

Vision & Multimodal (VLM)

Continuous Batching & Concurrency

5-Layer Cache Stack

Sampling & Generation Control

Model Conversion & Quantization

Image Generation

Chat Interface

Model Management

Desktop Experience

Advanced Quantization

Smelt Mode (Partial Expert Loading)

System Requirements

Build from Source

Links

License

한국어 (Korean)

MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

주요 기능

Similar Packages