A community-driven registry for Claude, Cursor, Windsurf, Cline & more. Not affiliated with Anthropic.
Are you the author? Sign in to claim
NVIDIA GPU hardware introspection for Kubernetes clusters via MCP
Just-in-Time SRE Diagnostic Agent for NVIDIA GPU Clusters on Kubernetes
k8s-gpu-mcp-server is an ephemeral diagnostic agent that provides surgical,
real-time NVIDIA GPU hardware introspection for Kubernetes clusters via the
Model Context Protocol (MCP).
Unlike traditional monitoring systems, this agent is designed for AI-assisted troubleshooting by SREs debugging complex hardware failures that standard Kubernetes APIs cannot detect.
Click the button above to install automatically in Cursor.
# Using npx (recommended)
npx k8s-gpu-mcp-server@latest
# Or install globally
npm install -g k8s-gpu-mcp-server
Add to ~/.cursor/mcp.json (Cursor) or VS Code MCP config:
{
"mcpServers": {
"k8s-gpu-mcp": {
"command": "npx",
"args": ["-y", "k8s-gpu-mcp-server@latest"]
}
}
}
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"k8s-gpu-mcp": {
"command": "npx",
"args": ["-y", "k8s-gpu-mcp-server@latest"]
}
}
}
# Clone and build
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent
# Test with mock GPUs (no hardware required)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=mock
# Test with real GPU (requires NVIDIA driver)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=real
# Deploy with Helm OCI (recommended)
helm install k8s-gpu-mcp-server \
oci://ghcr.io/arangogutierrez/charts/k8s-gpu-mcp-server \
--namespace gpu-diagnostics --create-namespace
# Or from local chart
helm install k8s-gpu-mcp-server ./deployment/helm/k8s-gpu-mcp-server \
--namespace gpu-diagnostics --create-namespace
# Find agent pod on target node
NODE_NAME=<node-name>
POD=$(kubectl get pods -n gpu-diagnostics \
-l app.kubernetes.io/name=k8s-gpu-mcp-server \
--field-selector spec.nodeName=$NODE_NAME \
-o jsonpath='{.items[0].metadata.name}')
# Start diagnostic session
kubectl exec -it -n gpu-diagnostics $POD -- /agent --mode=read-only
Note: GPU access requires
runtimeClassName: nvidiaconfigured by GPU Operator or nvidia-ctk. For clusters without RuntimeClass, use fallback:--set gpu.runtimeClass.enabled=false --set gpu.resourceRequest.enabled=true
For deployed agents, add to your Claude Desktop configuration:
{
"mcpServers": {
"k8s-gpu-agent": {
"command": "kubectl",
"args": ["exec", "-i", "deploy/k8s-gpu-mcp-server", "-n", "gpu-diagnostics", "--", "/agent"]
}
}
}
Then ask Claude: "What's the temperature of the GPUs?"
📖 Full Quick Start Guide → | Kubernetes Deployment →
┌─────────────────────────────────────────────────────────────────────┐
│ MCP Client (Claude/Cursor) │
└────────────────────────────┬────────────────────────────────────────┘
│ stdio / HTTP
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Gateway Pod (:8080) │
│ Router → Circuit Breaker → HTTP Client │
└────────────────────────────┬────────────────────────────────────────┘
│ HTTP (pod-to-pod)
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Agent (Node 1) │ │ Agent (Node 2) │ │ Agent (Node N) │
│ 9 MCP Tools │ │ 9 MCP Tools │ │ 9 MCP Tools │
│ NVML → GPU │ │ NVML → GPU │ │ NVML → GPU │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Design Principles:
📖 Architecture Documentation →
| Tool | Description | Category | Status |
|---|---|---|---|
get_gpu_inventory | Hardware inventory + telemetry | NVML | ✅ Available |
get_gpu_health | GPU health monitoring with scoring | NVML | ✅ Available |
analyze_xid_errors | Parse GPU XID error codes from kernel logs | NVML | ✅ Available |
get_nvlink_topology | NVLink interconnect topology and health | NVML | ✅ Available |
get_gpu_timeline | Historical GPU metrics from flight recorder | NVML + Blackbox | ✅ Available |
describe_gpu_node | Node-level GPU diagnostics with K8s metadata | K8s + NVML | ✅ Available |
get_pod_gpu_allocation | GPU-to-Pod correlation via resource requests | K8s | ✅ Available |
explain_failure | Root cause analysis for failed GPU workloads | K8s + Incidents | ✅ Available |
get_incident_report | Detailed incident report with timeline and snapshots | K8s + Incidents | ✅ Available |
kill_gpu_process | Terminate GPU process | Operator | 🚧 M4 (Operator) |
reset_gpu | GPU reset | Operator | 🚧 M4 (Operator) |
MCP Prompts provide guided diagnostic workflows that orchestrate multiple tools.
See pkg/prompts/prompts.go for prompt definitions.
| Prompt | Description |
|---|---|
gpu-health-check | Comprehensive GPU health assessment with recommendations |
diagnose-xid-errors | Analyze NVIDIA XID errors with remediation guidance |
gpu-triage | Standard SRE triage workflow: inventory → health → XID analysis |
Example usage with Claude:
You: "Run the GPU triage workflow for node gpu-worker-5"
Claude: [Executes gpu-triage prompt]
→ Calls get_gpu_inventory, get_gpu_health, analyze_xid_errors
→ Returns structured triage report with recommendations
| Mode | Flag | Description |
|---|---|---|
| Read-Only (default) | --mode=read-only | All diagnostic tools, no mutations |
| Operator | --mode=operator | Enables future mutating operations (kill process, reset GPU) |
Read-only mode is the default and recommended for most use cases. Operator mode enables future M4 tools that perform write operations on GPUs.
The agent includes a built-in flight recorder (pkg/blackbox) that continuously
captures GPU telemetry (temperature, power, utilization, memory) into per-GPU
ring buffers. This enables tools like get_gpu_timeline and get_incident_report
to query historical GPU metrics around the time of a failure.
The flight recorder starts automatically with the agent and requires no additional configuration. Data is retained in memory for the configured window (default: 30 minutes).
Progress: ~90% Complete (HTTP Transport ✅, Gateway ✅, K8s Tools ✅)
describe_gpu_node, get_pod_gpu_allocation)make test # Run all unit tests (538 tests passing)
make coverage # Generate coverage report
make coverage-html # View coverage in browser
make test-integration # Run on GPU hardware
# Or manually:
go test -tags=integration -v ./pkg/nvml/
Latest Test Results:
✓ 538 total tests passing
✓ Race detector enabled (-race)
✓ Coverage: 58-80% by package
Integration tested on Tesla T4:
- GPU: Tesla T4 (15GB)
- Temperature: 29°C
- Power: 13.9W
- All NVML operations verified
# Build for local platform
make agent
# Build for Linux (with real NVML)
CGO_ENABLED=1 GOOS=linux GOARCH=amd64 make agent
# Build container image
make image
# Multi-arch release builds
make dist
Binary Sizes:
# Run directly with npx
npx k8s-gpu-mcp-server@latest
# Or install globally
npm install -g k8s-gpu-mcp-server
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent
sudo mv bin/agent /usr/local/bin/k8s-gpu-mcp-server
go install github.com/ArangoGutierrez/k8s-gpu-mcp-server/cmd/agent@latest
docker pull ghcr.io/arangogutierrez/k8s-gpu-mcp-server:latest
# Install from GHCR OCI registry
helm install k8s-gpu-mcp-server \
oci://ghcr.io/arangogutierrez/charts/k8s-gpu-mcp-server \
--namespace gpu-diagnostics --create-namespace
We welcome contributions! Please see our Development Guide for details.
git checkout -b feat/my-featuremake allgit commit -s -S -m "feat(scope): description"SRE: "Why is the training job on node-5 stuck?"
Claude → k8s-gpu-mcp-server → Detects XID 48 (ECC Error)
Claude: "Node-5 has uncorrectable memory errors. Drain immediately."
SRE: "Are any GPUs thermal throttling?"
Claude → k8s-gpu-mcp-server → Checks temps and throttle status
Claude: "GPU 3 is at 86°C and thermal throttling. Check cooling."
SRE: "Is NVLink properly configured for multi-GPU training?"
Claude → k8s-gpu-mcp-server → Inspects NVLink topology
Claude: "All 8 GPUs connected via NVLink, 600GB/s bandwidth."
SRE: "GPU memory is full but no pods are running"
Claude → k8s-gpu-mcp-server → Lists GPU processes
Claude: "Found zombie process PID 12345 using 8GB. Kill it?"
Apache License 2.0 - See LICENSE for details.
Maintainer: @ArangoGutierrez
Issues: GitHub Issues
Discussions: GitHub Discussions
⭐ Star us on GitHub — it helps!
MCP server integration for DaVinci Resolve Studio
Run Claude Code as an MCP server so any agent can delegate coding tasks to it
Browser automation using accessibility snapshots instead of screenshots
A Jetbrains IDE IntelliJ plugin aimed to provide coding agents the ability to leverage intelliJ's indexing of the codeba