Claude Code skill for autonomous, goal-directed iteration. /autoresearch <goal> builds a real-data benchmark harness, ca
Inspired by Andrej Karpathy's autoresearch — extended into a universal, real-data benchmark-driven workflow for any engineering task.
╔════════════════════════════════════════════════════╗
║ MODIFY → VERIFY → REGRESS → KEEP / DISCARD → ∞ ║
╚════════════════════════════════════════════════════╝
Autoresearch is a Claude Code skill that turns a free-form goal like
/autoresearch reduce API p95 latency to 200ms
into an autonomous, self-correcting optimization loop.
No hand-holding. No "should I continue?" Just mechanical iteration until the goal is reached.
| Principle | What it means |
|---|---|
| 🎯 Mechanical, not subjective | Every iteration is judged by a single floating-point metric extracted from a command. "Looks better" is banned. |
| 🛡️ Regression-proof | A hard gate rolls back any change that drops a pre-existing passing test, no matter how big the win looked. |
| 📊 Real data only | The harness refuses synthetic corpora. If you can't scrape, export, or tail it from reality, the loop won't start. |
| ♻️ Atomic & reversible | One change per iteration, git-committed before verification. A failed experiment is always |
| 🧩 Domain-agnostic | Backend latency, test coverage, bundle size, flakiness, LOC, build time, Lighthouse scores: same loop, different metric. |
| 🌐 Global Claude Code skill | Install once under |
# Clone into your Claude Code skills directory
git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git \
~/.claude/skills/autoresearch
On Windows:
git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git `
"$env:USERPROFILE\.claude\skills\autoresearch"
Open Claude Code in any project and type:
/autoresearch reduce avg_latency_ms below 500
Claude will print the parsed slot dump, walk through the five harness phases, then begin the autonomous loop.
Results are appended to autoresearch/results.tsv in your project:
commit metric avg_latency_ms status description
953b71d 1.000000 2008.9 baseline 77-case corpus, 10 signals executed
953b71d 1.000000 646.4 keep parallelize _place_exits with asyncio.gather
953b71d 1.000000 592.1 keep add 30s TTL cache on account()
953b71d 1.000000 488.8 keep prewarm account+connection pool, serialize signals
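A log in this shape is easy to summarize programmatically. The sketch below is only an illustration, not part of the skill: it assumes the file is tab-separated with the header row shown above, and it embeds the sample rows inline so it runs without a real `autoresearch/results.tsv`. The `summarize` helper is a hypothetical name.

```python
import csv
import io

# Sample rows copied from the log above. A real run would read
# autoresearch/results.tsv instead; tab separation is an assumption.
SAMPLE = (
    "commit\tmetric\tavg_latency_ms\tstatus\tdescription\n"
    "953b71d\t1.000000\t2008.9\tbaseline\t77-case corpus, 10 signals executed\n"
    "953b71d\t1.000000\t646.4\tkeep\tparallelize _place_exits with asyncio.gather\n"
    "953b71d\t1.000000\t592.1\tkeep\tadd 30s TTL cache on account()\n"
    "953b71d\t1.000000\t488.8\tkeep\tprewarm account+connection pool, serialize signals\n"
)


def summarize(tsv_text: str, column: str = "avg_latency_ms") -> dict:
    """Return the baseline, the best kept value, and the percent reduction."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    baseline = next(float(r[column]) for r in rows if r["status"] == "baseline")
    # Discarded iterations are ignored; only baseline and kept rows count.
    kept = [float(r[column]) for r in rows if r["status"] in ("baseline", "keep")]
    best = min(kept)
    return {
        "baseline": baseline,
        "best": best,
        "reduction_pct": round(100 * (baseline - best) / baseline, 1),
    }


summary = summarize(SAMPLE)
print(summary)
```

For a metric that should go up rather than down, `min` would need to become `max` and the reduction formula would flip sign.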
The skill is plain Markdown — drop it into any agent that supports custom prompts, rules, or instructions. Clone the repo once to a staging location, then symlink or copy into each tool's config path:
git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git ~/autoresearch-skill
On Windows substitute %USERPROFILE%\autoresearch-skill for ~/autoresearch-skill in the commands below.
Codex CLI loads custom prompts from ~/.codex/prompts/.
mkdir -p ~/.codex/prompts
cp ~/autoresearch-skill/SKILL.md ~/.codex/prompts/autoresearch.md
cp -r ~/autoresearch-skill/references ~/.codex/prompts/
Invoke in Codex chat: /autoresearch reduce API p95 to 200ms.
💡 Tip: add @file ~/.codex/prompts/autoresearch.md to ~/.codex/AGENTS.md to keep the rubric pre-loaded in every session.
Copilot Chat supports prompt files (*.prompt.md) and instructions files (*.instructions.md).
Project-scoped (checked into the repo):
mkdir -p .github/prompts
cp ~/autoresearch-skill/SKILL.md .github/prompts/autoresearch.prompt.md
cp -r ~/autoresearch-skill/references .github/prompts/
User-scoped (available in every project):
| OS | Path |
|---|---|
| Windows | %APPDATA%\Code\User\prompts\autoresearch.prompt.md |
| macOS | ~/Library/Application Support/Code/User/prompts/autoresearch.prompt.md |
| Linux | ~/.config/Code/User/prompts/autoresearch.prompt.md |
Enable prompt files in VS Code settings.json:
{ "chat.promptFiles": true }
Invoke: /autoresearch <goal> in Copilot Chat.
OpenCode keeps custom commands at ~/.config/opencode/command/.
mkdir -p ~/.config/opencode/command
cp ~/autoresearch-skill/SKILL.md ~/.config/opencode/command/autoresearch.md
cp -r ~/autoresearch-skill/references ~/.config/opencode/command/
Invoke: /autoresearch <goal> in OpenCode.
Cursor loads rules from .cursor/rules/*.mdc.
mkdir -p .cursor/rules
cp ~/autoresearch-skill/SKILL.md .cursor/rules/autoresearch.mdc
cp -r ~/autoresearch-skill/references .cursor/rules/
Prepend this frontmatter to .cursor/rules/autoresearch.mdc:
---
description: Autonomous goal-directed iteration
globs:
alwaysApply: false
---
Invoke: @autoresearch reduce API p95 to 200ms in Cursor Chat (the @rule syntax attaches the rule to the prompt).
For a global install across every project, use Cursor's Settings → Rules → User Rules pane and paste the skill content.
Windsurf loads rules/workflows from .windsurf/rules/.
mkdir -p .windsurf/rules
cp ~/autoresearch-skill/SKILL.md .windsurf/rules/autoresearch.md
cp -r ~/autoresearch-skill/references .windsurf/rules/
Prepend this frontmatter:
---
trigger: manual
description: Autonomous goal-directed iteration
---
Invoke: @autoresearch reduce bundle size below 200KB in Windsurf Chat.
For global scope, drop the files in ~/.windsurf/rules/ instead of the project-local path.
Any extension that respects .github/copilot-instructions.md (e.g. GitHub Copilot Chat, Cline, Roo Code):
mkdir -p .github
cat ~/autoresearch-skill/SKILL.md >> .github/copilot-instructions.md
For Continue:
mkdir -p ~/.continue/commands
cp ~/autoresearch-skill/SKILL.md ~/.continue/commands/autoresearch.md
Then prompt the assistant: "Apply the autoresearch skill to reduce p95 to 200ms" — the full playbook is already in context.
| Tool | Install path | Invocation |
|---|---|---|
| Claude Code | ~/.claude/skills/autoresearch/ | /autoresearch <goal> |
| OpenAI Codex CLI | ~/.codex/prompts/ | /autoresearch <goal> |
| GitHub Copilot | .github/prompts/ · user prompts dir | /autoresearch <goal> |
| OpenCode | ~/.config/opencode/command/ | /autoresearch <goal> |
| Cursor | .cursor/rules/*.mdc | @autoresearch <goal> |
| Windsurf | .windsurf/rules/ | @autoresearch <goal> |
| Continue | ~/.continue/commands/ | /autoresearch <goal> |
| Cline / Roo Code | .github/copilot-instructions.md | natural language |
⚠️ Note: /autoresearch works autonomously. For best results use a model with strong agentic and long-context capabilities (Claude Opus/Sonnet 4+, GPT-4.1+, Gemini 2.5 Pro).
flowchart TD
A["/autoresearch [goal]"] --> B[Parse goal → 7 slots]
B --> C{"corpus<br/>required?"}
C -- yes --> D[📦 Phase A<br/>Ingest real data]
C -- no --> E[🛠️ Phase B<br/>Build harness]
D --> E
E --> F[📐 Phase C<br/>Capture baseline]
F --> G[🛡️ Phase D<br/>Run regression suite]
G --> H[🔎 Phase E<br/>Read hot path]
H --> I((LOOP))
I --> J[Modify ONE file]
J --> K[git commit]
K --> L[Run harness]
L --> M[Run regression suite]
M --> N{metric<br/>improved?}
N -- yes --> O[✅ keep + log]
N -- no --> P[🗑️ discard + reset]
M -- regressed --> P
O --> Q{goal<br/>hit?}
P --> I
Q -- no --> I
Q -- yes --> R[🏁 Done]
style A fill:#8A2BE2,stroke:#333,color:#fff
style D fill:#FF6F00,stroke:#333,color:#fff
style E fill:#FF6F00,stroke:#333,color:#fff
style F fill:#FF6F00,stroke:#333,color:#fff
style G fill:#FF6F00,stroke:#333,color:#fff
style H fill:#FF6F00,stroke:#333,color:#fff
style O fill:#2e7d32,stroke:#333,color:#fff
style P fill:#c62828,stroke:#333,color:#fff
style R fill:#1565c0,stroke:#333,color:#fff
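The keep/discard branch of the flowchart can be read as a small pure function. The following is a minimal sketch under stated assumptions, not the skill's actual implementation: `decide` and `goal_hit` are hypothetical names, the regression gate is modelled as a comparison of passing-test name sets, and reaching the target exactly is treated as hitting the goal.

```python
from typing import Optional, Set


def decide(prev_best: float, new_metric: float, direction: str,
           passing_before: Set[str], passing_after: Set[str]) -> str:
    """Return "keep" or "discard" for one iteration of the loop."""
    # Regression gate: any previously passing test that no longer passes
    # forces a discard, regardless of how the metric moved.
    if passing_before - passing_after:
        return "discard"
    if direction == "min":
        improved = new_metric < prev_best
    elif direction == "max":
        improved = new_metric > prev_best
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return "keep" if improved else "discard"


def goal_hit(metric: float, target: Optional[float], direction: str) -> bool:
    """True once the metric reaches the target; unbounded goals never finish."""
    if target is None:
        return False
    return metric <= target if direction == "min" else metric >= target


# Hypothetical test names, used only to exercise the gate.
tests = {"test_parse", "test_place_exits"}
print(decide(646.4, 592.1, "min", tests, tests))           # keep
print(decide(646.4, 400.0, "min", tests, {"test_parse"}))  # discard
print(goal_hit(488.8, 500.0, "min"))                       # True
```

Checking the regression gate before the metric mirrors the flowchart, where a regressed suite goes straight to discard.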
| Command | Purpose |
|---|---|
| /autoresearch <goal> | Default path: parse free-form goal, build harness, loop until goal met |
| /autoresearch | Bare autonomous loop (assumes scope/metric/verify already defined) |
| /autoresearch:plan | Interactive wizard: Goal → Scope → Metric → Verify |
| /autoresearch:security | Autonomous security audit (STRIDE + OWASP Top 10 + red-team personas) |
Chain with Claude Code's /loop for bounded runs:
/loop 25 /autoresearch reduce bundle size below 200KB
When you type /autoresearch <goal>, Claude extracts seven slots from your free-form text:
| Slot | Example | Fallback |
|---|---|---|
| metric | latency, reliability, coverage, flakiness, bundle size | Ask user |
| direction | reduce/lower/minimise → min · raise/maximise → max | Inferred from noun |
| target | 500ms, 95%, 0%, <200KB | "best achievable" (unbounded) |
| scope | Files matching goal domain terms | Whole repo minus deps |
| corpus_source | prod logs, fixtures, scraped data | Required for empirical metrics |
| verify_cmd | python benchmark.py | Constructed during Phase B |
| regression_cmd | pytest -q, npm test, cargo test, go test ./... | Auto-detected |
Three worked examples:
/autoresearch reduce API p95 latency to 200ms
→ metric=p95_latency_ms, direction=minimise, target=200, verify_cmd=python benchmark.py
/autoresearch reduce test flakiness to 0%
→ metric=flaky_test_rate, direction=minimise, target=0, corpus=CI run history
/autoresearch increase signal-parser reliability to 99%
→ metric=reliability, direction=maximise, target=0.99, regression_cmd=pytest -q
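In the skill, Claude performs this extraction from natural language, so no parser code is involved. Purely to make the slot structure concrete, here is a toy sketch that fills two of the seven slots (direction and target) with simple rules. The `Slots` class, `parse_goal`, and the regular expression are all hypothetical, and the US spellings in the verb sets are additions not found in the table.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Verbs from the slot table and worked examples; US spellings are additions.
MIN_VERBS = {"reduce", "lower", "minimise", "minimize"}
MAX_VERBS = {"raise", "increase", "maximise", "maximize"}

# Hypothetical pattern: a number following "to", "below", "under", or "above".
TARGET_RE = re.compile(
    r"\b(?:to|below|under|above)\s+<?\s*(\d+(?:\.\d+)?)\s*(ms|%|kb)?",
    re.IGNORECASE,
)


@dataclass
class Slots:
    metric: Optional[str] = None
    direction: Optional[str] = None   # "min" or "max"
    target: Optional[float] = None    # None means "best achievable"
    scope: Optional[str] = None
    corpus_source: Optional[str] = None
    verify_cmd: Optional[str] = None
    regression_cmd: Optional[str] = None


def parse_goal(goal: str) -> Slots:
    """Fill direction and target from a free-form goal; other slots stay None."""
    slots = Slots()
    for word in goal.lower().split():
        if word in MIN_VERBS:
            slots.direction = "min"
            break
        if word in MAX_VERBS:
            slots.direction = "max"
            break
    match = TARGET_RE.search(goal)
    if match:
        value = float(match.group(1))
        # Percentages become fractions, matching target=0.99 in the example above.
        slots.target = value / 100 if match.group(2) == "%" else value
    return slots


print(parse_goal("reduce API p95 latency to 200ms"))
print(parse_goal("increase signal-parser reliability to 99%"))
```

Anchoring the number to a preposition keeps the pattern from mistaking the `95` in `p95` for the target.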
The default path runs through five mandatory phases before the loop begins.
sequenceDiagram
autonumber
participant User
participant Claude
participant Repo
participant Tests
User->>Claude: /autoresearch [goal]
Claude->>Claude: Parse goal → 7 slots
Claude->>User: Print parsed-slot dump
Note over Claude,Repo: 📦 Phase A — Corpus Ingestion
Claude->>Repo: Scrape/locate real data
Claude->>Repo: Write autoresearch/data/*.jsonl
Claude-->>User: corpus N cases from source
Note over Claude,Repo: 🛠️ Phase B — Harness Construction
Claude->>Repo: Write benchmark.py (single file)
Claude->>Claude: verify stdout starts with metric:
Note over Claude,Repo: 📐 Phase C — Baseline Capture
Claude->>Repo: Run benchmark.py
Claude->>Repo: Append iteration #0 to results.tsv
Note over Claude,Tests: 🛡️ Phase D — Regression Gate
Claude->>Tests: Run pytest -q (or equivalent)
Claude->>Repo: Record N_pre in .regression-baseline
Note over Claude,Repo: 🔎 Phase E — Hot-Path Reading
Claude->>Repo: Trace entry → handler → I/O
Claude-->>User: 3-5 candidate ideas
loop Until goal hit or interrupted
Claude->>Repo: Modify ONE file
Claude->>Repo: git commit
Claude->>Repo: Run benchmark.py
Claude->>Tests: Run regression suite
alt regression passed && metric improved
Claude->>Repo: status=keep
else regression failed OR metric worse
Claude->>Repo: git reset --hard HEAD~1
Claude->>Repo: status=discard
end
end
Full protocol: references/benchmark-harness.md
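Phase B checks that the harness's stdout starts with `metric:`. One way to picture that contract is a small extractor that accepts only a first line of the form `metric: <float>`. This is a sketch under that assumed format: the exact output layout is defined in references/benchmark-harness.md, and `extract_metric` is a hypothetical helper name.

```python
import re

# Assumed contract: the first stdout line is "metric: <float>".
METRIC_LINE = re.compile(
    r"^metric:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def extract_metric(stdout: str) -> float:
    """Parse the metric from the first line of harness output, or raise."""
    lines = stdout.splitlines()
    first_line = lines[0] if lines else ""
    match = METRIC_LINE.match(first_line)
    if match is None:
        raise ValueError(f"first line is not a metric line: {first_line!r}")
    return float(match.group(1))


print(extract_metric("metric: 488.8\nran 77 cases\n"))  # 488.8
```

Failing loudly on malformed output keeps the loop from comparing against a stale or missing number.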
Executed in this project (WhatsApp Signal Trader on Binance testnet):
| Iteration | Change | Metric (avg latency) | Reliability | Status |
|---|---|---|---|---|
| #0 | baseline | 2008.9 ms | 1.000 | baseline |
| #1 | parallelize exit-order placement with asyncio.gather | 646.4 ms | 1.000 | ✅ keep |
| #2 | 30s TTL cache on signed /api/v3/account | 592.1 ms | 1.000 | ✅ keep |
| #3 | prewarm account + TCP pool, serialize signals with Sem(1) | 488.8 ms | 1.000 | ✅ keep |
Result: 2008.9 ms → 488.8 ms (76% reduction) with zero regressions across 373 pre-existing tests. The goal of < 500 ms was reached in 3 iterations.
autoresearch/
├── SKILL.md # Entry point read by Claude Code
├── README.md # This file
├── LICENSE # MIT
└── references/
├── core-principles.md # 7 generalisable Karpathy principles
├── autonomous-loop-protocol.md # Phase-by-phase loop rules
├── benchmark-harness.md # Corpus + harness + regression gate
├── results-logging.md # TSV schema for iteration logs
├── plan-workflow.md # /autoresearch:plan wizard
└── security-workflow.md # /autoresearch:security audit
| Domain | Metric | Scope | Verify | Corpus Source |
|---|---|---|---|---|
| Backend code | Tests pass + coverage % | src/**/*.ts | npm test | test fixtures |
| Frontend UI | Lighthouse score | src/components/** | npx lighthouse | staging URLs |
| ML training | val_bpb / loss | train.py | uv run train.py | training dataset |
| Blog/content | Word count + readability | content/*.md | custom script | source manuscripts |
| Performance | Benchmark time (ms) | target files | npm run bench | benchmark inputs |
| Refactoring | Tests pass + LOC reduced | target module | npm test && wc -l | existing test suite |
| Security | OWASP + STRIDE coverage | API/auth/middleware | /autoresearch:security | codebase |
| Real-traffic perf | p95 latency (ms) | hot-path files | python benchmark.py | prod log tail |
The core insight from Karpathy that drives every design decision here:
Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.
MIT — see LICENSE.
If this skill saves you a milestone's worth of manual tuning, a ⭐ on the repo is appreciated.