🔬 Autoresearch Skill

Autonomous, Goal-Directed Iteration for Claude Code

Inspired by Andrej Karpathy's autoresearch — extended into a universal, real-data benchmark-driven workflow for any engineering task.

          ╔════════════════════════════════════════════════════╗
          ║   MODIFY → VERIFY → REGRESS → KEEP / DISCARD → ∞   ║
          ╚════════════════════════════════════════════════════╝

✨ What is this?

Autoresearch is a Claude Code skill that turns a free-form goal like

/autoresearch reduce API p95 latency to 200ms

into an autonomous, self-correcting optimization loop that:

🧠 Parses the goal into seven machine-readable slots
📦 Ingests real data (refusing synthetic corpora)
🛠️ Builds a single-file benchmark harness
📐 Captures a baseline + regression test count
🔁 Iterates — one atomic change at a time
✅ Keeps wins, 🗑️ auto-discards regressions, logs everything
🏁 Stops when the target metric is hit

No hand-holding. No "should I continue?" Just mechanical iteration until the goal is reached.

🌟 Why use it?

🎯 Mechanical, not subjective

Every iteration is judged by a single floating-point metric extracted from a command. "Looks better" is banned.
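
A minimal sketch of that extraction, assuming the verify command prints `metric: <number>` as its first non-empty line. The exact output format and the names `parse_metric` and `run_verify` are assumptions made for illustration, not the skill's actual code:

```python
import re
import subprocess

# Assumed format: the first non-empty stdout line looks like "metric: 488.8".
METRIC_LINE = re.compile(r"^metric:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")


def parse_metric(stdout: str) -> float:
    """Return the float on the first non-empty line, or raise ValueError."""
    for line in stdout.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        match = METRIC_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"expected 'metric: <number>', got {line!r}")
        return float(match.group(1))
    raise ValueError("verify command produced no output")


def run_verify(cmd: list[str]) -> float:
    """Run the verify command and return its metric (hypothetical helper)."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_metric(completed.stdout)
```

For example, `run_verify(["python", "benchmark.py"])` would return a single float, and any output that is not a number fails loudly instead of being interpreted.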

🛡️ Regression-proof

A hard gate rolls back any change that drops a pre-existing passing test — no matter how big the win looked.
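
One way such a gate could be approximated is by comparing passing-test counts before and after a change. The sketch below assumes the regression command is `pytest -q`, whose summary line contains text like `373 passed`. The helper names are hypothetical, and a pure count comparison is a simplification: a newly added passing test could mask a dropped one.

```python
import re

# pytest's summary line contains "<N> passed", e.g. "370 passed, 3 failed in 4.50s".
PASSED = re.compile(r"(\d+) passed")


def count_passed(pytest_output: str) -> int:
    """Return the passing-test count from pytest output, or 0 if none is reported."""
    matches = PASSED.findall(pytest_output)
    return int(matches[-1]) if matches else 0


def gate_passes(n_pre: int, pytest_output: str) -> bool:
    """True when at least as many tests pass now as in the recorded baseline."""
    return count_passed(pytest_output) >= n_pre
```

With a baseline of 373, `gate_passes(373, "370 passed, 3 failed in 4.50s")` returns `False`, so the change would be discarded regardless of its metric.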

📊 Real data only

The harness refuses synthetic corpora. If you can't scrape, export, or tail it from reality, the loop won't start.

♻️ Atomic & reversible

One change per iteration, git-committed before verification. A failed experiment is always git reset --hard HEAD~1 away.
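
The commit-then-verify pattern can be sketched as two small helpers. This is an illustration under assumptions, not the skill's implementation: the function names are hypothetical, and the `run` parameter exists only so the commands can be inspected without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def git(args: Sequence[str]) -> None:
    """Run a git subcommand in the current directory, raising on failure."""
    subprocess.run(["git", *args], check=True)


def commit_experiment(path: str, message: str, run: Runner = git) -> None:
    """Stage exactly one file and commit it before verification starts."""
    run(["add", "--", path])
    run(["commit", "-m", message])


def discard_experiment(run: Runner = git) -> None:
    """Drop the most recent commit and restore the working tree."""
    run(["reset", "--hard", "HEAD~1"])
```

Note that `git reset --hard` also discards uncommitted changes to tracked files, so a loop like this is safest when it starts from a clean working tree.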

🧩 Domain-agnostic

Backend latency, test coverage, bundle size, flakiness, LOC, build time, lighthouse scores — same loop, different metric.

🌐 Global Claude Code skill

Install once under ~/.claude/skills/autoresearch/ and invoke /autoresearch <goal> from any project.

🚀 Quick Start

1. Install (global skill)

# Clone into your Claude Code skills directory
git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git \
  ~/.claude/skills/autoresearch

On Windows:

git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git `
  "$env:USERPROFILE\.claude\skills\autoresearch"

2. Invoke

Open Claude Code in any project and type:

/autoresearch reduce avg_latency_ms below 500

Claude will print the parsed slot dump, walk through the five harness phases, then begin the autonomous loop.

3. Watch it work

Results are appended to autoresearch/results.tsv in your project:

commit    metric     avg_latency_ms  status      description
953b71d   1.000000   2008.9          baseline    77-case corpus, 10 signals executed
953b71d   1.000000   646.4           keep        parallelize _place_exits with asyncio.gather
953b71d   1.000000   592.1           keep        add 30s TTL cache on account()
953b71d   1.000000   488.8           keep        prewarm account+connection pool, serialize signals

🔌 Use with other AI coding assistants

The skill is plain Markdown — drop it into any agent that supports custom prompts, rules, or instructions. Clone the repo once to a staging location, then symlink or copy into each tool's config path:

git clone https://github.com/Muminur/autoresearch-skill-Andrej-Karpathy.git ~/autoresearch-skill

On Windows substitute %USERPROFILE%\autoresearch-skill for ~/autoresearch-skill in the commands below.

🧠 OpenAI Codex CLI

Codex CLI loads custom prompts from ~/.codex/prompts/.

mkdir -p ~/.codex/prompts
cp    ~/autoresearch-skill/SKILL.md  ~/.codex/prompts/autoresearch.md
cp -r ~/autoresearch-skill/references ~/.codex/prompts/

Invoke in Codex chat: /autoresearch reduce API p95 to 200ms.

💡 Tip — add @file ~/.codex/prompts/autoresearch.md to ~/.codex/AGENTS.md to keep the rubric pre-loaded in every session.

🤖 GitHub Copilot (VS Code & JetBrains)

Copilot Chat supports prompt files (*.prompt.md) and instructions files (*.instructions.md).

Project-scoped (checked into the repo):

mkdir -p .github/prompts
cp    ~/autoresearch-skill/SKILL.md  .github/prompts/autoresearch.prompt.md
cp -r ~/autoresearch-skill/references .github/prompts/

User-scoped (available in every project):

| OS | Path |
| --- | --- |
| Windows | `%APPDATA%\Code\User\prompts\autoresearch.prompt.md` |
| macOS | `~/Library/Application Support/Code/User/prompts/autoresearch.prompt.md` |
| Linux | `~/.config/Code/User/prompts/autoresearch.prompt.md` |

Enable prompt files in VS Code settings.json:

{ "chat.promptFiles": true }

Invoke: /autoresearch <goal> in Copilot Chat.

🪐 OpenCode

OpenCode keeps custom commands at ~/.config/opencode/command/.

mkdir -p ~/.config/opencode/command
cp    ~/autoresearch-skill/SKILL.md  ~/.config/opencode/command/autoresearch.md
cp -r ~/autoresearch-skill/references ~/.config/opencode/command/

Invoke: /autoresearch <goal> in OpenCode.

⚡ Cursor

Cursor loads rules from .cursor/rules/*.mdc.

mkdir -p .cursor/rules
cp    ~/autoresearch-skill/SKILL.md  .cursor/rules/autoresearch.mdc
cp -r ~/autoresearch-skill/references .cursor/rules/

Prepend this frontmatter to .cursor/rules/autoresearch.mdc:

---
description: Autonomous goal-directed iteration
globs:
alwaysApply: false
---

Invoke: @autoresearch reduce API p95 to 200ms in Cursor Chat (the @rule syntax attaches the rule to the prompt).

For a global install across every project, use Cursor's Settings → Rules → User Rules pane and paste the skill content.

🌊 Windsurf

Windsurf loads rules/workflows from .windsurf/rules/.

mkdir -p .windsurf/rules
cp    ~/autoresearch-skill/SKILL.md  .windsurf/rules/autoresearch.md
cp -r ~/autoresearch-skill/references .windsurf/rules/

Prepend this frontmatter:

---
trigger: manual
description: Autonomous goal-directed iteration
---

Invoke: @autoresearch reduce bundle size below 200KB in Windsurf Chat.

For global scope, drop the files in ~/.windsurf/rules/ instead of the project-local path.

📝 Vanilla VS Code / Continue / any Markdown-aware chat

Any extension that respects .github/copilot-instructions.md (e.g. GitHub Copilot Chat, Cline, Roo Code):

mkdir -p .github
cat ~/autoresearch-skill/SKILL.md >> .github/copilot-instructions.md

For Continue:

mkdir -p ~/.continue/commands
cp ~/autoresearch-skill/SKILL.md ~/.continue/commands/autoresearch.md

Then prompt the assistant: "Apply the autoresearch skill to reduce p95 to 200ms" — the full playbook is already in context.

🧾 Compatibility matrix

| Tool | Install path | Invocation |
| --- | --- | --- |
| Claude Code | `~/.claude/skills/autoresearch/` | `/autoresearch <goal>` |
| OpenAI Codex CLI | `~/.codex/prompts/` | `/autoresearch <goal>` |
| GitHub Copilot | `.github/prompts/` · user prompts dir | `/autoresearch <goal>` |
| OpenCode | `~/.config/opencode/command/` | `/autoresearch <goal>` |
| Cursor | `.cursor/rules/*.mdc` | `@autoresearch <goal>` |
| Windsurf | `.windsurf/rules/` | `@autoresearch <goal>` |
| Continue | `~/.continue/commands/` | `/autoresearch <goal>` |
| Cline / Roo Code | `.github/copilot-instructions.md` | natural language |

⚠️ Note — /autoresearch works autonomously. For best results use a model with strong agentic and long-context capabilities (Claude Opus/Sonnet 4+, GPT-4.1+, Gemini 2.5 Pro).

🧭 How it works

flowchart TD
    A["/autoresearch [goal]"] --> B[Parse goal → 7 slots]
    B --> C{"corpus<br/>required?"}
    C -- yes --> D[📦 Phase A<br/>Ingest real data]
    C -- no --> E[🛠️ Phase B<br/>Build harness]
    D --> E
    E --> F[📐 Phase C<br/>Capture baseline]
    F --> G[🛡️ Phase D<br/>Run regression suite]
    G --> H[🔎 Phase E<br/>Read hot path]
    H --> I((LOOP))
    I --> J[Modify ONE file]
    J --> K[git commit]
    K --> L[Run harness]
    L --> M[Run regression suite]
    M --> N{metric<br/>improved?}
    N -- yes --> O[✅ keep + log]
    N -- no --> P[🗑️ discard + reset]
    M -- regressed --> P
    O --> Q{goal<br/>hit?}
    P --> I
    Q -- no --> I
    Q -- yes --> R[🏁 Done]

    style A fill:#8A2BE2,stroke:#333,color:#fff
    style D fill:#FF6F00,stroke:#333,color:#fff
    style E fill:#FF6F00,stroke:#333,color:#fff
    style F fill:#FF6F00,stroke:#333,color:#fff
    style G fill:#FF6F00,stroke:#333,color:#fff
    style H fill:#FF6F00,stroke:#333,color:#fff
    style O fill:#2e7d32,stroke:#333,color:#fff
    style P fill:#c62828,stroke:#333,color:#fff
    style R fill:#1565c0,stroke:#333,color:#fff
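
The keep/discard decision at the bottom of the flowchart can be sketched as a pure function. This is a simplified illustration with hypothetical names. It assumes that "improved" means strictly better than the best kept value and that reaching the target exactly counts as hitting the goal, and it does not model the "simplicity wins" tie-break from the rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    status: str  # "keep" or "discard"
    best: float  # best metric value after this iteration
    done: bool   # True once the target has been reached


def improved(new: float, best: float, direction: str) -> bool:
    """Strict improvement in the requested direction ("min" or "max")."""
    return new < best if direction == "min" else new > best


def goal_hit(value: float, target: Optional[float], direction: str) -> bool:
    """An unbounded goal (target=None) is never considered hit."""
    if target is None:
        return False
    return value <= target if direction == "min" else value >= target


def judge(new: float, best: float, target: Optional[float],
          direction: str, regression_ok: bool) -> Outcome:
    """Discard on any regression or non-improvement; otherwise keep."""
    if not regression_ok or not improved(new, best, direction):
        return Outcome("discard", best, False)
    return Outcome("keep", new, goal_hit(new, target, direction))
```

For instance, `judge(488.8, 592.1, 500.0, "min", True)` yields a `keep` outcome with `done=True`, while the same metric with `regression_ok=False` is discarded and the previous best is retained.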

🎛️ Subcommands

| Command | Purpose |
| --- | --- |
| `/autoresearch <goal>` | Default path — parse free-form goal, build harness, loop until goal met |
| `/autoresearch` | Bare autonomous loop (assumes scope/metric/verify already defined) |
| `/autoresearch:plan` | Interactive wizard: Goal → Scope → Metric → Verify |
| `/autoresearch:security` | Autonomous security audit (STRIDE + OWASP Top 10 + red-team personas) |

Chain with Claude Code's /loop for bounded runs:

/loop 25 /autoresearch reduce bundle size below 200KB

🧩 Goal-parsing rubric

When you type /autoresearch <goal>, Claude extracts seven slots from your free-form text:

| Slot | Example | Fallback |
| --- | --- | --- |
| metric | `latency`, `reliability`, `coverage`, `flakiness`, `bundle size` | Ask user |
| direction | `reduce/lower/minimise` → min · `raise/maximise` → max | Inferred from noun |
| target | `500ms`, `95%`, `0%`, `<200KB` | "best achievable" (unbounded) |
| scope | Files matching goal domain terms | Whole repo minus deps |
| corpus_source | `prod logs`, `fixtures`, `scraped data` | Required for empirical metrics |
| verify_cmd | `python benchmark.py` | Constructed during Phase B |
| regression_cmd | `pytest -q`, `npm test`, `cargo test`, `go test ./...` | Auto-detected |

Three worked examples:

/autoresearch reduce API p95 latency to 200ms
→ metric=p95_latency_ms, direction=minimise, target=200, verify_cmd=python benchmark.py

/autoresearch reduce test flakiness to 0%
→ metric=flaky_test_rate, direction=minimise, target=0, corpus=CI run history

/autoresearch increase signal-parser reliability to 99%
→ metric=reliability, direction=maximise, target=0.99, regression_cmd=pytest -q
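
In the skill itself the model does this parsing. Purely to make the slot structure concrete, here is a toy sketch that fills only `direction` and `target` from the goal text and leaves the other five slots to their fallbacks. The class, the regex, and the verb sets (taken from the table and the worked examples) are illustrative assumptions:

```python
import re
from dataclasses import dataclass
from typing import Optional

MIN_VERBS = {"reduce", "lower", "minimise"}
MAX_VERBS = {"raise", "maximise", "increase"}

# A trailing number with an optional unit, e.g. "200ms", "99%", "500".
TARGET = re.compile(r"(\d+(?:\.\d+)?)\s*(%|[A-Za-z]+)?\s*$")


@dataclass
class GoalSlots:
    metric: Optional[str] = None
    direction: Optional[str] = None
    target: Optional[float] = None
    scope: Optional[str] = None
    corpus_source: Optional[str] = None
    verify_cmd: Optional[str] = None
    regression_cmd: Optional[str] = None


def parse_direction(goal: str) -> Optional[str]:
    """Map the leading verb to a direction; None means 'infer from the noun'."""
    words = goal.lower().split()
    verb = words[0] if words else ""
    if verb in MIN_VERBS:
        return "minimise"
    if verb in MAX_VERBS:
        return "maximise"
    return None


def parse_target(goal: str) -> Optional[float]:
    """Read the trailing number; percentages become fractions (99% -> 0.99)."""
    match = TARGET.search(goal)
    if match is None:
        return None  # unbounded: "best achievable"
    value = float(match.group(1))
    return value / 100 if match.group(2) == "%" else value


def parse_goal(goal: str) -> GoalSlots:
    """Fill the two slots this toy parser understands; leave the rest unset."""
    return GoalSlots(direction=parse_direction(goal), target=parse_target(goal))
```

Slots left as `None` correspond to the Fallback column: ask the user, auto-detect, or construct during a later phase.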

🏗️ The Harness Protocol

The default path runs through five mandatory phases before the loop begins.

sequenceDiagram
    autonumber
    participant User
    participant Claude
    participant Repo
    participant Tests

    User->>Claude: /autoresearch [goal]
    Claude->>Claude: Parse goal → 7 slots
    Claude->>User: Print parsed-slot dump

    Note over Claude,Repo: 📦 Phase A — Corpus Ingestion
    Claude->>Repo: Scrape/locate real data
    Claude->>Repo: Write autoresearch/data/*.jsonl
    Claude-->>User: corpus N cases from source

    Note over Claude,Repo: 🛠️ Phase B — Harness Construction
    Claude->>Repo: Write benchmark.py (single file)
    Claude->>Claude: verify stdout starts with metric:

    Note over Claude,Repo: 📐 Phase C — Baseline Capture
    Claude->>Repo: Run benchmark.py
    Claude->>Repo: Append iteration #0 to results.tsv

    Note over Claude,Tests: 🛡️ Phase D — Regression Gate
    Claude->>Tests: Run pytest -q (or equivalent)
    Claude->>Repo: Record N_pre in .regression-baseline

    Note over Claude,Repo: 🔎 Phase E — Hot-Path Reading
    Claude->>Repo: Trace entry → handler → I/O
    Claude-->>User: 3-5 candidate ideas

    loop Until goal hit or interrupted
        Claude->>Repo: Modify ONE file
        Claude->>Repo: git commit
        Claude->>Repo: Run benchmark.py
        Claude->>Tests: Run regression suite
        alt regression passed && metric improved
            Claude->>Repo: status=keep
        else regression failed OR metric worse
            Claude->>Repo: git reset --hard HEAD~1
            Claude->>Repo: status=discard
        end
    end

Full protocol: references/benchmark-harness.md
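
To make Phases A to C concrete, here is a minimal sketch of what a single-file harness might look like. It assumes the corpus lives in `autoresearch/data/*.jsonl` with one JSON object per line, that the metric is average latency in milliseconds, and that the first stdout line has the form `metric: <value>`. `handle_case` is a hypothetical stand-in for the code under test; the real protocol is defined in `references/benchmark-harness.md`.

```python
import json
import statistics
import sys
import time
from pathlib import Path
from typing import Callable


def load_corpus(data_dir: Path) -> list[dict]:
    """Read every non-blank line of every *.jsonl file as one case."""
    cases: list[dict] = []
    for path in sorted(data_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            cases.extend(json.loads(line) for line in handle if line.strip())
    return cases


def measure_avg_latency_ms(cases: list[dict], handler: Callable[[dict], object]) -> float:
    """Time the handler once per case and return the mean in milliseconds."""
    timings = []
    for case in cases:
        start = time.perf_counter()
        handler(case)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.fmean(timings)


def handle_case(case: dict) -> None:
    """Hypothetical placeholder for the system under test."""
    json.dumps(case)


def main(data_dir: Path = Path("autoresearch/data")) -> int:
    """Print the metric line and return a process exit code."""
    cases = load_corpus(data_dir)
    if not cases:
        # No real corpus means no benchmark.
        print("error: no corpus found, refusing to run", file=sys.stderr)
        return 1
    print(f"metric: {measure_avg_latency_ms(cases, handle_case):.6f}")
    return 0
```

Wiring `main()` to the script entry point and passing its return value to `sys.exit` is left out here. Refusing to run on an empty corpus mirrors the real-data-only rule, although a check like this cannot tell real data from synthetic data on its own.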

🛡️ The Eleven Critical Rules

1. Loop until done — unbounded: loop forever; bounded: loop N then summarize
2. Read before write — full context before any modification
3. One change per iteration — atomic, attributable
4. Mechanical verification only — no subjective judgments
5. Automatic rollback — failed changes revert instantly
6. Simplicity wins — equal result + less code = keep
7. Git is memory — every kept change commits; agent reads history
8. When stuck, think harder — re-read, combine near-misses, try radical
9. Real data only — synthetic cases forbidden
10. Regression gate is absolute — drop a test, auto-discard
11. Harness is read-only — harness edits need a harness: commit

📊 Real-world case study

Executed in this project (WhatsApp Signal Trader on Binance testnet):

| Iteration | Change | Metric (avg latency) | Reliability | Status |
| --- | --- | --- | --- | --- |
| #0 | baseline | 2008.9 ms | 1.000 | baseline |
| #1 | parallelize exit-order placement with `asyncio.gather` | 646.4 ms | 1.000 | ✅ keep |
| #2 | 30s TTL cache on signed `/api/v3/account` | 592.1 ms | 1.000 | ✅ keep |
| #3 | prewarm account + TCP pool, serialize signals with Sem(1) | 488.8 ms | 1.000 | ✅ keep |

Result: average latency fell from 2008.9 ms to 488.8 ms (a 76% reduction) with zero regressions across 373 pre-existing tests. The goal of < 500 ms was reached in 3 iterations.
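
The first kept change replaced sequential awaits with `asyncio.gather`. The sketch below illustrates that general technique rather than the project's code: `FakeExchange`, `place_order`, and the order names are hypothetical, and the fake client only counts how many calls are in flight at once.

```python
import asyncio


class FakeExchange:
    """Hypothetical stand-in for a network client; it only tracks concurrency."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.max_in_flight = 0

    async def place_order(self, name: str) -> str:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network round trip
        self.in_flight -= 1
        return f"{name}:ok"


async def place_exits_sequential(client: FakeExchange, names: list[str]) -> list[str]:
    """Each order waits for the previous one to finish."""
    return [await client.place_order(name) for name in names]


async def place_exits_parallel(client: FakeExchange, names: list[str]) -> list[str]:
    """All orders are in flight together; gather keeps the input order."""
    return list(await asyncio.gather(*(client.place_order(name) for name in names)))
```

Run with `asyncio.run(place_exits_parallel(FakeExchange(), ["take_profit", "stop_loss"]))`. When the calls are independent, total time approaches that of the slowest call instead of the sum of all calls.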

📁 File structure

autoresearch/
├── SKILL.md                             # Entry point read by Claude Code
├── README.md                            # This file
├── LICENSE                              # MIT
└── references/
    ├── core-principles.md               # 7 generalisable Karpathy principles
    ├── autonomous-loop-protocol.md      # Phase-by-phase loop rules
    ├── benchmark-harness.md             # Corpus + harness + regression gate
    ├── results-logging.md               # TSV schema for iteration logs
    ├── plan-workflow.md                 # /autoresearch:plan wizard
    └── security-workflow.md             # /autoresearch:security audit

🧬 Domain adaptability

| Domain | Metric | Scope | Verify | Corpus Source |
| --- | --- | --- | --- | --- |
| Backend code | Tests pass + coverage % | `src/**/*.ts` | `npm test` | test fixtures |
| Frontend UI | Lighthouse score | `src/components/**` | `npx lighthouse` | staging URLs |
| ML training | val_bpb / loss | `train.py` | `uv run train.py` | training dataset |
| Blog/content | Word count + readability | `content/*.md` | custom script | source manuscripts |
| Performance | Benchmark time (ms) | target files | `npm run bench` | benchmark inputs |
| Refactoring | Tests pass + LOC reduced | target module | `npm test && wc -l` | existing test suite |
| Security | OWASP + STRIDE coverage | API/auth/middleware | `/autoresearch:security` | codebase |
| Real-traffic perf | p95 latency (ms) | hot-path files | `python benchmark.py` | prod log tail |

🙏 Credit & Inspiration

Built on the shoulders of giants.

Andrej Karpathy — for the original autoresearch pattern: single file, single metric, iterate.
Strix — adversarial AI security testing with PoC validation (inspiration for /autoresearch:security).
OWASP Top 10 — the industry-standard vulnerability taxonomy.
STRIDE — Microsoft's threat-modeling framework.

The core insight from Karpathy that drives every design decision here:

Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.

📝 License

MIT — see LICENSE.

If this skill saves you a milestone's worth of manual tuning, a ⭐ on the repo is appreciated.

Report an issue · Open a PR

