A 4-stage evaluation framework for testing Claude Code plugin component triggering. Validates whether skills, agents, commands, hooks, and MCP servers correctly activate when expected.
Claude Code plugins contain multiple component types (skills, agents, commands) that trigger based on user prompts. Testing these triggers manually is time-consuming and error-prone. This framework automates the entire evaluation process:
| Feature | Description |
|---|---|
| 4-Stage Pipeline | Analysis → Generation → Execution → Evaluation |
| Multi-Component | Skills, agents, commands, hooks, and MCP servers |
| Programmatic Detection | 100% confidence detection by parsing tool captures |
| Semantic Testing | Synonym and paraphrase variations to test robustness |
| Resume Capability | Checkpoint after each stage, resume interrupted runs |
| Cost Estimation | Token and USD estimates before execution |
| Batch API Support | 50% cost savings on large runs via Anthropic Batches API |
| Multiple Formats | JSON, YAML, JUnit XML, TAP output |
# Clone the repository
git clone https://github.com/sjnims/cc-plugin-eval.git
cd cc-plugin-eval
# Install dependencies
npm install
# Build
npm run build
# Create .env file with your API key
echo "ANTHROPIC_API_KEY=sk-ant-your-key-here" > .env
# See cost estimate without running (recommended first)
npx cc-plugin-eval run -p ./path/to/your/plugin --dry-run
# Run full evaluation
npx cc-plugin-eval run -p ./path/to/your/plugin
flowchart LR
  subgraph Input
    P[Plugin Directory]
  end
  subgraph Pipeline
    S1[**Stage 1: Analysis**<br/>Parse plugin structure,<br/>extract triggers]
    S2[**Stage 2: Generation**<br/>Create test scenarios<br/>positive & negative]
    S3[**Stage 3: Execution**<br/>Run scenarios via<br/>Agent SDK]
    S4[**Stage 4: Evaluation**<br/>Detect triggers,<br/>calculate metrics]
  end
  P --> S1 --> S2 --> S3 --> S4
  S1 --> O1[analysis.json]
  S2 --> O2[scenarios.json]
  S3 --> O3[transcripts/]
  S4 --> O4[evaluation.json]
| Stage | Purpose | Method | Output |
|---|---|---|---|
| 1. Analysis | Parse plugin structure, extract trigger phrases | Deterministic parsing | analysis.json |
| 2. Generation | Create test scenarios | LLM for skills/agents, deterministic for commands | scenarios.json |
| 3. Execution | Run scenarios against Claude Agent SDK | Tool capture hooks | transcripts/ |
| 4. Evaluation | Detect triggers, calculate metrics | Programmatic first, LLM judge for quality | evaluation.json |
Each component generates multiple scenario types to thoroughly test triggering:
| Type | Description | Example |
|---|---|---|
| direct | Exact trigger phrase | "create a skill" |
| paraphrased | Same intent, different words | "add a new skill to my plugin" |
| edge_case | Unusual but valid | "skill plz" |
| negative | Should NOT trigger | "tell me about database skills" |
| semantic | Synonym variations | "generate a skill" vs "create a skill" |
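The table above implies a simple expectation rule: every scenario type except `negative` should activate the component. A minimal sketch of that rule follows. The type names come from the table; `ExampleScenario` and `expectsTrigger` are hypothetical names for illustration, and the library's own `TestScenario` type may be shaped differently.

```typescript
// Scenario type names are copied from the table above; the rest is illustrative.
type ScenarioType = "direct" | "paraphrased" | "edge_case" | "negative" | "semantic";

// Hypothetical shape: the real TestScenario type in cc-plugin-eval may differ.
interface ExampleScenario {
  type: ScenarioType;
  prompt: string;
}

// Only "negative" scenarios are expected NOT to trigger the component.
function expectsTrigger(type: ScenarioType): boolean {
  return type !== "negative";
}

const examples: ExampleScenario[] = [
  { type: "direct", prompt: "create a skill" },
  { type: "paraphrased", prompt: "add a new skill to my plugin" },
  { type: "edge_case", prompt: "skill plz" },
  { type: "negative", prompt: "tell me about database skills" },
  { type: "semantic", prompt: "generate a skill" },
];

for (const s of examples) {
  console.log(`${s.type.padEnd(12)} expects trigger: ${expectsTrigger(s.type)}`);
}
```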
# Run complete evaluation
cc-plugin-eval run -p ./plugin
# With options
cc-plugin-eval run -p ./plugin \
  --config custom-config.yaml \
  --verbose \
  --samples 3
# Stage 1: Analysis only
cc-plugin-eval analyze -p ./plugin
# Stages 1-2: Analysis + Generation
cc-plugin-eval generate -p ./plugin
# Stages 1-3: Analysis + Generation + Execution
cc-plugin-eval execute -p ./plugin
# Resume an interrupted run
cc-plugin-eval resume -r <run-id>
# List previous runs
cc-plugin-eval list -p ./plugin
# Generate report from existing results
cc-plugin-eval report -r <run-id> --output junit-xml
| Option | Description |
|---|---|
| `-p, --plugin <path>` | Plugin directory path |
| `-c, --config <path>` | Config file (default: config.yaml) |
| `--dry-run` | Generate scenarios without execution |
| `--estimate` | Show cost estimate before execution |
| `--verbose` | Enable debug output |
| `--fast` | Only run previously failed scenarios |
| `--no-batch` | Force synchronous (non-batch) execution |
| `--rewind` | Undo file changes after each scenario |
| `--semantic` | Enable semantic variation testing |
| `--samples <n>` | Multi-sample judgment count |
| `--reps <n>` | Repetitions per scenario |
| `--output <format>` | Output format: json, yaml, junit-xml, tap |
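For CI scripts, it can help to assemble the `run` arguments from a typed object rather than by string concatenation. The sketch below uses only flags that appear with `run` in the examples above (`-p`, `--config`, `--dry-run`, `--verbose`, `--samples`). `RunOptions` and `buildRunArgs` are hypothetical names for illustration, not part of cc-plugin-eval.

```typescript
// Hypothetical options object; only flags shown with `run` above are covered.
interface RunOptions {
  plugin: string;
  config?: string;
  dryRun?: boolean;
  verbose?: boolean;
  samples?: number;
}

// Builds the argument list that follows `cc-plugin-eval` on the command line.
function buildRunArgs(options: RunOptions): string[] {
  const args: string[] = ["run", "-p", options.plugin];
  if (options.config !== undefined) args.push("--config", options.config);
  if (options.dryRun) args.push("--dry-run");
  if (options.verbose) args.push("--verbose");
  if (options.samples !== undefined) args.push("--samples", String(options.samples));
  return args;
}

const args = buildRunArgs({
  plugin: "./plugin",
  config: "custom-config.yaml",
  verbose: true,
  samples: 3,
});

// One way to launch it (not executed here):
//   spawnSync("npx", ["cc-plugin-eval", ...args], { stdio: "inherit" })
// using spawnSync from node:child_process.
console.log(["cc-plugin-eval", ...args].join(" "));
```

Passing the arguments as an array avoids shell quoting problems when a plugin path contains spaces.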
In addition to the CLI, cc-plugin-eval exports a programmatic API for integration into build systems, test frameworks, and custom tooling.
npm install cc-plugin-eval
import {
  runAnalysis,
  runGeneration,
  runExecution,
  runEvaluation,
  loadConfigWithOverrides,
  consoleProgress,
} from "cc-plugin-eval";
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
} from "cc-plugin-eval/types";

// Load configuration
const config = loadConfigWithOverrides("config.yaml", {
  plugin: "./path/to/plugin",
});

// Stage 1: Analyze plugin structure
const analysis = await runAnalysis(config);

// Stage 2: Generate test scenarios
const generation = await runGeneration(analysis, config);

// Stage 3: Execute scenarios (captures tool interactions)
const execution = await runExecution(
  analysis,
  generation.scenarios,
  config,
  consoleProgress, // or provide custom progress callbacks
);

// Stage 4: Evaluate results
const evaluation = await runEvaluation(
  analysis.plugin_name,
  generation.scenarios,
  execution.results,
  config,
  consoleProgress,
);
console.log(`Accuracy: ${(evaluation.metrics.accuracy * 100).toFixed(1)}%`);
| Export | Description |
|---|---|
| runAnalysis | Stage 1: Parse plugin structure and extract triggers |
| runGeneration | Stage 2: Generate test scenarios for components |
| runExecution | Stage 3: Execute scenarios and capture tool interactions |
| runEvaluation | Stage 4: Evaluate results and calculate metrics |
| loadConfigWithOverrides | Load configuration with CLI-style overrides |
| consoleProgress | Default progress reporter (console output) |
Import types via the cc-plugin-eval/types subpath:
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
  ExecutionResult,
  EvaluationResult,
  EvalMetrics,
} from "cc-plugin-eval/types";
Configuration is managed via config.yaml. Here's a quick reference:
scope:
  skills: true # Evaluate skill components
  agents: true # Evaluate agent components
  commands: true # Evaluate command components
  hooks: false # Evaluate hook components
  mcp_servers: false # Evaluate MCP server components
generation:
  model: "claude-sonnet-4-5-20250929"
  scenarios_per_component: 5 # Test scenarios per component
  diversity: 0.7 # 0.0-1.0, higher = more unique scenarios
  semantic_variations: true # Generate synonym variations
execution:
  model: "claude-sonnet-4-20250514"
  max_turns: 5 # Conversation turns per scenario
  timeout_ms: 60000 # Timeout per scenario (1 min)
  max_budget_usd: 10.0 # Stop if cost exceeds this
  disallowed_tools: # Safety: block file operations
    - Write
    - Edit
    - Bash
evaluation:
  model: "claude-sonnet-4-5-20250929"
  detection_mode: "programmatic_first" # Or "llm_only"
  num_samples: 1 # Multi-sample judgment
See the full config.yaml for all options, including:
- tuning: Fine-tune timeouts, retry behavior, and token estimates
- conflict_detection: Detect when multiple components trigger for the same prompt
- batch_threshold: Use Anthropic Batches API for cost savings (50% discount)
- sanitization: PII redaction with ReDoS-safe custom patterns

By default, scenarios testing the same component share a session with /clear between them. This reduces subprocess overhead by ~80%:
| Mode | Overhead per Scenario | 100 Scenarios |
|---|---|---|
| Batched (default) | ~1-2s after first | ~2-3 minutes |
| Isolated | ~5-8s each | ~8-13 minutes |
The /clear command resets conversation history between scenarios while reusing the subprocess and loaded plugin.
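The totals in the table can be reproduced with simple arithmetic. The sketch below assumes that, in batched mode, the first scenario pays the full isolated startup overhead and every later scenario pays only the post-`/clear` overhead, and that all 100 scenarios share one session. Both are simplifying assumptions, since the text says sessions are shared per component. The function names are illustrative.

```typescript
// Per-scenario overhead ranges in seconds, taken from the table above.
interface Range {
  low: number;
  high: number;
}

const isolated: Range = { low: 5, high: 8 };
const batchedAfterFirst: Range = { low: 1, high: 2 };

// Isolated mode: every scenario pays the full startup overhead.
function isolatedTotal(n: number): Range {
  return { low: n * isolated.low, high: n * isolated.high };
}

// Batched mode. ASSUMPTION: the first scenario pays the isolated overhead,
// and each later scenario pays only the post-/clear overhead.
function batchedTotal(n: number): Range {
  if (n <= 0) return { low: 0, high: 0 };
  return {
    low: isolated.low + (n - 1) * batchedAfterFirst.low,
    high: isolated.high + (n - 1) * batchedAfterFirst.high,
  };
}

const toMinutes = (seconds: number): string => (seconds / 60).toFixed(1);

const iso = isolatedTotal(100);
const bat = batchedTotal(100);
console.log(`isolated: ${toMinutes(iso.low)}-${toMinutes(iso.high)} min`);
console.log(`batched:  ${toMinutes(bat.low)}-${toMinutes(bat.high)} min`);
```

Under these assumptions the isolated total is 500-800 seconds (about 8.3-13.3 minutes) and the batched total is 104-206 seconds (about 1.7-3.4 minutes), which is consistent with the rounded figures in the table.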
Switch to isolated mode when you need complete separation between scenarios:
- rewind_file_changes: true (automatically uses isolated mode)

To use isolated mode:
execution:
  session_strategy: "isolated"
Or via the deprecated (but still supported) option:
execution:
  session_isolation: true
After a run, results are saved to:
results/
└── {plugin-name}/
    └── {run-id}/
        ├── state.json # Pipeline state (for resume)
        ├── analysis.json # Stage 1: Parsed components
        ├── scenarios.json # Stage 2: Generated test cases
        ├── execution-metadata.json # Stage 3: Execution stats
        ├── evaluation.json # Stage 4: Results & metrics
        └── transcripts/
            └── {scenario-id}.json # Individual execution transcripts
{
  "results": [
    {
      "scenario_id": "skill-create-direct-001",
      "triggered": true,
      "confidence": 100,
      "quality_score": 9.2,
      "detection_source": "programmatic",
      "has_conflict": false
    }
  ],
  "metrics": {
    "total_scenarios": 25,
    "accuracy": 0.92,
    "trigger_rate": 0.88,
    "avg_quality": 8.7,
    "conflict_count": 1
  }
}
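A file shaped like the sample above can drive a simple CI gate. The following sketch checks `metrics.accuracy` and `metrics.conflict_count` against thresholds. The field names are copied from the sample; the interfaces, the `gate` function, and the threshold values are hypothetical and are not the library's exported types.

```typescript
// Local sketch of the sample's shape; not the exported EvaluationResult type.
interface ResultEntry {
  scenario_id: string;
  triggered: boolean;
  confidence: number;
  quality_score: number;
  detection_source: string;
  has_conflict: boolean;
}

interface EvaluationFile {
  results: ResultEntry[];
  metrics: {
    total_scenarios: number;
    accuracy: number;
    trigger_rate: number;
    avg_quality: number;
    conflict_count: number;
  };
}

// Returns human-readable problems; an empty list means the gate passes.
function gate(evaluation: EvaluationFile, minAccuracy: number, maxConflicts: number): string[] {
  const problems: string[] = [];
  const { accuracy, conflict_count } = evaluation.metrics;
  if (accuracy < minAccuracy) {
    problems.push(`accuracy ${accuracy} is below ${minAccuracy}`);
  }
  if (conflict_count > maxConflicts) {
    problems.push(`conflict_count ${conflict_count} exceeds ${maxConflicts}`);
  }
  return problems;
}

// Same values as the sample above. In practice the object would come from
// JSON.parse on the contents of evaluation.json.
const sample: EvaluationFile = {
  results: [
    {
      scenario_id: "skill-create-direct-001",
      triggered: true,
      confidence: 100,
      quality_score: 9.2,
      detection_source: "programmatic",
      has_conflict: false,
    },
  ],
  metrics: {
    total_scenarios: 25,
    accuracy: 0.92,
    trigger_rate: 0.88,
    avg_quality: 8.7,
    conflict_count: 1,
  },
};

console.log(gate(sample, 0.9, 0));
```

With the sample values, an accuracy floor of 0.9 passes, while a conflict limit of 0 is exceeded by the one recorded conflict.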
Programmatic detection is primary for maximum accuracy:
- Skill, Task, and SlashCommand calls
- `mcp__<server>__<tool>`
- SDKHookResponseMessage events

The LLM judge is secondary, used for quality assessment.
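The programmatic detection described above comes down to classifying captured tool names. Here is a minimal sketch of that idea, not the project's actual detector. It assumes `Skill`, `Task`, and `SlashCommand` map to skills, agents, and commands respectively, and that MCP tool names follow the `mcp__<server>__<tool>` pattern. The function name and the `github`/`create_issue` example are hypothetical.

```typescript
type BuiltinKind = "skill" | "agent" | "command";

type Detected =
  | { kind: BuiltinKind }
  | { kind: "mcp"; server: string; tool: string }
  | { kind: "other" };

// ASSUMPTION: Skill -> skill, Task -> agent, SlashCommand -> command.
const BUILTIN_TOOLS = new Map<string, BuiltinKind>([
  ["Skill", "skill"],
  ["Task", "agent"],
  ["SlashCommand", "command"],
]);

// Lazy match on the server part so tool names may contain underscores.
const MCP_PATTERN = /^mcp__(.+?)__(.+)$/;

function classifyToolName(name: string): Detected {
  const builtin = BUILTIN_TOOLS.get(name);
  if (builtin !== undefined) {
    return { kind: builtin };
  }
  const match = MCP_PATTERN.exec(name);
  if (match !== null) {
    return { kind: "mcp", server: match[1], tool: match[2] };
  }
  return { kind: "other" };
}

for (const name of ["Skill", "Task", "SlashCommand", "mcp__github__create_issue", "Read"]) {
  console.log(name, JSON.stringify(classifyToolName(name)));
}
```

Hook detection is left out because it relies on `SDKHookResponseMessage` events rather than tool names.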
npm install # Install dependencies
npm run build # Build TypeScript
npm test # Run tests
npm run lint # Lint code
npm run typecheck # Type check
See CONTRIBUTING.md for detailed development setup, code style, testing requirements, and pull request guidelines.
Default: execution.permission_bypass: true enables automated evaluation by automatically approving all tool invocations. This is required for unattended runs but has security implications:
- Use disallowed_tools to restrict dangerous operations (default: [Write, Edit, Bash])
- Set permission_bypass: false for manual review (disables automation)

Security Note: With permission bypass enabled, use strict disallowed_tools and run in sandboxed environments when evaluating untrusted plugins.
Default: output.sanitization.enabled: false for backwards compatibility. Enable sanitization for PII-sensitive environments:
output:
  sanitize_transcripts: true # Redact saved files
  sanitize_logs: true # Redact console output
  sanitization:
    enabled: true
    custom_patterns: # Optional domain-specific patterns
      - pattern: "INTERNAL-\\w+"
        replacement: "[REDACTED_ID]"
Built-in redaction: API keys, JWT tokens, emails, phone numbers, SSNs, credit card numbers.
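To show what a custom pattern like the one above would do to a string, here is a small sketch using JavaScript's built-in `RegExp`. It is not the project's sanitizer and does not implement its ReDoS safeguards; `CustomPattern`, `redact`, and the sample text are hypothetical.

```typescript
// Mirrors the pattern/replacement pair from the YAML above.
interface CustomPattern {
  pattern: string;
  replacement: string;
}

const patterns: CustomPattern[] = [
  { pattern: "INTERNAL-\\w+", replacement: "[REDACTED_ID]" },
];

// Applies each pattern globally, in order, to the input text.
function redact(text: string, custom: CustomPattern[]): string {
  return custom.reduce(
    (acc, p) => acc.replace(new RegExp(p.pattern, "g"), p.replacement),
    text,
  );
}

console.log(redact("Ticket INTERNAL-4821 assigned to INTERNAL-ops_team", patterns));
// prints "Ticket [REDACTED_ID] assigned to [REDACTED_ID]"
```

Both the YAML double-quoted string and the TypeScript string literal need `\\w` to produce the regex token `\w`.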
Enterprise use cases: Enable when handling PII or complying with GDPR, HIPAA, SOC 2, or similar regulations.
The default disallowed_tools: [Write, Edit, Bash] prevents file modifications and shell commands. Modify with caution:
- Allow Write/Edit only if testing file-modifying plugins
- Allow Bash only if testing shell-executing plugins
- Set rewind_file_changes: true to restore files after each scenario
- API keys are read from the environment (.env), never stored in config
- Set execution.max_budget_usd to cap API spending
- Set execution.timeout_ms to prevent runaway executions
- Plugins are loaded from a local path (plugin.path), no remote loading

For production/enterprise environments with compliance requirements, see the comprehensive security guide in SECURITY.md.
See CONTRIBUTING.md for development setup, code style, and pull request guidelines.
This project follows the Contributor Covenant code of conduct.
Steve Nims (@sjnims)