A community-driven registry for Claude, Cursor, Windsurf, Cline & more. Not affiliated with Anthropic.
Are you the author? Sign in to claim
A Claude Code skill that adds a rubric-based eval layer to any agent project. Framework-agnostic — generates rubric, tes
A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes run(prompt) -> result.
You bring the agent. The skill gives you:
This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:
git clone <repo-url> ~/.claude/skills/eval-layer
Then from any Claude Code session: /eval-layer <your agent path>.
"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.
Two measurements, not one:
[0, 1]mean(judge_score − human_reference_score) on [-1, +1]
abs < 0.10 → well-calibrated0.10 – 0.25 → slight bias, monitorabs > 0.25 → recalibrate before trusting scoresEvery framework adapter returns the same shape. This is what makes cross-framework comparison honest:
{
"recommendation": <schema instance or None>, # agent's structured output
"latency_ms": int, # wall-clock including tool execution
"tool_calls": int, # count of tool invocations
"input_tokens": int | None, # summed across turns
"output_tokens": int | None,
"model_id": str,
"error": str | None,
}
If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
In Claude Code, invoke the skill against your project:
/eval-layer Add an eval layer to /path/to/your/agent
Claude reads the agent, proposes a rubric, and on your confirmation generates:
your-project/
evals/
eval_harness.py
rubrics/main.yaml
prompts/judge.md
test_cases/seed.yaml
reports/
raw/<subject>.jsonl
eval_<subject>_<ts>.md
framework-comparison.html # multi-subject only
Smoke-test one case:
python evals/eval_harness.py --framework <subject> --test-case easy-01 -v
Full run:
python evals/eval_harness.py --framework <subject>
Multi-subject sweep + dashboard:
python evals/eval_harness.py --framework all
python evals/make_html_report.py
Every harness generated by this skill exposes these:
| Flag | Purpose |
|---|---|
--framework NAME | which subject to run (or all) |
--test-case ID | run only one case — essential for debugging adapters |
-v / --verbose | print per-case metadata as it runs |
--trials N | run each case N times (pass@k / variance) |
eval-layer/
├── SKILL.md # entry point — the process Claude follows
├── references/
│ ├── rubric-design.md # dimension catalog, scales, leniency thresholds, anti-patterns
│ ├── judge-prompts.md # judge prompt template + calibration techniques
│ ├── judge-robustness.md # defensive JSON parse, retry-once, never-drop-a-result
│ ├── framework-adapters.md # copy-paste recipes per framework (w/ the metadata contract)
│ ├── structured-output-troubleshooting.md # the three Bedrock Opus errors and the two-stage fix
│ ├── cross-subject-benchmarking.md # multi-subject flow (models / frameworks / prompts)
│ └── html-report-template.html # self-contained Chart.js dashboard template
└── README.md
SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not response_format / output_type. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:
"This model does not support assistant message prefill""No valid tool use or tool use input was found in the Bedrock response""minItems values other than 0 or 1 are not supported"The fix:
Stage 1: run the agent conversationally — tools enabled, no output_type.
Stage 2: feed the final text into a fresh structured-output call.
See references/structured-output-troubleshooting.md.
Before handing off an eval produced by this skill:
reference_scores for leniencyparse_judge_response helper--framework, --test-case, -v, --trials flags presentA Claude Code skill by Hao (駱君昊) that learns your Facebook voice and auto-posts to FB / IG / Threads / X with a 14-day c
1000+ skills curated from Anthropic, Vercel, Stripe, and other engineering teams
Claude Code skill for YouTube creators — channel audits, video SEO, retention scripts, thumbnails, content strategy, Sho
AI image generation skill for Claude Code -- Creative Director powered by Gemini