A community-driven registry for Claude, Cursor, Windsurf, Cline & more. Not affiliated with Anthropic.
Are you the author? Sign in to claim
An open framework for scoring how well an MCP tool definition communicates to an AI agent – the Tool Definition Quality
TDQS is an open framework for scoring how well an MCP tool definition communicates to an AI agent. It is the methodology Glama uses to score every tool of every MCP server and hosted connector in its registry.
This document is the complete specification — the rubric, the exact prompts, the aggregation formulas, and the operational practices behind running it at registry scale — and is sufficient to reimplement TDQS end to end. If you maintain an MCP server and just want a better score, skip to Improving your score and use the rubric as a checklist.
Two empirical studies motivated the framework:
The defect taxonomy in that literature is called "smells". TDQS borrows the term.
A registry that ranks and recommends servers needs a quality signal that is:
TDQS scores a tool definition, not tool behavior. The inputs are exactly what an MCP client sees from tools/list:
| Input | Type | Notes |
|---|---|---|
name | string | required |
title | string | null | optional MCP display title |
description | string | null | the primary scoring target |
inputSchema | JSON Schema | null | parameter structure and per-parameter descriptions |
outputSchema | JSON Schema | null | reduces what the description itself must explain |
annotations | object | null | MCP hints: readOnlyHint, destructiveHint, idempotentHint, openWorldHint |
siblingToolNames | string[] | names of the other tools on the same server |
Sibling tool names matter because selection is competitive: a description is only "clear" if it lets an agent distinguish this tool from its neighbors.
Each tool flows through four stages. Stages 1, 2, and 4 are deterministic code; only stage 3 is an LLM call.
tool definition
│
▼
[1] context signals ──── deterministic schema/annotation analysis + input hash
│
▼
[2] hard gates ───────── degenerate definitions short-circuit (no LLM call)
│
▼
[3] LLM rubric ───────── six dimensions, 1–5 each, justification per dimension
│
▼
[4] post-processing ──── deterministic overrides, flags, smells
│
▼
TDQS (1.0–5.0) + tier (A–F) + per-dimension breakdown
Before any judgment is made, deterministic code extracts structural facts about the definition. These ground the LLM evaluation (so it does not have to count parameters or guess at schema coverage) and are stored alongside the score:
| Signal | Definition |
|---|---|
paramCount | number of properties in inputSchema.properties |
requiredParamCount | length of inputSchema.required |
paramsWithDescriptions | properties with a non-empty description |
paramsWithEnums | properties with an enum |
schemaDescriptionCoverage | round(paramsWithDescriptions / paramCount × 100); 100 when the tool has zero parameters |
hasNestedObjects | any property of type: "object" |
hasOutputSchema | output schema present and non-empty |
hasAnnotations | annotations present and non-empty |
annotationValues | extracted readOnly / destructive / idempotent / openWorld booleans (or null when undeclared) |
titleIsMeaningful | title exists, differs from the name, and is longer than the name |
inputHash | first 16 hex chars of sha256 over a deterministic JSON serialization of the tool's own definition fields (name, title, description, inputSchema, outputSchema, annotations) |
inputHash is what makes scoring incremental. It is written onto the tool row when the definition is captured and recorded again alongside the score, so a tool is considered up to date exactly when its row hash matches the hash stored with its score. A changed definition produces a new hash and re-enters scoring; an unchanged one is skipped without an LLM call. See Running TDQS at scale.
The hash covers only the tool's own definition, not the siblingToolNames context also passed to the evaluator. Because re-scoring is triggered only by a change to a tool's own hash, renaming a sibling does not re-score this tool; the new sibling context is reflected only when this tool's own definition next changes.
Degenerate definitions are scored without an LLM call:
null or whitespace-only) → every dimension scores 1, TDQS = 1.0, tier D, flag No Description. There is nothing to evaluate.The other gate is also evaluated up front, but rather than short-circuit the LLM call, it caps the model's score afterward:
Tautological Description and its Purpose Clarity score is capped at 2 regardless of what the LLM returns. A description that restates the name adds zero information, and models grading "looks reasonable" text will otherwise drift upward.A single LLM call scores six dimensions, each 1–5, each with a 2–3 sentence justification citing specific evidence from the description. The full system prompt is reproduced verbatim in Appendix A; the rubric it encodes:
| # | Dimension | Weight | Question |
|---|---|---|---|
| 1 | Purpose Clarity | 25% | Does the description state what the tool does? |
| 2 | Usage Guidelines | 20% | Does it say when to use this tool vs alternatives? |
| 3 | Behavioral Transparency | 20% | Does it disclose behavior beyond what annotations already declare? |
| 4 | Parameter Semantics | 15% | Does it add meaning beyond what the input schema provides? |
| 5 | Conciseness & Structure | 10% | Is it appropriately sized and front-loaded? |
| 6 | Contextual Completeness | 10% | Given the tool's complexity, is the description complete enough? |
The weights encode where selection and invocation actually fail. Purpose clarity dominates because it is what drives selection; usage guidelines and behavioral transparency follow because mis-selection and surprise side effects are the costliest failure modes; conciseness and completeness matter but rarely break an agent on their own.
Score anchors (each dimension):
Annotation Contradiction flag.schemaDescriptionCoverage > 80%, the baseline is 3 even when the description says nothing about parameters (the schema already does the work). If coverage < 50%, the description must compensate. Zero-parameter tools baseline at 4.These rules are embedded in the prompt and are what keep scores honest and useful:
The prompt also includes three fully worked calibration examples (high / mid / low) and a contradiction example (see Appendix A).
The model returns JSON only: a { score, justification } object for each of the six dimensions, plus a boolean annotation_contradiction and a short summary (exact shape in Appendix A). The response is validated against a schema — integer scores in [1, 5], all fields required — and retried on mismatch. This is the only nondeterministic step in the pipeline.
purpose_clarity = min(score, 2)).annotation_contradiction: true from the LLM raises the Annotation Contradiction flag.TDQS is the weighted sum of the six dimension scores, rounded to one decimal:
const DIMENSION_WEIGHTS = {
purpose_clarity: 0.25,
usage_guidelines: 0.2,
behavioral_transparency: 0.2,
parameter_semantics: 0.15,
conciseness_structure: 0.1,
contextual_completeness: 0.1,
};
const computeTdqs = (scores: Record<string, number>): number => {
let total = 0;
for (const [dimension, weight] of Object.entries(DIMENSION_WEIGHTS)) {
total += scores[dimension] * weight;
}
return Math.round(total * 10) / 10;
};
Worked example (purpose=4, guidelines=2, transparency=2, params=3, conciseness=4, completeness=2):
4×0.25 + 2×0.20 + 2×0.20 + 3×0.15 + 4×0.10 + 2×0.10 = 2.85 → TDQS 2.9
Scores map to letter tiers for at-a-glance display. The same mapping is used for every score in the system (per-tool TDQS, server description quality, coherence, overall):
const computeQualityTier = (score: number): string => {
if (score >= 3.5) return 'A';
if (score >= 3.0) return 'B';
if (score >= 2.0) return 'C';
if (score >= 1.0) return 'D';
return 'F';
};
| Tier | Range | Meaning |
|---|---|---|
| A | ≥ 3.5 | genuinely helpful definition |
| B | ≥ 3.0 | adequate, the passing bar |
| C | ≥ 2.0 | clear gaps |
| D | ≥ 1.0 | severely deficient |
| F | < 1.0 | reserved guard tier |
B and above is considered passing. Because every dimension bottoms out at 1, composite scores bottom out at 1.0; tier D is the floor in practice and F exists as a guard.
| Kind | Values | Source |
|---|---|---|
| Flags | No Description, Tautological Description, Annotation Contradiction | hard gates + LLM contradiction check |
| Smells | any dimension key scoring < 3 | derived from scores |
Flags mark categorical defects; smells mark below-viable dimensions. Both are stored and displayed.
Individually well-described tools can still compose into a confusing server. TDQS therefore rolls up to the server in two components, combined into an overall score.
Aggregated from the per-tool TDQS values of the server's current tool set:
descriptionQualityScore = round1(0.6 × mean(TDQS) + 0.4 × min(TDQS))
The 40% weight on the minimum is deliberate: an agent sees all of a server's tools at once, so a single garbage definition degrades selection across the whole set. Averages hide that; the min term makes one bad tool pull the score down visibly.
The aggregate is only computed once at least 80% of the server's tools have been scored. Partial coverage would otherwise misrepresent the server.
A second, separate LLM evaluation judges the tool set as a whole: server name, tool count, and every tool name + description in one prompt (verbatim prompt in Appendix B). Four dimensions, 1–5 each, equally weighted:
| Dimension | Question | Anchor points |
|---|---|---|
| Disambiguation | Can an agent tell the tools apart? | 5 = every tool clearly distinct · 1 = several tools appear to do the same thing |
| Naming Consistency | Do names follow a predictable pattern? | 5 = consistent verb_noun throughout · 1 = chaotic. The convention itself matters less than consistency: all-camelCase is fine; mixing styles is not |
| Tool Count Appropriateness | Is the surface well-scoped? | 5 = each tool earns its place (typically 3–15) · 3 = borderline (1–2 feels thin, 16–25 feels heavy) · 1 = extreme mismatch (50+, or a single trivial tool) |
| Completeness | Are there gaps in the surface? | 5 = full CRUD/lifecycle coverage, no dead ends · 3 = notable missing operations (create+get but no update/delete) · 1 = severely incomplete |
coherenceScore = round1((disambiguation + namingConsistency + toolCountAppropriateness + completeness) / 4)
overallScore = round1(0.7 × descriptionQualityScore + 0.3 × coherenceScore)
Both components, all per-dimension justifications, the coherence summary, and the tool-level stats (toolCount, scoredToolCount, meanTdqs, minTdqs) are stored and published together.
Per tool:
{
"scores": {
"purpose_clarity": 4,
"usage_guidelines": 2,
"behavioral_transparency": 2,
"parameter_semantics": 3,
"conciseness_structure": 4,
"contextual_completeness": 2
},
"justifications": {
"purpose_clarity": {
"score": 4,
"justification": "Clear verb+resource..."
},
"usage_guidelines": {
"score": 2,
"justification": "No when-to-use guidance..."
}
},
"tdqs": 2.9,
"tier": "C",
"smells": ["usage_guidelines", "behavioral_transparency", "contextual_completeness"],
"flags": [],
"summary": "Clear about what it updates, silent about when to use it and what side effects to expect.",
"contextSignals": {
"paramCount": 4,
"requiredParamCount": 1,
"paramsWithDescriptions": 2,
"paramsWithEnums": 0,
"schemaDescriptionCoverage": 50,
"hasNestedObjects": false,
"hasOutputSchema": false,
"hasAnnotations": false,
"annotationValues": {
"readOnly": null,
"destructive": null,
"idempotent": null,
"openWorld": null
},
"titleIsMeaningful": false,
"inputHash": "9f2c4a1b8e3d5f70"
}
}
Per server:
{
"toolCount": 12,
"scoredToolCount": 12,
"meanTdqs": 3.4,
"minTdqs": 2.1,
"descriptionQualityScore": 2.9,
"descriptionQualityTier": "C",
"disambiguation": 4,
"namingConsistency": 5,
"toolCountAppropriateness": 5,
"completeness": 4,
"coherenceScore": 4.5,
"coherenceTier": "A",
"overallScore": 3.4,
"overallTier": "B",
"coherenceJustifications": { "...": "..." },
"coherenceSummary": "..."
}
The rubric doubles as a checklist. The highest-leverage fixes, in weight order:
search_calls_extensive instead.").readOnlyHint, destructiveHint, idempotentHint, openWorldHint): they lower the disclosure burden on your description. Then spend the description on what annotations cannot express: what exactly gets destroyed, auth prerequisites, rate limits.description fields and enums raise your schema coverage, which raises your Parameter Semantics baseline.Never let the description contradict the annotations. Doing so is an automatic 1 on Behavioral Transparency and a public Annotation Contradiction flag.
Figures as of June 2026.
TDQS has scored 228,369 tools across 15,036 MCP servers in the registry (99% coverage). Counting superseded versions retained from earlier captures, 362,226 tool definitions have been scored in total. The aggregate both validates the framework's weighting and corroborates the literature that motivated it.
The tool-level figures that follow aggregate all 362,226 scored definitions, not just the 228,369 current tools. Scores skew capable: mean TDQS is 3.57 (median 3.6, p25 2.9, p75 4.2), and 73.5% clear the tier-B passing bar.
| Tier | Tools | Share |
|---|---|---|
| A (≥ 3.5) | 195,872 | 54.1% |
| B (≥ 3.0) | 70,186 | 19.4% |
| C (≥ 2.0) | 86,303 | 23.8% |
| D (≥ 1.0) | 9,865 | 2.7% |
But the composite hides where definitions fail. Broken out by dimension, the corpus is strong on what a tool does and weak on when to use it and what it does behind the scenes. This is precisely the split the weights are built to surface. The two heaviest non-purpose weights (20% each) sit on the two weakest dimensions:
| Dimension | Weight | Mean | Smell rate (< 3) |
|---|---|---|---|
| Purpose Clarity | 25% | 4.47 | 4.0% |
| Conciseness & Structure | 10% | 4.47 | 3.8% |
| Contextual Completeness | 10% | 3.23 | 32.8% |
| Parameter Semantics | 15% | 3.19 | 14.0% |
| Usage Guidelines | 20% | 3.03 | 44.5% |
| Behavioral Transparency | 20% | 2.90 | 46.1% |
Behavioral transparency averages below the minimum-viable 3, and 56.4% of tools carry at least one smell. This is the same systemic gap the motivating studies flagged (usage guidance and behavioral disclosure are where descriptions most often fall short), reached here independently, from a different rubric on a larger corpus.
The hard gates stay surgical: 1.26% of tools have no description, 0.55% contradict their annotations, 0.06% are tautological.
At the server level, the rollup's min-term earns its keep: a server's worst tool sits on average 0.49 below its mean TDQS, and 13.4% of servers have at least one tool a full point or more below their mean. A pure average would hide that selection damage. Mean overall server score is 3.56 (description quality 3.29, coherence 4.18); coherence is strongest on naming (4.57) and disambiguation (4.50), weakest on tool-count appropriateness (3.87).
How Glama operates the framework across the registry. None of this changes the scoring semantics, but it is what makes continuous scoring affordable:
inputHash has no matching score: never scored, or scored against a different definition. The scoring step recomputes the hash and exits early if it still matches, so a redundant enqueue costs a cheap no-op, never a wasted LLM call.inputHash (carried forward atomically). A rebuild that touches one tool re-scores one tool, not the whole set.inputHash; a changed definition no longer matches the recorded score and is re-scored on the next sweep, while unchanged tools are left alone.The system prompt, verbatim:
You evaluate MCP tool definitions. Score how well the definition helps an AI agent select and invoke the tool correctly.
You receive: the tool's name, title, description, input schema, annotations, context signals, and sibling tool names.
## Dimensions (1-5 each)
### 1. Purpose Clarity (25%)
Does the description state what the tool does?
5=specific verb+resource, distinguishes from siblings. 4=clear but no sibling differentiation. 3=vague purpose. 2=tautology (restates name/title). 1=missing/misleading.
### 2. Usage Guidelines (20%)
Does it say when to use this tool vs alternatives?
5=explicit when/when-not/alternatives. 4=clear context, no exclusions. 3=implied usage. 2=no guidance. 1=misleading.
### 3. Behavioral Transparency (20%)
Does the description disclose behavioral traits BEYOND what annotations already provide?
With annotations: bar is lower, credit for adding context (what gets destroyed, auth needs, rate limits). Without annotations: description carries full burden.
Score 1 if description CONTRADICTS annotations. Flag as "Annotation Contradiction".
### 4. Parameter Semantics (15%)
Does description add meaning beyond what the input schema provides?
If schema_description_coverage is high (>80%), baseline is 3 even with no param info in description.
If low (<50%), description must compensate. 0 params = baseline 4.
### 5. Conciseness & Structure (10%)
Is it appropriately sized and front-loaded? Every sentence should earn its place.
### 6. Contextual Completeness (10%)
Given complexity + schema/annotations/output_schema richness, is the description complete enough?
If output schema exists, description needn't explain return values.
## Rules
- Use the FULL 1-5 range. Most descriptions are mediocre. 4-5 is reserved for genuinely helpful ones.
- Score 3 = minimum viable. Adequate but with clear gaps.
- Score each dimension INDEPENDENTLY.
- Base scores on SPECIFIC EVIDENCE in the description text.
- The description's job is to add VALUE BEYOND structured fields (annotations, schema). No credit for repeating what's already in structured data.
## Calibration Examples
### HIGH – TDQS 4.3 (Tier A)
Name: get_calls | Annotations: { readOnlyHint: true, destructiveHint: false } | Params: 2, schema coverage 100% | No output schema
Description: "List ALL calls in date range – no user/workspace filtering. To filter by user/workspace, use search_calls_extensive instead."
Scores:
- purpose=5 (specific verb+resource+scope, distinguishes from sibling)
- guidelines=5 (explicit alternative named)
- transparency=3: Annotations already declare readOnlyHint=true and destructiveHint=false, so the agent knows this is a safe read operation. The description adds the date-range scoping constraint, which is useful context. However, it doesn't describe the return format or pagination behavior. With annotations covering the safety profile, a 3 is appropriate – the description adds some value but not rich behavioral context.
- params=3: Schema coverage is 100%, so the schema already documents both parameters. The description implies date-range filtering but doesn't add syntax or format details beyond what the schema provides. Baseline 3 is correct when schema does the heavy lifting.
- conciseness=5 (two sentences, zero waste)
- completeness=5 (complete for a filtered-list tool with good annotations)
### MID – TDQS 2.9 (Tier C)
Name: update_drive | Annotations: none | Params: 4, schema coverage 50% | No output schema
Description: "Update shared drive settings including name, color, and restrictions."
Scores:
- purpose=4 (clear verb+resource+fields)
- guidelines=2 (no when-to-use, no prerequisites)
- transparency=2: No annotations are provided, so the description carries the full burden of behavioral disclosure. "Update" implies mutation, but the description doesn't state whether this requires specific permissions, whether changes are reversible, what happens to existing settings not mentioned, or what the response looks like. For a mutation tool with zero annotation coverage, this is a significant gap.
- params=3: The description lists the updatable fields (name, color, restrictions), which maps to some of the 4 parameters. However, with only 50% schema description coverage, the other half of parameters are undocumented in both the schema and the description. The description adds marginal value over what's in the schema but doesn't fully compensate for the coverage gap.
- conciseness=4 (efficient single sentence)
- completeness=2 (mutation tool with no annotations, no output schema, incomplete params – should do more)
### LOW – TDQS 1.1 (Tier D)
Name: process | Annotations: none | Params: 3, schema coverage 0% | No output schema
Description: "Process"
Scores: purpose=1 (tautology), guidelines=1 (none), transparency=1 (none, no annotations), params=1 (3 undocumented params at 0% coverage), conciseness=2 (under-specification, not conciseness), completeness=1 (completely inadequate for a 3-param tool).
### CONTRADICTION
Name: create_record | Annotations: { readOnlyHint: true } | Description: "Creates a new record in the database"
→ transparency=1, annotation_contradiction=true. The description claims a write operation ("creates") while annotations declare readOnlyHint=true. This is a serious inconsistency.
## Output Format
Respond with JSON only:
{
"scores": {
"purpose_clarity": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"},
"usage_guidelines": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"},
"behavioral_transparency": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"},
"parameter_semantics": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"},
"conciseness_structure": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"},
"contextual_completeness": {"score": <1-5>, "justification": "<2-3 sentences citing evidence>"}
},
"annotation_contradiction": <true if description contradicts annotations, false otherwise>,
"summary": "<2-3 sentence assessment>"
}
The user message template ({placeholders} are substituted; schemas are serialized as pretty-printed, sanitized JSON):
TOOL NAME: {name}
TITLE: {title | "null"}
DESCRIPTION:
"{description}"
<input-schema>
{inputSchema JSON | "{}"}
</input-schema>
<annotations>
{annotations JSON | "None provided"}
</annotations>
CONTEXT SIGNALS:
- Parameter count: {paramCount}
- Required parameters: {requiredParamCount}
- Schema description coverage: {schemaDescriptionCoverage}%
- Parameters with enums: {paramsWithEnums}
- Has output schema: {hasOutputSchema}
- Has nested objects: {hasNestedObjects}
<sibling-tools>
{sibling tool names, one per line | "None"}
</sibling-tools>
Respond with JSON only.
The system prompt, verbatim:
You evaluate whether an MCP server's tools work well together as a set. Individual tools may have good descriptions, but the set can still be confusing, inconsistent, or incomplete.
You receive: the server name, all tool names with their descriptions, and the tool count.
## Dimensions (1-5 each, equally weighted)
### 1. Disambiguation
Can an agent tell the tools apart? Tools with overlapping purposes cause misselection.
5=every tool has a clearly distinct purpose, no ambiguity. 4=mostly distinct, one or two could be confused. 3=some overlap exists but descriptions help. 2=multiple tools have unclear boundaries. 1=several tools appear to do the same thing.
### 2. Naming Consistency
Do tool names follow a predictable pattern?
5=consistent verb_noun pattern throughout (e.g. list_issues, create_issue, delete_issue). 4=mostly consistent with minor deviations. 3=mixed conventions but still readable. 2=inconsistent (camelCase mixed with snake_case, different verb styles). 1=chaotic naming with no discernible pattern.
### 3. Tool Count Appropriateness
Is the number of tools appropriate for the server's purpose?
5=well-scoped, each tool earns its place (typically 3-15 tools). 4=slightly over or under but reasonable. 3=borderline (1-2 tools feels thin, 16-25 feels heavy). 2=too many (25+) or too few (1) for the apparent scope. 1=extreme mismatch (50+ tools, or a single trivial tool).
### 4. Completeness
Are there obvious gaps in the tool surface?
5=complete CRUD/lifecycle coverage for the domain, no dead ends. 4=minor gaps that agents can work around. 3=notable missing operations (e.g. create+get but no update/delete). 2=significant gaps that will cause agent failures. 1=severely incomplete surface for the stated purpose.
## Calibration
### HIGH – Coherence 4.8
Server: github-mcp | Tools: list_repos, get_repo, create_repo, search_code, list_issues, create_issue, update_issue, get_pull_request, create_pull_request, merge_pull_request
Scores: disambiguation=5 (each tool targets a distinct resource+action), naming=5 (consistent verb_noun), count=5 (10 tools, well-scoped), completeness=4 (PR review/comment missing but core workflows covered).
### LOW – Coherence 2.0
Server: utility-toolkit | Tools: process, run, execute, do_thing, helper, transform_data, processV2, handleRequest
Scores: disambiguation=1 (process/run/execute/do_thing are indistinguishable), naming=2 (mixed styles, vague verbs), count=3 (8 tools, reasonable count), completeness=2 (no clear domain, impossible to assess coverage).
## Rules
- Use the FULL 1-5 range. Most servers have mediocre coherence.
- Evaluate the TOOL SET, not individual descriptions. A server with great individual descriptions but overlapping tools should score low on disambiguation.
- For naming, the specific convention matters less than consistency. All camelCase is fine; mixing conventions is not.
- For completeness, infer the domain from tool names and descriptions, then assess if the surface covers it.
Respond with JSON only:
{
"scores": {
"disambiguation": {"score": <1-5>, "justification": "<2-3 sentences>"},
"naming_consistency": {"score": <1-5>, "justification": "<2-3 sentences>"},
"tool_count_appropriateness": {"score": <1-5>, "justification": "<2-3 sentences>"},
"completeness": {"score": <1-5>, "justification": "<2-3 sentences>"}
},
"summary": "<2-3 sentence overall assessment>"
}
The user message template:
SERVER NAME: {serverName}
TOOL COUNT: {toolCount}
<tools>
- {name}: {description | "(no description)"}
- ...
</tools>
Respond with JSON only.
readOnlyHint / destructiveHint / idempotentHint / openWorldHint semantics the framework grades against.MCP server integration for DaVinci Resolve Studio
Run Claude Code as an MCP server so any agent can delegate coding tasks to it
Browser automation using accessibility snapshots instead of screenshots
A Jetbrains IDE IntelliJ plugin aimed to provide coding agents the ability to leverage intelliJ's indexing of the codeba