Token Wiki

A reference on token optimization techniques used in production LLM applications. Based on analysis of four systems: Claude Code, Cline, Codex, and OpenCode.

Key Findings

Four independent teams converged on the same 5-layer architecture: modular prompt assembly, multi-layer output truncation, message normalization, threshold-based compaction, and cache coordination.
All four systems estimate tokens as length / 4, with no tokenizer. The estimate costs nothing to compute and is accurate enough for threshold decisions.
Tool results, not model output, consume ~45% of context in agentic systems. A single file read can cost 5-10k tokens.
Prompt caching reduces cost by ~87% ($0.06/turn vs $0.45/turn on a 150k-token conversation), but caches match on an exact prefix, so any mutation in it (timestamps, dynamic counters) invalidates everything cached from that point on; a mutation near the start invalidates the entire cache.
Models advertise 128k-token output limits, yet 99% of real responses use ~5k. Reserving the full limit wastes 123k tokens of usable context per request.
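
The length / 4 estimate and the output-reservation arithmetic from the findings above can be sketched in a few lines. This is a minimal illustration, not code from any of the four systems: the function names are hypothetical, and the 200k-token context window is an assumed value chosen only to make the numbers concrete.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the findings above


def estimate_tokens(text: str) -> int:
    """Approximate the token count as length / 4, rounded up. No tokenizer involved."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def usable_input_window(context_window: int, output_reserve: int) -> int:
    """Tokens left for input once the output reservation is set aside."""
    return max(context_window - output_reserve, 0)


# Assumption: a 200k-token context window (not stated in the source).
CONTEXT_WINDOW = 200_000

full_reserve = usable_input_window(CONTEXT_WINDOW, 128_000)  # advertised output limit
small_reserve = usable_input_window(CONTEXT_WINDOW, 5_000)   # typical response size
reclaimed = small_reserve - full_reserve                     # context gained back
```

Under these assumptions, reserving 5k instead of 128k tokens for output frees 123k tokens for input, which is the gap the last finding describes. A cap in the recommended 8-32k range gives up a little of that in exchange for headroom on long responses.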

Quickstart

QUICKSTART.md contains copy-paste prompts you can feed to any AI coding agent to audit your codebase for token waste and apply fixes. No background reading required.

Contents

Chapter	Topic
Context Compaction	Conversation compression — from free pruning to LLM summarization
Token Counting & Estimation	Fast heuristics, hybrid tracking, output budgeting
Prompt Caching	Cache-aware prompt design, breakpoint placement, invalidation detection
System Prompt Optimization	Modular assembly, variant architectures, stable-first ordering
Tool Output Management	Multi-layer truncation, head/tail preservation, pagination
Message Architecture	Normalization, tool pairing invariants, streaming data models
Context Window Management	Budget allocation, effective window calculation, degradation strategies
Multi-Agent Context	Subagent isolation, history forking, token budgets
Diagnostics & Observability	Token attribution, duplicate detection, cost tracking
Design Patterns	Cross-cutting principles and the 20 highest-impact optimizations

Architecture

Every system studied implements the same defense-in-depth model:

1. PREVENT    Tool-specific limits at the source
2. TRUNCATE   Generic caps catch what slips through
3. CACHE      Prompt caching cuts repeat costs by 10x
4. PRUNE      Clear stale results before expensive operations
5. COMPACT    LLM summarization as last resort
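
The first two layers can be sketched as a per-tool limit backed by a generic cap, with head/tail preservation when a result is too large. This is a simplified stand-in under stated assumptions: the tool names, the per-tool limits, and both function names are hypothetical, and a real system would more likely enforce tool limits inside each tool rather than after the fact.

```python
CHARS_PER_TOKEN = 4  # length / 4 heuristic

# Hypothetical per-tool limits in tokens (PREVENT layer). Names and values are illustrative.
TOOL_LIMITS = {"read_file": 8_000, "shell": 4_000}

# Generic cap in tokens (TRUNCATE layer), matching the 10k tool output budget.
GENERIC_CAP = 10_000


def truncate_head_tail(text: str, max_tokens: int) -> str:
    """Keep the start and end of the text and drop the middle.

    The omission marker is added on top of the budget, so the result is
    slightly longer than max_tokens * CHARS_PER_TOKEN characters.
    """
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    head = max_chars // 2
    tail = max_chars - head
    omitted = len(text) - max_chars
    marker = f"\n[... {omitted} characters omitted ...]\n"
    return text[:head] + marker + text[len(text) - tail:]


def limit_tool_output(tool_name: str, output: str) -> str:
    """Apply the tool-specific limit if one exists, never exceeding the generic cap."""
    limit = min(TOOL_LIMITS.get(tool_name, GENERIC_CAP), GENERIC_CAP)
    return truncate_head_tail(output, limit)
```

Keeping both ends is a deliberate choice here: command output tends to put errors and exit status at the end, while file reads tend to put imports and declarations at the start.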

Reference Numbers

Metric	Value
Bytes per token (heuristic)	4
Cache read vs. input cost	10x cheaper
Compaction trigger	85-93% of context window
Output token cap (recommended)	8-32k
Tool output budget	10k tokens per result
Post-compaction target	~50k tokens
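
As a rough sketch, the numbers in the table above could be wired into a small budget module. The function names are hypothetical, and the price of $3 per million input tokens is an assumption, chosen because it reproduces the $0.45/turn figure for a 150k-token conversation quoted in Key Findings. Cache-write pricing is ignored.

```python
COMPACTION_TRIGGER = 0.85        # lower end of the 85-93% range
POST_COMPACTION_TARGET = 50_000  # tokens to aim for after compaction
CACHE_READ_DISCOUNT = 10         # cache reads are ~10x cheaper than regular input

# Assumption: $3 per million input tokens (not stated in the source).
PRICE_PER_MTOK = 3.0


def should_compact(used_tokens: int, context_window: int,
                   trigger: float = COMPACTION_TRIGGER) -> bool:
    """True once usage crosses the trigger fraction of the context window."""
    return used_tokens >= trigger * context_window


def turn_input_cost(total_tokens: int, cached_tokens: int,
                    price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Input cost in dollars for one turn, billing cached tokens at the discounted rate."""
    uncached = total_tokens - cached_tokens
    billable = uncached + cached_tokens / CACHE_READ_DISCOUNT
    return billable * price_per_mtok / 1_000_000


no_cache = turn_input_cost(150_000, 0)          # every token billed at full price
all_cached = turn_input_cost(150_000, 150_000)  # every token read from cache
```

With these assumptions an uncached 150k-token turn costs $0.45 and a fully cached one costs $0.045. The $0.06 quoted in Key Findings sits slightly above that floor, which is what you would expect when each turn also carries some new, uncached tokens.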

Sources

Based on token management implementations in:

Claude Code — 3-layer compaction architecture with prompt cache coordination
Cline — Provider-agnostic token management with variant-based prompt optimization
Codex — Multi-layer tool truncation with fork-based subagent isolation
OpenCode — 5-pillar defense-in-depth approach with 20+ provider support



