Agent Runtime Intelligence Layer

Cost savings: 69% (MT-Bench), 93% (GSM8K), 52% (MMLU), and 80% (TruthfulQA), while retaining 96% of GPT-5 quality.

Python • TypeScript • LangChain • OpenAI Agents • CrewAI • PydanticAI • Google ADK • n8n • Vercel AI • OpenClaw • Hermes Agent • 📖 Docs • 💡 Examples

The in-process intelligence layer for AI agents. Optimize cost, latency, quality, budget, compliance, and energy — inside the execution loop, not at the HTTP boundary.

cascadeflow works where external proxies can't: per-step model decisions based on agent state, per-tool-call budget gating, runtime stop/continue/escalate actions, and business KPI injection during agent loops. It accumulates insight from every model call, tool result, and quality score — the agent gets smarter the more it runs. Sub-5ms overhead. Works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, n8n, Vercel AI SDK, and Hermes Agent.

Update

Hermes Agent delegation cascading

CascadeFlow now provides a Hermes Agent integration for per-skill model cascading, task-complexity cascading, topic-aware subagent cascading, observe-mode rollout, and auditable decisions without taking over provider credentials, base URLs, fallback chains, or API modes.

pip install cascadeflow

npm install @cascadeflow/core

Why cascadeflow?

Proxy vs In-Process Harness

Dimension	External Proxy	cascadeflow Harness
Scope	HTTP request boundary	Inside agent execution loop
Dimensions	Cost only	Cost + quality + latency + budget + compliance + energy
Latency overhead	10-50ms network RTT	<5ms in-process
Business logic	None	KPI weights and targets
Enforcement	None (observe only)	stop, deny_tool, switch_model
Auditability	Request logs	Per-step decision traces

cascadeflow is a library and agent harness: a model cascading package that uses speculative execution to select the most suitable model for each query or tool call. It builds on research indicating that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. Queries that do need advanced reasoning are escalated to flagship models automatically.

Use Cases

Inside-the-Loop Control. Influence decisions at every agent step — model call, tool call, sub-agent handoff — where most cost, delay, and failure actually happen. External proxies only see request boundaries; cascadeflow sees decision boundaries.
Multi-Dimensional Optimization. Optimize across cost, latency, quality, budget, compliance/risk, and energy simultaneously — relevant to engineering, finance, security, operations, and sustainability stakeholders.
Business Logic Injection. Embed KPI weights and policy intent directly into agent behavior at runtime. Shift AI control from static prompt design to live business governance.
Runtime Enforcement. Directly steer outcomes with four actions: allow, switch_model, deny_tool, stop — based on current context and policy state. Closes the gap between analytics and execution.
Auditability & Transparency. Every runtime decision is traceable and attributable. Supports audit requirements, faster tuning cycles, and trust in regulated or high-stakes workflows.
Measurable Value. Prove impact with reproducible metrics on realistic agent workflows — better economics and latency while preserving quality thresholds.
Latency Advantage. Proxy-based optimization adds 40-60ms per call. In a 10-step agent loop, that is 400-600ms of avoidable overhead. cascadeflow runs in-process with sub-5ms overhead — critical for real-time UX, task throughput, and enterprise SLAs.
Framework & Provider Neutral. Works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, Vercel AI SDK, n8n, Hermes Agent, and custom frameworks. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and more.
Self-Improving Agent Intelligence. Because cascadeflow runs inside the agent loop, it accumulates deep insight into every model call, tool result, quality score, and routing decision over time. This enables cascadeflow to learn which models perform best for which tasks, adapt routing strategies, and continuously improve cost-quality tradeoffs — without manual tuning. The agent gets smarter the more it runs.
Edge & Local-Hosted AI. Handle most queries with local models (vLLM, Ollama) and escalate complex queries to cloud providers only when needed.
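
As a rough illustration of the four runtime actions listed under Runtime Enforcement, the sketch below shows how a per-step policy check might map budget and tool-call state to allow, switch_model, deny_tool, or stop. `StepContext`, `decide`, and the ordering of the checks are hypothetical choices made for this example, not cascadeflow's actual API.

```python
from dataclasses import dataclass


@dataclass
class StepContext:
    """Hypothetical snapshot of run state before a model or tool call."""
    spent: float          # dollars spent so far in this run
    budget: float         # dollar cap for the run
    tool_calls: int       # tool calls made so far
    max_tool_calls: int   # tool-call cap for the run
    is_tool_call: bool    # True if the next step is a tool call
    step_cost: float      # estimated cost of the next step


def decide(ctx: StepContext) -> str:
    """Return one of: allow, switch_model, deny_tool, stop."""
    remaining = ctx.budget - ctx.spent
    if remaining <= 0:
        return "stop"            # budget exhausted
    if ctx.is_tool_call and ctx.tool_calls >= ctx.max_tool_calls:
        return "deny_tool"       # tool-call cap reached
    if ctx.step_cost > remaining:
        return "switch_model"    # next step too expensive: try a cheaper model
    return "allow"


print(decide(StepContext(0.10, 0.50, 0, 10, False, 0.01)))
print(decide(StepContext(0.10, 0.50, 10, 10, True, 0.01)))
```

The first call has budget to spare and returns allow; the second has hit its tool-call cap and returns deny_tool.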

ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper

How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

Speculatively executes small, fast models first (optimistic execution, $0.15-0.30/1M tokens)
Validates response quality against configurable thresholds (completeness, confidence, correctness)
Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
Learns patterns to optimize future cascading decisions and domain-specific routing

Zero configuration. Works with YOUR existing models (>17 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.

Result: 40-85% cost reduction and 2-10x faster responses, with answers held to the configured quality thresholds.
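
The draft, validate, escalate steps above can be sketched in a few lines. This is a minimal illustration with stubbed models, not cascadeflow's implementation: `draft_model`, `flagship_model`, `validate`, and the 0.7 threshold are assumptions made up for the example.

```python
from typing import Callable, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def draft_model(query: str) -> Tuple[str, float]:
    """Stub for a small, cheap model: confident only on short queries."""
    confidence = 0.9 if len(query.split()) <= 6 else 0.4
    return f"draft answer to: {query}", confidence


def flagship_model(query: str) -> Tuple[str, float]:
    """Stub for a large, expensive model."""
    return f"flagship answer to: {query}", 0.95


def validate(answer: str, confidence: float, threshold: float = 0.7) -> bool:
    """Toy quality check: non-empty answer with confidence at or above threshold."""
    return bool(answer.strip()) and confidence >= threshold


def cascade(query: str, drafter: Model, verifier: Model) -> Tuple[str, str]:
    """Try the drafter first; escalate to the verifier only if validation fails."""
    answer, confidence = drafter(query)
    if validate(answer, confidence):
        return answer, "drafter"
    answer, _ = verifier(query)
    return answer, "verifier"


for q in ["What is 2+2?",
          "Design a fault-tolerant distributed job scheduler with retries"]:
    _, used = cascade(q, draft_model, flagship_model)
    print(f"{used}: {q}")
```

The short query stays on the drafter, while the longer one fails the toy confidence check and is escalated.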

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for >17 providers                  │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  │  • Vercel AI SDK (17+ additional providers)           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
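
To make the Domain Pipeline box more concrete, here is a hedged sketch of rule-based domain detection by keyword matching. The keyword lists, the `GENERAL` fallback, and `detect_domain` are illustrative assumptions; cascadeflow's real rules and domain set may differ.

```python
import re

# Hypothetical keyword rules; a real classifier would use far richer signals.
DOMAIN_KEYWORDS = {
    "CODE": {"python", "function", "bug", "compile", "typescript"},
    "MATH": {"integral", "derivative", "equation", "prove", "sum"},
    "DATA": {"csv", "dataframe", "sql", "dataset", "table"},
}


def detect_domain(query: str) -> str:
    """Return the domain whose keywords overlap the query the most."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    scores = {domain: len(tokens & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: fall back to a generic domain.
    return best if scores[best] > 0 else "GENERAL"


print(detect_domain("Fix the bug in this Python function"))    # CODE
print(detect_domain("Solve the equation and prove the sum"))   # MATH
print(detect_domain("What's the weather like?"))               # GENERAL
```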

Harness API

Three tiers of integration — zero-change observability to full policy control:

Tier 1: Zero-change observability

import cascadeflow
cascadeflow.init(mode="observe")
# All OpenAI/Anthropic SDK calls are now tracked. No code changes needed.

Tier 2: Scoped runs with budget

import cascadeflow

# Run inside an async function; `agent` is your existing agent object.
with cascadeflow.run(budget=0.50, max_tool_calls=10) as session:
    result = await agent.run("Analyze this dataset")
    print(session.summary())  # cost, latency, energy, steps, tool calls
    print(session.trace())    # full decision audit trail

Tier 3: Decorated agents with policy

@cascadeflow.agent(budget=0.20, compliance="gdpr", kpi_weights={"quality": 0.6, "cost": 0.3, "latency": 0.1})
async def my_agent(query: str):
    return await llm.complete(query)
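
One way to read `kpi_weights` is as a weighted score over candidate models. The sketch below is an assumption about how such weighting could work, not a description of cascadeflow internals; the candidate names and metric values are invented, with every metric normalized to 0-1 where higher is better.

```python
# Hypothetical per-model scores, each normalized to [0, 1], higher is better
# (so "cost" and "latency" here mean cheapness and speed).
CANDIDATES = {
    "small-model":    {"quality": 0.70, "cost": 0.95, "latency": 0.90},
    "flagship-model": {"quality": 0.95, "cost": 0.30, "latency": 0.50},
}


def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of the metrics named in `weights`."""
    return sum(weights[k] * metrics[k] for k in weights)


def pick_model(weights: dict) -> str:
    """Return the candidate with the highest weighted score."""
    return max(CANDIDATES, key=lambda name: weighted_score(CANDIDATES[name], weights))


# Balanced weights from the decorator example favor the small model here.
print(pick_model({"quality": 0.6, "cost": 0.3, "latency": 0.1}))
# Quality-only weights favor the flagship model.
print(pick_model({"quality": 1.0, "cost": 0.0, "latency": 0.0}))
```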

Quick Start

Python

pip install "cascadeflow[all]"

import asyncio

from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),               # Verifier model (~$5.62/1M tokens)
])

async def main():
    # Run query - automatically routes to optimal model
    result = await agent.run("What's the capital of France?")

    print(f"Answer: {result.content}")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")

asyncio.run(main())

💡 Optional: Use ML-based Semantic Quality Validation

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

Step 1: Install the optional ML package:

pip install "cascadeflow[semantic]"  # Adds semantic similarity via FastEmbed (~80MB model)

Step 2: Use semantic quality validation:

from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7     # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."

result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")

What you get:

🎯 Semantic similarity scoring (query ↔ response alignment)
🛡️ Optional toxicity detection
🔄 Automatic model download and caching
🚀 Fast inference (~100ms per check)

Full example: See semantic_quality_domain_detection.py
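
For intuition about what a similarity threshold does, the following toy uses bag-of-words cosine similarity in place of an embedding model. It is only a stand-in: the real checker relies on FastEmbed embeddings, and `bow_cosine` and `passes` are hypothetical helpers written for this example.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def passes(query: str, response: str, threshold: float = 0.5) -> bool:
    """Accept the response only if it is similar enough to the query."""
    return bow_cosine(query, response) >= threshold


print(passes("explain python decorators", "python decorators wrap functions"))  # True
print(passes("explain python decorators", "the weather is sunny"))              # False
```

Word overlap is a crude proxy; embeddings capture paraphrases that share no words, which is why the optional ML package exists.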

⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without verification, and GPT-5 is only called when needed (typically 20-30% of requests).

📖 Learn more: Python Documentation | Quickstart Guide | Providers Guide

TypeScript

npm install @cascadeflow/core

import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);

💡 Optional: ML-based Semantic Quality Validation

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

Step 1: Install the optional ML packages:

npm install @cascadeflow/ml @huggingface/transformers

Step 2: Enable semantic validation in your cascade:

import { CascadeAgent, SemanticQualityChecker } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,                    // Traditional confidence threshold
    requireMinimumTokens: 5,            // Minimum response length
    useSemanticValidation: true,        // Enable ML validation
    semanticThreshold: 0.5,             // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');

Step 3: Or use semantic validation directly:

import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );

  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}

What you get:

🎯 Query-response semantic alignment detection
🚫 Off-topic response filtering
📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
⚡ Fast CPU inference (~50-100ms with caching)
🔄 Request-scoped caching (50% latency reduction)
🌐 Works in Node.js, Browser, and Edge Functions

Example: semantic-quality.ts

📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples

🔄 Migration Example

Migrate in about 5 minutes from a direct provider integration to a cascade, gaining cost savings plus full cost control and transparency.

Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

import openai

# Using an expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("What's 2+2?")

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
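
The savings line follows directly from the before/after figures quoted in this section; a quick arithmetic check:

```python
before_cost, after_cost = 0.000113, 0.000007   # dollars per query, from the example
before_ms, after_ms = 850, 234                 # latency in milliseconds

saved = before_cost - after_cost
reduction_pct = 100 * saved / before_cost
speedup = before_ms / after_ms

print(f"Saved: ${saved:.6f} ({reduction_pct:.0f}% reduction), {speedup:.1f}x faster")
# Saved: $0.000106 (94% reduction), 3.6x faster
```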

📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization

Drop-In Gateway (Existing Apps)

If you already have an app using the OpenAI or Anthropic APIs and want the fastest integration, run the gateway and point your existing client at it:

python -m cascadeflow.server --mode auto --port 8084
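
A hedged sketch of what "point your existing client at it" can look like, using only the standard library. It assumes the gateway listens on localhost:8084 and exposes an OpenAI-style `/v1/chat/completions` route; that path is an assumption, so check the gateway documentation for the actual route and any auth requirements. The request is only built here; `send()` is a hypothetical helper that needs the server to be running.

```python
import json
import urllib.request

# Assumed route: OpenAI-style chat completions on the local gateway port.
GATEWAY_URL = "http://localhost:8084/v1/chat/completions"


def build_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (gateway must be running)."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


req = build_request("What's 2+2?")
print(req.get_method(), req.full_url)
```

With an SDK client, the equivalent change is usually just overriding the client's base URL so requests go to the gateway instead of the provider.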

n8n Integration

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

Installation

Open n8n
Go to Settings → Community Nodes
Search for: @cascadeflow/n8n-nodes-cascadeflow
Click Install

Two Nodes

Node	Type	Use case
CascadeFlow (Model)	Language Model sub-node	Drop-in for any Chain/LLM node
CascadeFlow Agent	Standalone agent (`main` in/out)	Tool calling, memory, multi-step reasoning

Quick Start (Model):

Add two AI Chat Model nodes (cheap drafter + powerful verifier)
Add CascadeFlow (Model) and connect both models
Connect to Basic LLM Chain or Chain node
Check Logs tab on the Chain node to see cascade decisions

Quick Start (Agent):

Add a Chat Trigger node
Add CascadeFlow Agent and connect it to the trigger
Connect Drafter, Verifier, optional Memory and Tools
Check the Agent Output tab for cascade metadata and trace

Result: 40-85% cost savings in your n8n workflows!

Features:

Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
Mix providers (e.g., Ollama drafter + GPT-4o verifier)
Agent node: tool calling, memory, per-tool routing, tool call validation
16-domain cascading for specialized model routing
Real-time flow visualization in Logs/Output tabs

🔌 Learn more: n8n Integration Guide | n8n Package

Hermes Agent Integration

Use CascadeFlow as an optional Hermes Agent delegation router for subagents. Hermes keeps provider credentials, base URLs, fallback chains, and API modes; CascadeFlow returns a structured routing decision before Hermes spawns a child agent.

This works as a released CascadeFlow module even before a native Hermes PR is accepted. Users can call the router from a local wrapper, local Hermes fork, or small hook script and keep Hermes' current provider configuration as the final source of truth.

from cascadeflow.integrations.hermes import (
    HermesDelegationRequest,
    HermesDelegationRouter,
)

router = HermesDelegationRouter.from_dict({
    "enabled": True,
    "mode": "observe",
    "routes": {
        "code": {
            "provider": "nous",
            "model": "nous/hermes-4.1",
            "reasoning_effort": "high",
        },
        "simple": {
            "provider": "openai",
            "model": "gpt-4.1-mini",
            "reasoning_effort": "low",
        },
    },
})

decision = router.route_delegation(HermesDelegationRequest(
    goal="Debug the failing unit test and propose a patch",
    toolsets=("terminal", "git"),
    loaded_skills=("python", "debugging"),
))
print(decision.to_dict())

What Hermes gets:

Per-skill model routing: coding, research, legal/finance, and lightweight utility skills can receive different model and reasoning profiles instead of inheriting one global default.
Task-complexity routing: simple delegated tasks can use cheaper/faster models, while hard debugging, architecture, research, or code-generation tasks route to stronger models.
Topic-aware subagent routing: subagents can route differently for code, research, data, creative, ops, medical, legal, finance, and other domains.
Better subagent economics: Hermes avoids paying flagship-model prices for simple worker tasks.
Better quality for hard tasks: difficult subagent work no longer has to inherit a weak or cheap default model.
Dry-run/observe mode: Hermes users can see what CascadeFlow would route without changing runtime behavior.
Auditability: routing decisions include reason, confidence, domain, complexity, and selected model.
Safer rollout: missing CascadeFlow, disabled config, low confidence, high-stakes gaps, or bad routing inputs fall back to Hermes' current behavior.
No credential rewrite: Hermes still owns provider credentials, base URLs, fallback chains, and API modes.
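
The observe-mode and fallback behavior described in this list could be wired into a local wrapper roughly as follows. The decision keys (`model`, `confidence`), the `min_confidence` cutoff, and `choose_model` are assumptions for illustration; inspect the output of `decision.to_dict()` for the real field names.

```python
from typing import Optional


def choose_model(decision: Optional[dict], default_model: str,
                 mode: str = "observe", min_confidence: float = 0.6) -> str:
    """Pick the model for a subagent, falling back to the existing default.

    Falls back when there is no decision, the mode is not "enforce",
    the decision names no model, or its confidence is below the cutoff.
    """
    if not decision or mode != "enforce":
        return default_model      # observe mode: record the decision, never override
    model = decision.get("model")
    confidence = decision.get("confidence", 0.0)
    if not model or confidence < min_confidence:
        return default_model      # low confidence or incomplete decision
    return model


routed = {"model": "nous/hermes-4.1", "confidence": 0.9}
print(choose_model(routed, "hermes-default", mode="observe"))
print(choose_model(routed, "hermes-default", mode="enforce"))
```

Keeping the default as the return value on every uncertain path means a missing or misbehaving router cannot change what Hermes would have done anyway.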

Learn more: Hermes Agent Integration Guide

Standalone example: examples/integrations/hermes_delegation_router.py

LangChain Integration

Use cascadeflow with LangChain for intelligent model cascading with full LCEL, streaming, and tools support!

Installation

TypeScript

npm install @cascadeflow/langchain @langchain/core @langchain/openai @langchain/anthropic

Python

pip install cascadeflow langchain-openai langchain-anthropic

Quick Start

TypeScript - Drop-in replacement for any LangChain chat model

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'nous/hermes-flash' }),      // $0.15/$0.60 per 1M tokens
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }),  // $3/$15 per 1M tokens
  qualityThreshold: 0.8, // 80% queries use drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());

Python - Drop-in replacement for any LangChain chat model

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="nous/hermes-flash"),      # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),  # $3/$15 per 1M tokens
    quality_threshold=0.8,  # 80% queries use drafter
)

# Use like any LangChain chat model
result = await cascade.ainvoke("Explain quantum computing")

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

# Or with LCEL chains
chain = prompt | cascade | StrOutputParser()

💡 Optional: Cost Tracking with Callbacks (Python)

Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:

from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

    print(f"Total cost: ${cb.total_cost:.6f}")
    print(f"Drafter cost: ${cb.drafter_cost:.6f}")
    print(f"Verifier cost: ${cb.verifier_cost:.6f}")
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Successful requests: {cb.successful_requests}")

Features:

🎯 Compatible with get_openai_callback() pattern
💰 Separate drafter/verifier cost tracking
📊 Token usage (including streaming)
🔄 Works with LangSmith tracing
⚡ Near-zero overhead

Full example: See langchain_cost_tracking.py

💡 Optional: Model Discovery & Analysis Helpers (TypeScript)

For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:

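import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair,
  withCascade,
} from '@cascadeflow/langchain';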

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'nous/hermes-flash' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,              // Only pairs with ≥50% savings
  requireSameProvider: false,  // Allow cross-provider cascades
});

// Validate specific pair
const validation = validateCascadePair(best.drafter, best.verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);

What you get:

🔍 Automatic discovery of optimal cascade pairs from YOUR models
💰 Estimated cost savings calculations
⚠️ Validation warnings for misconfigured pairs
📊 Model tier analysis (drafter vs verifier candidates)

Full example: See model-discovery.ts

Features:

✅ Full LCEL support (pipes, sequences, batch)
✅ Streaming with pre-routing
✅ Tool calling and structured output
✅ LangSmith cost tracking metadata
✅ Cost tracking callbacks (Python)
✅ Works with all LangChain features

🦜 Learn more: LangChain Integration Guide | TypeScript Package | Python Examples

Resources

Examples

Python Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup with OpenAI models	View
Preset Usage	Use built-in presets for quick setup	View
Tool Execution	Function calling and tool usage	View
Streaming Text	Stream responses from cascade agents	View
Cost Tracking	Track and analyze costs across queries	View
Agentic Multi-Agent	Multi-turn tool loops & agent-as-a-tool delegation	View
Multi-Step Cascade	Multi-step agent loops with tool calls	View

Harness & Enforcement - Budget, compliance, and agent governance

Example	Description	Link
Budget Enforcement	Budget caps with stop actions in enforce mode	View
User Budget Tracking	Per-user budget enforcement and tracking	View
Guardrails	Safety and content guardrails	View
Rate Limiting	Rate limiting for cascades	View
User Profile Usage	User-specific routing and configurations	View
Stripe Integration	Billing integration with budget enforcement	View

Framework Integrations - Harness with LangChain, OpenAI Agents, CrewAI, PydanticAI, Google ADK, Hermes Agent

Example	Description	Link
LangChain Harness	cascadeflow harness with LangChain callback handler	View
OpenAI Agents Harness	cascadeflow harness with OpenAI Agents SDK	View
CrewAI Harness	cascadeflow harness with CrewAI hooks	View
PydanticAI Harness	cascadeflow cascade Model with PydanticAI agents	View
Google ADK Harness	cascadeflow harness with Google ADK plugin	View
LangChain Basic	Simple LangChain cascade setup	View
LangChain LCEL Pipeline	LCEL chains with cascade routing	View
LangGraph Multi-Agent	LangGraph multi-agent orchestration	View

Advanced Examples - Production, providers & customization

Example	Description	Link
Production Patterns	Best practices for production deployments	View
Multi-Provider	Mix multiple AI providers in one cascade	View
Reasoning Models	Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1)	View
Streaming Tools	Stream tool calls and responses	View
Batch Processing	Process multiple queries efficiently	View
FastAPI Integration	Integrate cascades with FastAPI	View
Edge Device	Run cascades on edge devices with local models	View
vLLM Example	Use vLLM for local model deployment	View
Multi-Instance Ollama	Run draft/verifier on separate Ollama instances	View
Custom Cascade	Build custom cascade strategies	View
Custom Validation	Implement custom quality validators	View
Semantic Quality Detection	ML-based domain and quality detection	View
Cost Forecasting	Forecast costs and detect anomalies	View

TypeScript Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup (Node.js)	View
Tool Calling	Function calling with tools (Node.js)	View
Multi-Provider	Mix providers in TypeScript (Node.js)	View
Reasoning Models	Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1)	View
Cost Tracking	Track and analyze costs across queries	View
Semantic Quality	ML-based semantic validation with embeddings	View
Streaming	Stream responses in TypeScript	View
Tool Execution	Tool execution engine and result handling	View
Streaming Tools	Stream tool calls with event detection	View
Agentic Multi-Agent	Multi-turn tool loops & multi-agent orchestration	View

Advanced Examples - Production, edge & LangChain

Example	Description	Link
Production Patterns	Production best practices (Node.js)	View
Multi-Instance Ollama	Run draft/verifier on separate Ollama instances	View
Multi-Instance vLLM	Run draft/verifier on separate vLLM instances	View
Browser/Edge	Vercel Edge runtime example	View
LangChain Basic	Simple LangChain cascade setup	View
LangChain Cross-Provider	Haiku → GPT-5 with PreRouter	View
LangChain LangSmith	Cost tracking with LangSmith	View
LangChain Cost Tracking	Compare cascadeflow vs LangSmith cost tracking	View
LangGraph Multi-Agent	LangGraph multi-agent orchestration	View
LangChain Tool Risk Gating	Tool routing based on risk and complexity	View

📂 View All Python Examples → | View All TypeScript Examples →

Documentation

Getting Started - Core concepts and basics

Guide	Description	Link
Quickstart	Get started with cascadeflow in 5 minutes	Read
Providers Guide	Configure and use different AI providers	Read
Presets Guide	Using and creating custom presets	Read
Streaming Guide	Stream responses from cascade agents	Read
Tools Guide	Function calling and tool usage	Read
Cost Tracking	Track and analyze API costs	Read
Agentic Patterns	Tool loops, multi-agent, agent-as-a-tool delegation	Read
Agent Harness	Budget, compliance, KPI, and energy controls	Read
Rollout Guide	Plan your production rollout	Read

Advanced Topics - Production, customization & integrations

Guide	Description	Link
Production Guide	Best practices for production deployments	Read
Enterprise Networking	Proxy, TLS, and network configuration	Read
Customization	Custom cascade strategies and validators	Read
Observability	Telemetry, logging, and privacy controls	Read
LangChain Integration	Use cascadeflow with LangChain	Read
OpenAI Agents SDK	Use cascadeflow with OpenAI Agents	Read
CrewAI Integration	Use cascadeflow with CrewAI	Read
PydanticAI Integration	Cascade Model for PydanticAI agents	Read
Google ADK	Use cascadeflow with Google ADK	Read
Hermes Agent	Per-skill, complexity, and topic-aware subagent routing	Read
n8n Integration	Use cascadeflow in n8n workflows	Read
Vercel AI SDK	Middleware for Vercel AI SDK	Read

📚 View All Documentation →

Features

Feature	Benefit
🎯 Speculative Cascading	Tries cheap models first, escalates intelligently
💰 40-85% Cost Savings	Research-backed, proven in production
⚡ 2-10x Faster	Small models respond in <50ms vs 500-2000ms
⚡ Low Latency	Sub-2ms framework overhead, negligible performance impact
🔄 Mix Any Providers	OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) + LangChain integration
👤 User Profile System	Per-user budgets, tier-aware routing, enforcement callbacks
✅ Quality Validation	Automatic checks + semantic similarity (optional ML, ~80MB, CPU)
🎨 Cascading Policies	Domain-specific pipelines, multi-step validation strategies
🧠 Domain Understanding	15 domains auto-detected (code, medical, legal, finance, math, etc.), routes to specialists
🤖 Drafter/Validator Pattern	20-60% savings for agent/tool systems
🔧 Tool Calling Support	Universal format, works across all providers
📊 Cost Tracking	Built-in analytics + OpenTelemetry export (vendor-neutral)
🚀 3-Line Integration	Zero architecture changes needed
🔁 Agent Loops	Multi-turn tool execution with automatic tool call, result, re-prompt cycles
🧭 Hermes Agent Routing	Per-skill, task-complexity, and topic-aware subagent routing with observe-mode rollout
📋 Message & Tool Call Lists	Full conversation history with tool_calls and tool_call_id preservation across turns
🪝 Hooks & Callbacks	Telemetry callbacks, cost events, and streaming hooks for observability
🏭 Production Ready	Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection
💳 Budget Enforcement	Per-run and per-user budget caps with automatic stop actions when limits are exceeded
🔒 Compliance Gating	GDPR, HIPAA, PCI, and strict model allowlists — block non-compliant models before execution
📊 KPI-Weighted Routing	Inject business priorities (quality, cost, latency, energy) as weights into every model decision
🌱 Energy Tracking	Deterministic compute-intensity coefficients for carbon-aware AI operations
🔍 Decision Traces	Full per-step audit trail: action, reason, model, cost, budget state, enforcement status
⚙️ Harness Modes	off / observe / enforce — roll out safely with observe, then switch to enforce when ready

License

MIT. See the LICENSE file.

Free for commercial use. Attribution appreciated but not required.

Contributing

We ❤️ contributions!

📝 Contributing Guide - Python & TypeScript development setup

Recently Shipped

✅ Agent Loops & Multi-Agent - Multi-turn tool execution, agent-as-a-tool delegation, LangGraph orchestration
✅ Tool Execution Engine - Automatic tool call routing, parallel execution, risk gating
✅ Hooks & Callbacks - Telemetry callbacks, cost events, streaming hooks for observability
✅ Vercel AI SDK Integration - 17+ additional providers with automatic provider detection
✅ OpenClaw Provider - Custom provider for OpenClaw deployments
✅ Gateway Server - Drop-in OpenAI/Anthropic-compatible proxy endpoint
✅ User Tier Management - Cost controls and limits per user tier with advanced routing
✅ Semantic Quality Validators - Lightweight local quality scoring via FastEmbed
✅ Code Complexity Detection - Dynamic cascading based on task complexity analysis
✅ Domain Aware Cascading - ML-based semantic domain detection with per-domain routing

Support

📖 GitHub Discussions - Searchable Q&A
🐛 GitHub Issues - Bug reports & feature requests
📧 Email Support - Direct support

Citation

If you use cascadeflow in your research or project, please cite:

hljs language-bibtex

@software{cascadeflow2025,
  author = {{Lemony Inc.} and Sascha Buehrle and {Contributors}},
  title = {cascadeflow: Agent runtime intelligence layer for AI agent workflows},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}

Ready to cut your AI costs by 40-85%?

hljs language-bash

pip install cascadeflow

hljs language-bash

npm install @cascadeflow/core

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions

About

Built with ❤️ by Lemony Inc. and the cascadeflow Community

One cascade. Hundreds of specialists.

New York | Zurich

⭐ Star us on GitHub if cascadeflow helps you save money!

Use Cases

Inside-the-Loop Control. Influence decisions at every agent step — model call, tool call, sub-agent handoff — where most cost, delay, and failure actually happen. External proxies only see request boundaries; cascadeflow sees decision boundaries.
Multi-Dimensional Optimization. Optimize across cost, latency, quality, budget, compliance/risk, and energy simultaneously — relevant to engineering, finance, security, operations, and sustainability stakeholders.
Business Logic Injection. Embed KPI weights and policy intent directly into agent behavior at runtime. Shift AI control from static prompt design to live business governance.
Runtime Enforcement. Directly steer outcomes with four actions: allow, switch_model, deny_tool, stop — based on current context and policy state. Closes the gap between analytics and execution.
Auditability & Transparency. Every runtime decision is traceable and attributable. Supports audit requirements, faster tuning cycles, and trust in regulated or high-stakes workflows.
Measurable Value. Prove impact with reproducible metrics on realistic agent workflows — better economics and latency while preserving quality thresholds.
Latency Advantage. Proxy-based optimization adds a network round trip of roughly 10-50ms per call. In a 10-step agent loop, that is 100-500ms of avoidable overhead. cascadeflow runs in-process with sub-5ms overhead, which matters for real-time UX, task throughput, and enterprise SLAs.
Framework & Provider Neutral. Works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, Vercel AI SDK, n8n, Hermes Agent, and custom frameworks. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and more.
Self-Improving Agent Intelligence. Because cascadeflow runs inside the agent loop, it accumulates deep insight into every model call, tool result, quality score, and routing decision over time. This enables cascadeflow to learn which models perform best for which tasks, adapt routing strategies, and continuously improve cost-quality tradeoffs — without manual tuning. The agent gets smarter the more it runs.
Edge & Local-Hosted AI. Handle most queries with local models (vLLM, Ollama), automatically escalate complex queries to cloud providers only when needed.
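The runtime enforcement actions named above (allow, switch_model, deny_tool, stop) can be pictured as a small decision function evaluated before each agent step. The sketch below is illustrative only: `StepContext`, `decide`, and every field name are hypothetical and are not cascadeflow's API. Only the four action names come from the list above.

```python
from dataclasses import dataclass


@dataclass
class StepContext:
    """Hypothetical snapshot of run state taken before one agent step."""
    spent_usd: float              # cost accumulated so far in this run
    budget_usd: float             # hard cap for the run
    tool_calls: int               # tool calls made so far
    max_tool_calls: int           # cap on tool calls
    wants_tool: bool              # True if the next step is a tool call
    next_step_cost_usd: float     # estimated cost on the current model
    cheaper_step_cost_usd: float  # estimated cost on a cheaper model


def decide(ctx: StepContext) -> str:
    """Return one of: allow, switch_model, deny_tool, stop."""
    remaining = ctx.budget_usd - ctx.spent_usd
    if remaining <= 0:
        return "stop"                      # budget already exhausted
    if ctx.wants_tool and ctx.tool_calls >= ctx.max_tool_calls:
        return "deny_tool"                 # tool-call cap reached
    if ctx.next_step_cost_usd > remaining:
        # The current model no longer fits; try a cheaper one before stopping.
        if ctx.cheaper_step_cost_usd <= remaining:
            return "switch_model"
        return "stop"
    return "allow"
```

In observe mode a harness would only record the returned action. In enforce mode it would act on it.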

ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper

How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
Learns patterns to optimize future cascading decisions and domain specific routing
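The first three steps above can be sketched as a loop over models ordered from cheapest to most expensive. This is a minimal illustration, not cascadeflow's implementation: `cascade`, `small_model`, `large_model`, and `non_empty` are hypothetical stand-ins, and the validator is a toy length check in place of the completeness, confidence, and correctness thresholds.

```python
from typing import Callable, List, Tuple

# (model name, function that produces a response for a query)
Model = Tuple[str, Callable[[str], str]]


def cascade(
    query: str,
    models: List[Model],
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try models cheapest-first and return the first acceptable answer.

    If no answer passes validation, the last (strongest) model's answer
    is returned as-is.
    """
    if not models:
        raise ValueError("at least one model is required")
    name, response = "", ""
    for name, generate in models:
        response = generate(query)           # speculative execution
        if is_good_enough(query, response):  # quality validation
            return name, response            # accept the draft, no escalation
    return name, response                    # final escalation result


# Stand-in models for illustration only.
def small_model(query: str) -> str:
    return "4" if query == "What's 2+2?" else ""


def large_model(query: str) -> str:
    return f"(detailed answer to: {query})"


def non_empty(query: str, response: str) -> bool:
    return len(response.strip()) > 0


models = [("small", small_model), ("large", large_model)]
easy = cascade("What's 2+2?", models, non_empty)
hard = cascade("Prove the four colour theorem", models, non_empty)
```

Here `easy` is answered by the small model, while `hard` fails validation on the draft and escalates to the large one.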

Zero configuration. Works with YOUR existing models (>17 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models that cost 8-20x less than flagship models, with no escalation required.

Result: 40-85% cost reduction and 2-10x faster responses, with quality held to the configured validation thresholds.
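A back-of-the-envelope check of these figures, under simplifying assumptions that are not from the source: every query pays for the draft model, escalated queries additionally pay for the flagship model, and both models process the same number of tokens. The prices are taken from the ranges in the steps above. The function names are hypothetical.

```python
def blended_cost(draft_cost: float, flagship_cost: float, escalation_rate: float) -> float:
    """Expected cost per 1M tokens when every query is drafted first."""
    return draft_cost + escalation_rate * flagship_cost


def savings_vs_flagship_only(draft_cost: float, flagship_cost: float, escalation_rate: float) -> float:
    """Fraction saved compared with sending every query to the flagship model."""
    return 1.0 - blended_cost(draft_cost, flagship_cost, escalation_rate) / flagship_cost


# 65% of queries stay on a $0.30 draft model, 35% escalate to a $3.00 flagship.
conservative = savings_vs_flagship_only(0.30, 3.00, 0.35)

# 70% of queries stay on a $0.15 draft model, 30% escalate to a $3.00 flagship.
optimistic = savings_vs_flagship_only(0.15, 3.00, 0.30)

print(f"{conservative:.0%} to {optimistic:.0%} saved")
```

Under these assumptions the savings land at roughly 55% and 65%, inside the 40-85% range quoted above. Real savings depend on token counts, escalation rate, and provider pricing.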

hljs language-plaintext

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for >17 providers                  │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  │  • Vercel AI SDK (17+ additional providers)           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Harness API

Three tiers of integration — zero-change observability to full policy control:

Tier 1: Zero-change observability

hljs language-python

import cascadeflow
cascadeflow.init(mode="observe")
# All OpenAI/Anthropic SDK calls are now tracked. No code changes needed.

Tier 2: Scoped runs with budget

hljs language-python

import cascadeflow

# Run inside an async function; `agent` is your existing agent object.
with cascadeflow.run(budget=0.50, max_tool_calls=10) as session:
    result = await agent.run("Analyze this dataset")
    print(session.summary())  # cost, latency, energy, steps, tool calls
    print(session.trace())    # full decision audit trail

Tier 3: Decorated agents with policy

hljs language-python

import cascadeflow

# `llm` is your existing model client.
@cascadeflow.agent(budget=0.20, compliance="gdpr", kpi_weights={"quality": 0.6, "cost": 0.3, "latency": 0.1})
async def my_agent(query: str):
    return await llm.complete(query)
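To make the `kpi_weights` idea concrete, the sketch below ranks candidate models by a weighted sum of normalized metrics. It is only an illustration of weighted scoring in general: `weighted_score`, the candidate names, and all metric values are made up, and the source does not say how cascadeflow combines the weights internally.

```python
from typing import Dict


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metrics normalized to 0..1 (higher is better)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[k] * metrics[k] for k in weights) / total_weight


# Hypothetical normalized scores: cheap and fast models score high on cost and latency.
candidates = {
    "small-model": {"quality": 0.70, "cost": 0.95, "latency": 0.90},
    "large-model": {"quality": 0.95, "cost": 0.20, "latency": 0.50},
}


def pick(weights: Dict[str, float]) -> str:
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


balanced = pick({"quality": 0.6, "cost": 0.3, "latency": 0.1})
quality_first = pick({"quality": 0.9, "cost": 0.05, "latency": 0.05})
```

With the weights from the decorator above, the small model wins (0.795 against 0.68). Shifting almost all weight to quality flips the choice to the large model.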

Quick Start

Python

hljs language-bash

pip install "cascadeflow[all]"

hljs language-python

import asyncio

from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])


async def main():
    # Run query - automatically routes to optimal model
    result = await agent.run("What's the capital of France?")

    print(f"Answer: {result.content}")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")


# agent.run is a coroutine, so drive it from an event loop
asyncio.run(main())

💡 Optional: Use ML-based Semantic Quality Validation

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

Step 1: Install the optional ML package:

hljs language-bash

pip install "cascadeflow[semantic]"  # Adds semantic similarity via FastEmbed (~80MB model)

Step 2: Use semantic quality validation:

hljs language-python

from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7     # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."

result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")

What you get:

🎯 Semantic similarity scoring (query ↔ response alignment)
🛡️ Optional toxicity detection
🔄 Automatic model download and caching
🚀 Fast inference (~100ms per check)
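For intuition, a similarity threshold like `similarity_threshold=0.5` is typically applied to the cosine similarity between a query embedding and a response embedding. The sketch below assumes cosine similarity, which the source does not confirm, and uses tiny made-up vectors in place of real embeddings. `cosine_similarity` and `passes` are hypothetical helpers, not part of cascadeflow.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes(query_vec: Sequence[float], response_vec: Sequence[float], threshold: float = 0.5) -> bool:
    """Accept the response when it is similar enough to the query."""
    return cosine_similarity(query_vec, response_vec) >= threshold


# Made-up 3-dimensional "embeddings" for illustration only.
query = [1.0, 0.0, 1.0]
on_topic = [0.9, 0.1, 0.8]
off_topic = [0.0, 1.0, 0.0]

on_topic_ok = passes(query, on_topic)
off_topic_ok = passes(query, off_topic)
```

The on-topic vector points in nearly the same direction as the query and passes. The off-topic vector is orthogonal to it and is rejected.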

Full example: See semantic_quality_domain_detection.py

⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without verification, because GPT-5 is only called when escalation is needed (typically 20-30% of requests).

📖 Learn more: Python Documentation | Quickstart Guide | Providers Guide

TypeScript

hljs language-bash

npm install @cascadeflow/core

hljs language-tsx

import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);

💡 Optional: ML-based Semantic Quality Validation

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

Step 1: Install the optional ML packages:

hljs language-bash

npm install @cascadeflow/ml @huggingface/transformers

Step 2: Enable semantic validation in your cascade:

hljs language-tsx

import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,                    // Traditional confidence threshold
    requireMinimumTokens: 5,            // Minimum response length
    useSemanticValidation: true,        // Enable ML validation
    semanticThreshold: 0.5,             // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');

Step 3: Or use semantic validation directly:

hljs language-tsx

import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );

  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}

What you get:

🎯 Query-response semantic alignment detection
🚫 Off-topic response filtering
📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
⚡ Fast CPU inference (~50-100ms with caching)
🔄 Request-scoped caching (50% latency reduction)
🌐 Works in Node.js, Browser, and Edge Functions

Example: semantic-quality.ts

📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples

🔄 Migration Example

Migrate from a direct provider integration to cascadeflow in about five minutes and gain cost savings, cost control, and cost transparency.

Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

hljs language-python

import openai

# Using expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

hljs language-python

from cascadeflow import CascadeAgent, ModelConfig

# Cheap drafter first, gpt-4o only on escalation
agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("What's 2+2?")

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
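The summary line follows directly from the before and after figures quoted above. A short check, using only those four numbers:

```python
before_cost_usd, after_cost_usd = 0.000113, 0.000007
before_latency_ms, after_latency_ms = 850, 234

saved_usd = before_cost_usd - after_cost_usd
reduction_pct = saved_usd / before_cost_usd * 100
speedup = before_latency_ms / after_latency_ms

print(f"Saved: ${saved_usd:.6f} ({reduction_pct:.0f}% reduction), {speedup:.1f}x faster")
# prints "Saved: $0.000106 (94% reduction), 3.6x faster"
```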

📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization

Drop-In Gateway (Existing Apps)

If you already have an app using the OpenAI or Anthropic APIs and want the fastest integration, run the gateway and point your existing client at it:

hljs language-bash

python -m cascadeflow.server --mode auto --port 8084
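Pointing an existing client at the gateway might look like the sketch below, which uses only the Python standard library. Several details are assumptions rather than documented behavior: the `/v1/chat/completions` path follows the OpenAI convention, the `model` value is a placeholder, and no authentication header is sent. Only the port comes from the command above. `build_chat_request` and `send` are hypothetical helpers.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8084"  # port from the command above


def build_chat_request(prompt: str, model: str = "gpt-4o", base_url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway.

    The endpoint path is an assumption based on the OpenAI API convention.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_chat_request("What's 2+2?")
```

With the gateway running, `send(request)` would return the decoded response. If your app uses an official SDK, the equivalent change is usually just overriding the client's base URL.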

n8n Integration

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

Installation

Open n8n
Go to Settings → Community Nodes
Search for: @cascadeflow/n8n-nodes-cascadeflow
Click Install

Two Nodes

Node	Type	Use case
CascadeFlow (Model)	Language Model sub-node	Drop-in for any Chain/LLM node
CascadeFlow Agent	Standalone agent (`main` in/out)	Tool calling, memory, multi-step reasoning

Quick Start (Model):

Add two AI Chat Model nodes (cheap drafter + powerful verifier)
Add CascadeFlow (Model) and connect both models
Connect to Basic LLM Chain or Chain node
Check Logs tab on the Chain node to see cascade decisions

Quick Start (Agent):

Add a Chat Trigger node
Add CascadeFlow Agent and connect it to the trigger
Connect Drafter, Verifier, optional Memory and Tools
Check the Agent Output tab for cascade metadata and trace

Result: 40-85% cost savings in your n8n workflows!

Features:

Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
Mix providers (e.g., Ollama drafter + GPT-4o verifier)
Agent node: tool calling, memory, per-tool routing, tool call validation
16-domain cascading for specialized model routing
Real-time flow visualization in Logs/Output tabs

🔌 Learn more: n8n Integration Guide | n8n Package

Hermes Agent Integration

hljs language-python

from cascadeflow.integrations.hermes import (
    HermesDelegationRequest,
    HermesDelegationRouter,
)

router = HermesDelegationRouter.from_dict({
    "enabled": True,
    "mode": "observe",
    "routes": {
        "code": {
            "provider": "nous",
            "model": "nous/hermes-4.1",
            "reasoning_effort": "high",
        },
        "simple": {
            "provider": "openai",
            "model": "gpt-4.1-mini",
            "reasoning_effort": "low",
        },
    },
})

decision = router.route_delegation(HermesDelegationRequest(
    goal="Debug the failing unit test and propose a patch",
    toolsets=("terminal", "git"),
    loaded_skills=("python", "debugging"),
))
print(decision.to_dict())

What Hermes gets:

Per-skill model routing: coding, research, legal/finance, and lightweight utility skills can receive different model and reasoning profiles instead of inheriting one global default.
Task-complexity routing: simple delegated tasks can use cheaper/faster models, while hard debugging, architecture, research, or code-generation tasks route to stronger models.
Topic-aware subagent routing: subagents can route differently for code, research, data, creative, ops, medical, legal, finance, and other domains.
Better subagent economics: Hermes avoids paying flagship-model prices for simple worker tasks.
Better quality for hard tasks: difficult subagent work no longer has to inherit a weak or cheap default model.
Dry-run/observe mode: Hermes users can see what CascadeFlow would route without changing runtime behavior.
Auditability: routing decisions include reason, confidence, domain, complexity, and selected model.
Safer rollout: missing CascadeFlow, disabled config, low confidence, high-stakes gaps, or bad routing inputs fall back to Hermes' current behavior.
No credential rewrite: Hermes still owns provider credentials, base URLs, fallback chains, and API modes.
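The observe-mode and safe-fallback behavior described in the list above can be pictured as a guard around the routing suggestion. The sketch below is not the integration's code: `RouteSuggestion`, `choose_model`, the `"enforce"` mode name as applied to Hermes, and the 0.6 confidence floor are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RouteSuggestion:
    """Hypothetical routing suggestion for one delegated task."""
    model: str
    confidence: float  # 0..1


def choose_model(
    default_model: str,
    suggestion: Optional[RouteSuggestion],
    mode: str,
    min_confidence: float = 0.6,
) -> Tuple[str, str]:
    """Return (model to use, reason), always falling back to the default."""
    if suggestion is None:
        return default_model, "no_suggestion"      # router missing or disabled
    if mode != "enforce":
        # Observe mode: record what would have happened, change nothing.
        return default_model, f"observed:{suggestion.model}"
    if suggestion.confidence < min_confidence:
        return default_model, "low_confidence"
    return suggestion.model, "routed"


observed = choose_model("default-model", RouteSuggestion("strong-model", 0.9), mode="observe")
enforced = choose_model("default-model", RouteSuggestion("strong-model", 0.9), mode="enforce")
unsure = choose_model("default-model", RouteSuggestion("strong-model", 0.3), mode="enforce")
```

Every path except a confident suggestion in enforce mode returns the default model, which is what makes an observe-first rollout low risk.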

Learn more: Hermes Agent Integration Guide

Standalone example: examples/integrations/hermes_delegation_router.py

LangChain Integration

Use cascadeflow with LangChain for intelligent model cascading with full LCEL, streaming, and tools support!

Installation

TypeScript

hljs language-bash

npm install @cascadeflow/langchain @langchain/core @langchain/openai @langchain/anthropic

Python

hljs language-bash

pip install cascadeflow langchain-openai langchain-anthropic

Quick Start

TypeScript - Drop-in replacement for any LangChain chat model

hljs language-typescript

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'nous/hermes-flash' }),      // $0.15/$0.60 per 1M tokens
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }),  // $3/$15 per 1M tokens
  qualityThreshold: 0.8, // 80% queries use drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains
const prompt = ChatPromptTemplate.fromTemplate('Explain {topic} in simple terms');
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());

Python - Drop-in replacement for any LangChain chat model

hljs language-python

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="nous/hermes-flash"),      # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),  # $3/$15 per 1M tokens
    quality_threshold=0.8,  # 80% queries use drafter
)

# Use like any LangChain chat model
result = await cascade.ainvoke("Explain quantum computing")

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

# Or with LCEL chains
prompt = ChatPromptTemplate.from_template("Explain {topic} in simple terms")
chain = prompt | cascade | StrOutputParser()

💡 Optional: Cost Tracking with Callbacks (Python)

Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:

hljs language-python

from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

    print(f"Total cost: ${cb.total_cost:.6f}")
    print(f"Drafter cost: ${cb.drafter_cost:.6f}")
    print(f"Verifier cost: ${cb.verifier_cost:.6f}")
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Successful requests: {cb.successful_requests}")

Features:

🎯 Compatible with get_openai_callback() pattern
💰 Separate drafter/verifier cost tracking
📊 Token usage (including streaming)
🔄 Works with LangSmith tracing
⚡ Near-zero overhead

Full example: See langchain_cost_tracking.py

💡 Optional: Model Discovery & Analysis Helpers (TypeScript)

For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:

hljs language-typescript

import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair
} from '@cascadeflow/langchain';

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'nous/hermes-flash' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,              // Only pairs with ≥50% savings
  requireSameProvider: false,  // Allow cross-provider cascades
});

// Validate specific pair
const validation = validateCascadePair(best.drafter, best.verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);

What you get:

🔍 Automatic discovery of optimal cascade pairs from YOUR models
💰 Estimated cost savings calculations
⚠️ Validation warnings for misconfigured pairs
📊 Model tier analysis (drafter vs verifier candidates)
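The idea behind pair discovery can be sketched in a few lines: enumerate every ordered (drafter, verifier) pair, keep those where the drafter is cheaper, estimate savings, and filter by a minimum. Everything here is hypothetical, including the model names, the prices, the fixed 30% escalation rate, and the savings heuristic. It does not describe how `discoverCascadePairs` works internally.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical prices in USD per 1M tokens.
PRICES = {"small-a": 0.15, "small-b": 0.30, "mid": 1.25, "large": 3.00}


def estimated_savings_pct(drafter_price: float, verifier_price: float, escalation_rate: float = 0.3) -> float:
    """Savings versus always using the verifier, if the draft always runs first."""
    blended = drafter_price + escalation_rate * verifier_price
    return (1.0 - blended / verifier_price) * 100


def discover_pairs(prices: Dict[str, float], min_savings: float = 50.0) -> List[Tuple[str, str, float]]:
    """Return (drafter, verifier, savings %) tuples, best savings first."""
    pairs = []
    for drafter, verifier in permutations(prices, 2):
        if prices[drafter] >= prices[verifier]:
            continue  # a drafter must be cheaper than its verifier
        savings = estimated_savings_pct(prices[drafter], prices[verifier])
        if savings >= min_savings:
            pairs.append((drafter, verifier, savings))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


pairs = discover_pairs(PRICES, min_savings=50.0)
best_drafter, best_verifier, best_savings = pairs[0]
```

With these made-up prices the top pair is `small-a` drafting for `large`. A real helper would also need to weigh model quality, not price alone.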

Full example: See model-discovery.ts

Features:

✅ Full LCEL support (pipes, sequences, batch)
✅ Streaming with pre-routing
✅ Tool calling and structured output
✅ LangSmith cost tracking metadata
✅ Cost tracking callbacks (Python)
✅ Works with all LangChain features

🦜 Learn more: LangChain Integration Guide | TypeScript Package | Python Examples

Resources

Examples

Python Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup with OpenAI models	View
Preset Usage	Use built-in presets for quick setup	View
Tool Execution	Function calling and tool usage	View
Streaming Text	Stream responses from cascade agents	View
Cost Tracking	Track and analyze costs across queries	View
Agentic Multi-Agent	Multi-turn tool loops & agent-as-a-tool delegation	View
Multi-Step Cascade	Multi-step agent loops with tool calls	View

Harness & Enforcement - Budget, compliance, and agent governance

Example	Description	Link
Budget Enforcement	Budget caps with stop actions in enforce mode	View
User Budget Tracking	Per-user budget enforcement and tracking	View
Guardrails	Safety and content guardrails	View
Rate Limiting	Rate limiting for cascades	View
User Profile Usage	User-specific routing and configurations	View
Stripe Integration	Billing integration with budget enforcement	View

Framework Integrations - Harness with LangChain, OpenAI Agents, CrewAI, PydanticAI, Google ADK, Hermes Agent

Example	Description	Link
LangChain Harness	cascadeflow harness with LangChain callback handler	View
OpenAI Agents Harness	cascadeflow harness with OpenAI Agents SDK	View
CrewAI Harness	cascadeflow harness with CrewAI hooks	View
PydanticAI Harness	cascadeflow cascade Model with PydanticAI agents	View
Google ADK Harness	cascadeflow harness with Google ADK plugin	View
LangChain Basic	Simple LangChain cascade setup	View
LangChain LCEL Pipeline	LCEL chains with cascade routing	View
LangGraph Multi-Agent	LangGraph multi-agent orchestration	View

Advanced Examples - Production, providers & customization

Example	Description	Link
Production Patterns	Best practices for production deployments	View
Multi-Provider	Mix multiple AI providers in one cascade	View
Reasoning Models	Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1)	View
Streaming Tools	Stream tool calls and responses	View
Batch Processing	Process multiple queries efficiently	View
FastAPI Integration	Integrate cascades with FastAPI	View
Edge Device	Run cascades on edge devices with local models	View
vLLM Example	Use vLLM for local model deployment	View
Multi-Instance Ollama	Run draft/verifier on separate Ollama instances	View
Custom Cascade	Build custom cascade strategies	View
Custom Validation	Implement custom quality validators	View
Semantic Quality Detection	ML-based domain and quality detection	View
Cost Forecasting	Forecast costs and detect anomalies	View

TypeScript Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup (Node.js)	View
Tool Calling	Function calling with tools (Node.js)	View
Multi-Provider	Mix providers in TypeScript (Node.js)	View
Reasoning Models	Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1)	View
Cost Tracking	Track and analyze costs across queries	View
Semantic Quality	ML-based semantic validation with embeddings	View
Streaming	Stream responses in TypeScript	View
Tool Execution	Tool execution engine and result handling	View
Streaming Tools	Stream tool calls with event detection	View
Agentic Multi-Agent	Multi-turn tool loops & multi-agent orchestration	View

Advanced Examples - Production, edge & LangChain

Example	Description	Link
Production Patterns	Production best practices (Node.js)	View
Multi-Instance Ollama	Run draft/verifier on separate Ollama instances	View
Multi-Instance vLLM	Run draft/verifier on separate vLLM instances	View
Browser/Edge	Vercel Edge runtime example	View
LangChain Basic	Simple LangChain cascade setup	View
LangChain Cross-Provider	Haiku → GPT-5 with PreRouter	View
LangChain LangSmith	Cost tracking with LangSmith	View
LangChain Cost Tracking	Compare cascadeflow vs LangSmith cost tracking	View
LangGraph Multi-Agent	LangGraph multi-agent orchestration	View
LangChain Tool Risk Gating	Tool routing based on risk and complexity	View

📂 View All Python Examples → | View All TypeScript Examples →

Documentation

Getting Started - Core concepts and basics

Guide	Description	Link
Quickstart	Get started with cascadeflow in 5 minutes	Read
Providers Guide	Configure and use different AI providers	Read
Presets Guide	Using and creating custom presets	Read
Streaming Guide	Stream responses from cascade agents	Read
Tools Guide	Function calling and tool usage	Read
Cost Tracking	Track and analyze API costs	Read
Agentic Patterns	Tool loops, multi-agent, agent-as-a-tool delegation	Read
Agent Harness	Budget, compliance, KPI, and energy controls	Read
Rollout Guide	Plan your production rollout	Read

Advanced Topics - Production, customization & integrations

Guide	Description	Link
Production Guide	Best practices for production deployments	Read
Enterprise Networking	Proxy, TLS, and network configuration	Read
Customization	Custom cascade strategies and validators	Read
Observability	Telemetry, logging, and privacy controls	Read
LangChain Integration	Use cascadeflow with LangChain	Read
OpenAI Agents SDK	Use cascadeflow with OpenAI Agents	Read
CrewAI Integration	Use cascadeflow with CrewAI	Read
PydanticAI Integration	Cascade Model for PydanticAI agents	Read
Google ADK	Use cascadeflow with Google ADK	Read
Hermes Agent	Per-skill, complexity, and topic-aware subagent routing	Read
n8n Integration	Use cascadeflow in n8n workflows	Read
Vercel AI SDK	Middleware for Vercel AI SDK	Read

📚 View All Documentation →

Features

Feature	Benefit
🎯 Speculative Cascading	Tries cheap models first, escalates intelligently
💰 40-85% Cost Savings	Research-backed, proven in production
⚡ 2-10x Faster	Small models respond in <50ms vs 500-2000ms
⚡ Low Latency	Sub-2ms framework overhead, negligible performance impact
🔄 Mix Any Providers	OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) + LangChain integration
👤 User Profile System	Per-user budgets, tier-aware routing, enforcement callbacks
✅ Quality Validation	Automatic checks + semantic similarity (optional ML, ~80MB, CPU)
🎨 Cascading Policies	Domain-specific pipelines, multi-step validation strategies
🧠 Domain Understanding	15 domains auto-detected (code, medical, legal, finance, math, etc.), routes to specialists
🤖 Drafter/Validator Pattern	20-60% savings for agent/tool systems
🔧 Tool Calling Support	Universal format, works across all providers
📊 Cost Tracking	Built-in analytics + OpenTelemetry export (vendor-neutral)
🚀 3-Line Integration	Zero architecture changes needed
🔁 Agent Loops	Multi-turn tool execution with automatic tool call, result, re-prompt cycles
🧭 Hermes Agent Routing	Per-skill, task-complexity, and topic-aware subagent routing with observe-mode rollout
📋 Message & Tool Call Lists	Full conversation history with tool_calls and tool_call_id preservation across turns
🪝 Hooks & Callbacks	Telemetry callbacks, cost events, and streaming hooks for observability
🏭 Production Ready	Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection
💳 Budget Enforcement	Per-run and per-user budget caps with automatic stop actions when limits are exceeded
🔒 Compliance Gating	GDPR, HIPAA, PCI, and strict model allowlists — block non-compliant models before execution
📊 KPI-Weighted Routing	Inject business priorities (quality, cost, latency, energy) as weights into every model decision
🌱 Energy Tracking	Deterministic compute-intensity coefficients for carbon-aware AI operations
🔍 Decision Traces	Full per-step audit trail: action, reason, model, cost, budget state, enforcement status
⚙️ Harness Modes	off / observe / enforce — roll out safely with observe, then switch to enforce when ready

License

Free for commercial use. Attribution appreciated but not required.

Contributing

We ❤️ contributions!

📝 Contributing Guide - Python & TypeScript development setup

Recently Shipped

✅ Agent Loops & Multi-Agent - Multi-turn tool execution, agent-as-a-tool delegation, LangGraph orchestration
✅ Tool Execution Engine - Automatic tool call routing, parallel execution, risk gating
✅ Hooks & Callbacks - Telemetry callbacks, cost events, streaming hooks for observability
✅ Vercel AI SDK Integration - 17+ additional providers with automatic provider detection
✅ OpenClaw Provider - Custom provider for OpenClaw deployments
✅ Gateway Server - Drop-in OpenAI/Anthropic-compatible proxy endpoint
✅ User Tier Management - Cost controls and limits per user tier with advanced routing
✅ Semantic Quality Validators - Lightweight local quality scoring via FastEmbed
✅ Code Complexity Detection - Dynamic cascading based on task complexity analysis
✅ Domain Aware Cascading - ML-based semantic domain detection with per-domain routing

Support

📖 GitHub Discussions - Searchable Q&A
🐛 GitHub Issues - Bug reports & feature requests
📧 Email Support - Direct support

Citation

If you use cascadeflow in your research or project, please cite:

@software{cascadeflow2025,
  author = {{Lemony Inc.} and Buehrle, Sascha and {Contributors}},
  title = {cascadeflow: Agent runtime intelligence layer for AI agent workflows},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}

Ready to cut your AI costs by 40-85%?

pip install cascadeflow

npm install @cascadeflow/core

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions

About

Built with ❤️ by Lemony Inc. and the cascadeflow Community

One cascade. Hundreds of specialists.

New York | Zurich

⭐ Star us on GitHub if cascadeflow helps you save money!
