What This Covers
AI cost governance is the layer that sits on top of model routing and turns AI spend into something you can see, attribute, and control. This page describes what an admin or operator actually sees in the platform today, the schema and telemetry behind those views, and what is still in progress.
Two adjacent pages are useful:
- Model Routing & Lifecycle — how models reach the routing pool and how a single call is selected. Cost governance assumes routing is doing its job; this page is about everything that happens after a call is made.
- Finance — AI Spend — the supplier and contract view of the same provider activity, owned by Finance. The cost-governance page is the AI-workforce-side view; AI Spend is the AP-side view. They share the same underlying events.
The full design lives in the spec at docs/superpowers/specs/2026-05-19-ai-cost-governance.md. The tracking epic is EP-COST-001.
The Three Cost Pools
The platform’s AI spend lives in three distinct pools with different unit economics. Operating any one of them in isolation produces decisions that worsen another, so the governance surface treats them as one model with three meters:
| Pool | What it is | How it bills | Primary lever |
|---|---|---|---|
| A: DPF internal API | Every inference call made through callProvider(): coworker turns, Build Studio agents, background jobs. Uses provider API keys. | Per token at market rates (Anthropic, OpenAI, OpenRouter, others). | Prompt caching, model tier ladder, context compaction. |
| B: Claude Code CLI | Every Claude Code session: Build Studio autonomous runs, maintenance sessions, conversation with the user. Shares one bucket with claude.ai. | Message/usage rate limit on the Anthropic subscription. No per-token billing. | System-prompt size, MCP servers attached, extended-thinking discipline. |
| C: Codex CLI | Every Codex CLI invocation for Build Studio sandbox execution and code generation. | Message-based credit system on a 5-hour rolling window (Plus/Pro), or per-token on Business/Enterprise. | AGENTS.md depth, attached MCP servers, model selection (GPT-5.4 vs 5.5). |
Pool A is the only pool that produces real dollar amounts inside the platform. Pools B and C produce rate-limit events; the platform records those events so the operator can correlate “the build stalled” with “the CLI pool was throttled.”
What An Operator Sees Today
Build Studio — per-phase cost breakdown
Every Build Studio run records cost on a per-phase basis. The Cost & Tokens card on the build detail view shows input tokens, output tokens, prompt-cache reads, and estimated USD for each of the five phases (Ideate, Plan, Build, Review, Ship). The same numbers are available via the MCP tool surface for external reporting.
The underlying schema:
| Table | Records |
|---|---|
| BuildPhaseRun | One row per phase per build. Fields: phaseId, buildRunId, inputTokens, outputTokens, cacheReadTokens, costUsd, agentIds, startedAt, finishedAt. |
| ToolExecution | One row per tool call. Now includes inputTokens, outputTokens, costUsd for tools that internally call callProvider(). |
| AdapterRunTelemetry | One row per provider call. Includes cacheCreationInputTokens and cachedInputTokens (Anthropic prompt-cache extraction). |
| AgentBudgetEvent | One row per budget pressure event: soft alert, hard pre-dispatch block, or CLI-pool rate-limit hit. |
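To make the per-phase shape concrete, here is a minimal sketch of rolling BuildPhaseRun rows up to a build-level total. The interface mirrors the field names in the table, but the field types, the `BuildCostTotals` shape, and the `rollupBuildCost` helper are assumptions for illustration, not the platform's actual schema or code.

```typescript
// Hypothetical shape mirroring the BuildPhaseRun fields listed in the table.
// Field types are assumptions; the real schema may differ.
interface BuildPhaseRun {
  phaseId: string;
  buildRunId: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  costUsd: number;
  agentIds: string[];
  startedAt: Date;
  finishedAt: Date | null;
}

// Hypothetical build-level totals.
interface BuildCostTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  costUsd: number;
}

// Sum the per-phase rows that belong to one build.
function rollupBuildCost(rows: BuildPhaseRun[], buildRunId: string): BuildCostTotals {
  const totals: BuildCostTotals = {
    inputTokens: 0,
    outputTokens: 0,
    cacheReadTokens: 0,
    costUsd: 0,
  };
  for (const row of rows) {
    if (row.buildRunId !== buildRunId) continue; // skip rows from other builds
    totals.inputTokens += row.inputTokens;
    totals.outputTokens += row.outputTokens;
    totals.cacheReadTokens += row.cacheReadTokens;
    totals.costUsd += row.costUsd;
  }
  return totals;
}
```

A production rollup would more likely be a grouped database query; the loop is only meant to show which fields add up to the numbers on the build detail view.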
Prompt-cache hit/miss telemetry
The platform extracts cache_creation_input_tokens and cache_read_input_tokens from every Anthropic API response and surfaces them on AdapterRunTelemetry. Two Prometheus counters expose the same data at /api/metrics:
- dpf_ai_cache_read_tokens_total: tokens served from cache at 10% of the normal rate
- dpf_ai_cache_creation_tokens_total: tokens written to cache at 1.25× the normal rate
If the read counter is non-zero for an Anthropic-backed agent, caching is working. If it stays at zero while the creation counter rises, the dynamic-context boundary is incorrectly placed and caching never hits — the operator’s signal to investigate the prompt assembler.
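The diagnostic rule above can be expressed as a small classifier. The sketch below is illustrative only: `diagnoseCache`, `effectiveInputTokens`, and the status labels are hypothetical names, and the inputs are assumed to be the two counter values for a single agent. The 0.10 and 1.25 multipliers are the rates quoted for the two counters.

```typescript
// Hypothetical status labels for the rule described above.
type CacheStatus = "working" | "never-hits" | "no-cache-activity";

// Classify one agent's cache behaviour from its two counter values.
function diagnoseCache(readTokens: number, creationTokens: number): CacheStatus {
  if (readTokens > 0) return "working"; // some tokens were served from cache
  if (creationTokens > 0) return "never-hits"; // cache is written but never read
  return "no-cache-activity"; // caching is not in use at all
}

// Input-side cost expressed in "normal-rate token equivalents":
// cache reads count at 10%, cache writes at 125%.
function effectiveInputTokens(
  uncachedTokens: number,
  readTokens: number,
  creationTokens: number,
): number {
  return uncachedTokens + readTokens * 0.1 + creationTokens * 1.25;
}
```

For example, a call with 1,000 uncached tokens, 10,000 cache-read tokens, and 800 cache-creation tokens costs about as much as 3,000 normal-rate input tokens, against 11,800 if nothing had been cached.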
Finance AP rollup
When the runtime records provider usage, the same data also flows into Finance as actual AP spend. The supplier link, contract posture, monthly commitment, and any open work items raised by usage evaluation live at /finance/spend/ai. The numbers there should match the per-agent and per-phase totals in the AI Workforce views — they are the same events viewed from a different role.
Operations Map — cost pressure overlays
The Operations Map at /platform/ai/operations overlays the three cost pools onto the route topology. A scheduled-window forecast shows when the CLI pool is expected to spike (from planned Build Studio runs); a quota-pressure indicator turns yellow when the bucket is depleting faster than the rolling window will refill it.
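One way to read "depleting faster than the rolling window will refill it" is as a comparison of two times: how long until the bucket is empty at the current burn rate, and how long until the window refills. The sketch below encodes that reading. It is an assumption about the indicator's logic, not the platform's implementation, and `quotaPressure` and its parameters are hypothetical.

```typescript
// Hypothetical indicator states; the text only names the yellow state.
type Pressure = "green" | "yellow";

// Yellow when, at the current burn rate, the bucket would run dry
// before the rolling window refills it.
function quotaPressure(
  remainingUnits: number, // messages or credits left in the window
  burnPerMinute: number, // recent consumption rate
  minutesUntilRefill: number, // time until the window frees capacity
): Pressure {
  if (burnPerMinute <= 0) return "green"; // nothing is being consumed
  const minutesUntilEmpty = remainingUnits / burnPerMinute;
  return minutesUntilEmpty < minutesUntilRefill ? "yellow" : "green";
}
```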
How Token Consumption Is Kept Bounded
Two compaction mechanisms and a tool-result trimmer ship today; all three run automatically and require no operator action under normal conditions.
Phase-boundary summarization
Before a Build Studio specialist hands off to the next phase, the orchestrator runs compactPhase(threadId, phaseId). The completed phase’s messages are summarized to a 200–300 token block by a routine-tier model, and the working context for the next specialist starts from that summary rather than the full transcript. The full transcript is preserved in the database for audit; only the working context is compacted.
With five phases, the final phase sees at most four prior summaries, so prior-phase history is capped at roughly 1,200 tokens (4 × 300) regardless of how much conversation happened inside any one phase.
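As a quick check on the 1,200-token figure, the sketch below computes the worst-case prior-phase history, assuming every summary hits the 300-token upper end of the range. The `PhaseSummary` shape and both helpers are hypothetical; only the 300-token ceiling comes from the text.

```typescript
// Upper end of the 200-300 token summary range described above.
const MAX_SUMMARY_TOKENS = 300;

// Hypothetical record of one compacted phase.
interface PhaseSummary {
  phaseId: string;
  tokens: number;
}

// Worst-case history carried into the next phase: one summary per completed phase.
function worstCaseHistoryTokens(
  completedPhases: number,
  maxSummaryTokens: number = MAX_SUMMARY_TOKENS,
): number {
  return completedPhases * maxSummaryTokens;
}

// Actual history size for a set of stored summaries.
function priorPhaseTokens(summaries: PhaseSummary[]): number {
  return summaries.reduce((sum, s) => sum + s.tokens, 0);
}
```

With four completed phases, `worstCaseHistoryTokens(4)` gives 1,200, which is the history the final phase starts from in the worst case.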
Rolling coworker thread compaction
Without compaction, coworker threads in the portal would accumulate context without bound. When the assembled message list exceeds 20 turns, the platform summarizes the oldest 10 into a single context-summary message using a routine-tier model. The trigger first fires at turn 21 and re-fires at turns 31, 41, and so on, so the working context stays bounded for the rest of the thread.
The summary message is stored on the Thread as a special-typed message so it is visible to auditors and excluded from compaction itself.
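The rolling behaviour can be sketched as a pure function over the message list. This is a simplified illustration under stated assumptions: the `ThreadMessage` shape, the "context-summary" role value, and the injected `summarize` callback (standing in for the routine-tier model call, which would be asynchronous in practice) are all hypothetical.

```typescript
// Hypothetical message shape; "context-summary" marks a compaction summary.
interface ThreadMessage {
  role: "user" | "assistant" | "context-summary";
  content: string;
}

const MAX_TURNS = 20; // compaction triggers when turns exceed this
const COMPACT_COUNT = 10; // number of oldest turns folded into one summary

// Returns the working context after at most one compaction pass.
// Existing summaries are never re-summarized.
function compactThread(
  messages: ThreadMessage[],
  summarize: (batch: ThreadMessage[]) => string,
): ThreadMessage[] {
  const summaries = messages.filter((m) => m.role === "context-summary");
  const turns = messages.filter((m) => m.role !== "context-summary");
  if (turns.length <= MAX_TURNS) return messages; // under the threshold

  const oldest = turns.slice(0, COMPACT_COUNT);
  const rest = turns.slice(COMPACT_COUNT);
  const summary: ThreadMessage = {
    role: "context-summary",
    content: summarize(oldest),
  };
  return [...summaries, summary, ...rest];
}
```

Because summaries are excluded from the turn count, a thread that compacts at turn 21 is left with 11 live turns and next compacts when it reaches 21 live turns again, which is turn 31 of the overall conversation.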
Tool result trimming
Many tool calls return large JSON payloads — backlog queries, wiki results, provider lists, code-graph results. A trimming utility caps tool results at 2,000 tokens by default, configurable per tool in the registry, and logs trimmedTokens to ToolExecution so the cost saved is visible.
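A minimal sketch of such a trimmer follows. The token estimate of roughly four characters per token is a crude assumption made only to keep the example self-contained; the platform's utility presumably uses a real tokenizer. `trimToolResult`, `estimateTokens`, and `TrimResult` are hypothetical names; only the 2,000-token default and the trimmedTokens field come from the text.

```typescript
// Crude heuristic, assumed for illustration: about 4 characters per token.
const CHARS_PER_TOKEN = 4;
const DEFAULT_MAX_TOKENS = 2000; // default cap described above

interface TrimResult {
  text: string; // payload passed on to the model
  trimmedTokens: number; // estimated tokens removed, for logging
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Cap a tool result at maxTokens (estimated) and report how much was cut.
function trimToolResult(text: string, maxTokens: number = DEFAULT_MAX_TOKENS): TrimResult {
  const tokens = estimateTokens(text);
  if (tokens <= maxTokens) return { text, trimmedTokens: 0 };
  const kept = text.slice(0, maxTokens * CHARS_PER_TOKEN);
  return { text: kept, trimmedTokens: tokens - maxTokens };
}
```

Cutting a JSON payload at a character offset leaves it syntactically invalid, so a real trimmer would more likely drop whole array items or fields; the sketch only shows the cap and the trimmedTokens accounting.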
The Model Tier Ladder
The cost-tier vocabulary decouples agent definitions from specific model IDs. Each agent has a costTier in its registry profile; the routing layer resolves that tier to a concrete model through ModelTierPolicy.
| Tier | Anthropic | OpenAI | Use cases |
|---|---|---|---|
| critical | claude-opus-4-6 | gpt-5.5 | Creative ideation, architecture decisions, final review, root-cause analysis requiring broad reasoning. |
| standard | claude-sonnet-4-6 | gpt-5.4 | Most Build Studio specialist work: implement, test, review. |
| routine | claude-haiku-4-5 | gpt-5.4-mini | Tool dispatch, format transforms, status checks, routing decisions, simple confirmations, structured extraction. |
When a new model in any tier ships, one row in ModelTierPolicy updates every agent on that tier — agent profiles do not need to be touched.
An agent can carry an optional model_id_override for tasks that genuinely require a pinned model (for example, an agent that uses extended thinking, which only runs on Sonnet+). The override is logged in AdapterRunTelemetry.overrideReason so drift is auditable.
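The tier-to-model resolution, including the override path, might look like the sketch below. The model IDs are taken from the ladder table; the in-memory `MODEL_TIER_POLICY` map, the `AgentProfile` shape, and `resolveModel` are assumptions for illustration, since the real ModelTierPolicy is a stored policy rather than a constant.

```typescript
type CostTier = "critical" | "standard" | "routine";
type Provider = "anthropic" | "openai";

// In-memory stand-in for ModelTierPolicy, using the model IDs from the ladder table.
const MODEL_TIER_POLICY: Record<CostTier, Record<Provider, string>> = {
  critical: { anthropic: "claude-opus-4-6", openai: "gpt-5.5" },
  standard: { anthropic: "claude-sonnet-4-6", openai: "gpt-5.4" },
  routine: { anthropic: "claude-haiku-4-5", openai: "gpt-5.4-mini" },
};

// Hypothetical slice of an agent's registry profile.
interface AgentProfile {
  costTier: CostTier;
  model_id_override?: string;
  overrideReason?: string;
}

interface ResolvedModel {
  modelId: string;
  overrideReason: string | null; // non-null only when an override was applied
}

// A pinned override wins; otherwise the tier policy decides.
function resolveModel(agent: AgentProfile, provider: Provider): ResolvedModel {
  if (agent.model_id_override) {
    return {
      modelId: agent.model_id_override,
      overrideReason: agent.overrideReason ?? "unspecified",
    };
  }
  return { modelId: MODEL_TIER_POLICY[agent.costTier][provider], overrideReason: null };
}
```

Swapping the model behind a tier is then a one-line change to the policy map, and every agent without an override picks it up on its next call.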
Budget Events and the CLI Pool Gate
The platform records two kinds of budget signals today:
- Soft alerts. A per-agent or per-build-phase soft threshold writes an AgentBudgetEvent with severity: "warn". The build continues; the operator sees the event in the Build Studio history strip.
- Pre-dispatch rate-limit checks. Before dispatching work to the Claude Code CLI or Codex CLI pool, the orchestrator inspects the rolling-window state. If the pool is depleted past the configured cushion, the dispatch is deferred and an AgentBudgetEvent with severity: "blocked" is written. The Build Studio queue parks the work until the window refills.
This is the boundary between “what shipped” and “what’s still in progress.” Hard budget enforcement at callProvider() — where a per-agent or per-phase USD ceiling refuses to dispatch — is the closing piece. Until it lands, the soft-alert + pre-dispatch-check pair is the operator’s visibility into budget pressure; the operator is still the enforcement point.
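The pre-dispatch check for the CLI pools can be sketched as a pure decision function. Everything here beyond the severity values and the AgentBudgetEvent name is an assumption: the event fields, the `PoolWindowState` shape, and the idea of measuring the cushion in the same units as the window limit are illustrative choices, not the platform's schema.

```typescript
type CliPool = "claude-code-cli" | "codex-cli";
type Severity = "warn" | "blocked";

// Hypothetical subset of AgentBudgetEvent fields.
interface AgentBudgetEvent {
  pool: CliPool;
  severity: Severity;
  reason: string;
}

// Hypothetical rolling-window snapshot for one CLI pool.
interface PoolWindowState {
  pool: CliPool;
  used: number; // units consumed in the current window
  limit: number; // units allowed per window
}

interface DispatchDecision {
  dispatch: boolean;
  event: AgentBudgetEvent | null; // set only when the dispatch is deferred
}

// Defer when the remaining capacity has fallen below the configured cushion.
function preDispatchCheck(state: PoolWindowState, cushion: number): DispatchDecision {
  const remaining = state.limit - state.used;
  if (remaining < cushion) {
    return {
      dispatch: false,
      event: {
        pool: state.pool,
        severity: "blocked",
        reason: `remaining ${remaining} is below cushion ${cushion}`,
      },
    };
  }
  return { dispatch: true, event: null };
}
```

The caller would persist the returned event and park the work item; the function itself has no side effects, which keeps the gate easy to test.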
What’s In Progress
Tracked under EP-COST-001:
- Hard budget enforcement at callProvider(): refuse to dispatch when a per-agent or per-phase USD ceiling is exceeded, with operator override surfaces.
- Profile-aware AGENTS.md subset for Codex sessions: reduce CLI credit consumption by feeding Codex a context-appropriate slice rather than the full nested AGENTS.md tree.
- CLI-pool scheduling: schedule intensive autonomous runs into off-peak windows so they do not starve the operator’s conversation budget.
- Local-LLM grading: let bundled local models grade local work to reduce dependence on cloud providers for evaluation traffic.
Related Routes
- /build/[id] → Cost & Tokens tab: per-phase breakdown for a specific build
- /platform/ai/operations: Operations Map with cost-pool overlays
- /platform/ai/authority → Tool Execution Log: per-tool token + cost attribution
- /finance/spend/ai: Finance-owned supplier and contract view of the same activity
- /api/metrics: Prometheus scrape endpoint exposing the cost and cache counters
Related Specs
- docs/superpowers/specs/2026-05-19-ai-cost-governance.md: the full EP-COST-001 design
- docs/superpowers/specs/2026-03-18-ai-routing-and-profiling-design.md: the routing foundation this layer sits on
- docs/founder-kernel/wiki/principles/responsible-capacity-utilization.md: the governing principle