AI Coworker Development Principles
Status: Foundational specification Created: 2026-03-31 Authors: Mark Bodman, Claude (Software Engineer) References: Diversity of Thought Framework, IT4IT v3.0.1, EP-BUILD-HANDOFF spec
Purpose
This document defines the architectural principles for developing AI Coworker agents within the Digital Product Factory. It is the governing specification for all future agent design, tool assignment, memory strategy, and multi-agent orchestration.
These principles are derived from production testing, industry framework research (Anthropic Agent SDK, OpenAI Agents SDK, LangGraph, CrewAI, AutoGen), and the platform’s Diversity of Thought framework.
Principle 1: Specialization Over Generalization
A specialist with 5 focused tools outperforms a generalist with 40.
Rule
Each AI Coworker agent should have access to no more than 10 tools relevant to its current task. When tool count exceeds 15, tool selection accuracy degrades significantly, regardless of model capability.
Implementation
Tools are tagged with the phases and contexts in which they are relevant. The platform filters the tool list before each agent invocation, presenting only the tools the agent needs for its current role.
type ToolDefinition = {
name: string;
buildPhases?: BuildPhaseTag[]; // Only available during these phases
// ... other fields
};
Evidence
- Haiku with 40+ tools entered repetition loops calling wrong tools
- Haiku with 5-9 phase-filtered tools correctly generated code and called sandbox tools
- Industry consensus (Azure, Redis, all major frameworks): 3-5 tools per specialist is optimal
Principle 2: Orchestrator-Worker Pattern
A coordinator routes work to specialists. Specialists do not route to each other.
Rule
Multi-step workflows use a hierarchical orchestrator-worker pattern. The build pipeline acts as the orchestrator, dispatching each phase to the appropriate specialist agent. Agents do not hand off directly to each other — the orchestrator mediates all transitions.
Implementation
Each build phase maps to a specialist agent:
| Phase | Agent Role | IT4IT Alignment | Model Tier |
|---|---|---|---|
| Ideate | Product Designer | §5.2.1 Conceptualize | Standard (Haiku) |
| Plan | Architect | §5.2.4 Define Architecture | Standard (Haiku) |
| Build | Software Engineer | §5.3.3 Design & Develop | Frontier (Sonnet) |
| Review | QA / Scrum Master | §5.3.5 Accept & Publish | Standard (Haiku) |
| Ship | Operations Engineer | §5.4 Deploy + §5.5 Release | Standard (Haiku) |
Rationale
- Simple phases (ideate, plan, review, ship) are deterministic workflows that smaller models handle well
- The build phase requires complex multi-step tool reasoning and code generation — it needs a stronger model
- This matches the industry pattern: cheap models for routing, expensive models for complex reasoning
- Token budget is 3-4x lower per call, enabling more iterations within rate limits
Principle 3: Structured Handoffs, Not Conversation History
Pass decisions and context, not transcripts.
Rule
When work transitions between agents (or between phases), the outgoing agent produces a structured handoff document. The incoming agent reads only this document — never the raw conversation history from the previous phase.
Implementation
interface PhaseHandoff {
fromPhase: BuildPhase;
toPhase: BuildPhase;
summary: string; // 2-3 sentences, plain language
evidence: Record<string, unknown>; // Phase-specific artifacts
openIssues: string[]; // What the next agent should know
userPreferences: string[]; // Decisions the user made
}
Rationale
- Raw conversation history wastes tokens on irrelevant context (ideate discussion during build phase)
- Structured handoffs capture what matters: decisions, evidence, and user intent
- Each agent starts with a clean context window focused on its task
- Token reduction: ~16K per call → ~4K per call (3-4x improvement)
Principle 4: Diversity of Thought in Agent Design
Different agents should think differently, not just have different tools.
Rule
Each agent’s system prompt defines three cognitive components from the Diversity of Thought framework:
| Component | What it defines | Example |
|---|---|---|
| Perspective | How the agent frames the problem | Software Engineer sees “code structure”; Ops Engineer sees “deployment safety” |
| Heuristics | Strategies for finding solutions | Engineer uses test-driven development; Ops uses rollback-first deployment |
| Interpretive Model | What “good” means | Engineer optimizes for correctness; Ops optimizes for availability |
Implementation
Agent system prompts explicitly declare their perspective, heuristics, and success criteria. This is not decorative — it determines which solutions the agent considers and which it misses.
When a complex problem requires multiple perspectives (a rugged landscape in Diversity of Thought terms), the orchestrator consults multiple specialists before deciding. The combined output exceeds what any single agent would produce.
Rationale
- A team of diverse “good enough” agents outperforms a single “best” agent on complex problems
- Different perspectives reveal different solution peaks
- The IT4IT value streams already define distinct roles with different optimization targets
- This prevents the failure mode where every agent gives the same generic answer
Principle 5: Selective Memory, Not Total Recall
Remember decisions and rationale. Re-derive details from source.
Rule
The vector database (Qdrant) stores salient context — decisions, user preferences, design rationale, and cross-conversation insights. It does not store raw conversation transcripts, code content, or data that can be derived from the codebase or git history.
What to Store
| Store | Example | Why |
|---|---|---|
| User decisions | “User chose in-memory state over database for this demo” | Informs future suggestions |
| Design rationale | “Complaints tracker uses client-side state because it’s a demo feature” | Prevents re-litigating decisions |
| Cross-conversation context | “The promoter image is JIT-built from the portal container” | Connects knowledge across sessions |
| Discovered constraints | “Anthropic subscription only gives Haiku access” | Prevents repeated failures |
| Quality patterns | “This user prefers Tailwind over CSS modules” | Personalizes agent behavior |
What NOT to Store
| Skip | Example | Why |
|---|---|---|
| Raw conversation | “User said: build it now…” | Ephemeral, bulky, low signal |
| Code content | “The complaints page contains…” | Read from sandbox or git |
| Build artifacts | Test output, diffs, logs | Stored in FeatureBuild record |
| Transient state | “Build is in plan phase” | Query the database |
Implementation
Each agent stores memories at natural decision points — not after every exchange. The memory is tagged with the agent role, build phase, and topic so retrieval is contextual.
Semantic recall uses the query context (current conversation + build phase) to retrieve the 5-8 most relevant memories. This is sufficient because memories are distilled to decisions and rationale, not raw detail.
Rationale
- Token efficiency: memories should be dense (high information per token)
- Retrieval quality: fewer, more relevant memories beat many marginally relevant ones
- The details are always available from primary sources (codebase, git, database)
- Memory serves as an index into knowledge, not a copy of it
Principle 6: Tools Must Be Self-Documenting
If the model can’t understand a tool from its schema, the schema is wrong.
Rule
Every tool definition includes:
- A description that explains what it does and when to use it
- Parameter descriptions with types, examples, and constraints
- Required parameters clearly marked
The build phase system prompt includes a tool usage guide that maps common tasks to specific tools with parameter examples.
Implementation
TOOL GUIDE:
- To create a new file: write_sandbox_file(path, content) — content is the FULL file
- To modify existing file: read first, then edit_sandbox_file(path, old_text, new_text)
- To run commands: run_sandbox_command(command)
Rationale
- Smaller models (Haiku) rely heavily on description quality for tool selection
- A model that sees
write_sandbox_filewithcontent: "The full file content to write"knows to pass the entire file - A model that sees only
content: stringmay omit it or pass a description instead - This is the difference between a tool call succeeding and entering a retry loop
Principle 7: Human-in-the-Loop at Phase Boundaries
The human approves transitions, not individual tool calls.
Rule
Human approval gates exist at phase boundaries (ideate → plan, plan → build, review → ship), not at individual tool calls within a phase. Within a phase, the agent operates autonomously using its scoped tools.
Exception: executionMode: "proposal" tools present a card for approval before executing side effects that affect production (deploying to production, registering products, modifying user data).
Implementation
- Phase transitions require the agent to save evidence and pass a quality gate
- Quality gates are deterministic checks (design review required, tests must pass)
- The human sees a summary and approves/rejects/requests changes
- Within a phase, the agent calls tools freely without per-call approval
Rationale
- Per-call approval breaks the agent’s reasoning flow and wastes the user’s time
- Phase-boundary approval gives the human meaningful decision points
- Proposal tools handle the exceptions where individual actions need approval
- This matches the IT4IT value stream gate model
Principle 8: Fail Fast, Explain Clearly
Stop on the first error. Don’t retry blindly. Tell the user what happened.
Rule
When a tool call fails, the agent should:
- Report the error in plain language
- Explain what it was trying to do
- Suggest what the user can do (if applicable)
- Stop — do not retry the same call with the same arguments
The agentic loop enforces a tool repetition limit (3-5 calls of the same tool). This is a safety net, not a feature — agents should not need it if they handle errors correctly.
Rationale
- Blind retries waste tokens and rate limit budget
- Users need to understand what happened to provide guidance
- The repetition limit exists because smaller models sometimes loop — but well-prompted agents with focused tool sets rarely trigger it
Principle 9: Responsible Capacity Utilization
Use paid AI capacity for governed value, not empty activity.
Rule
AI coworkers should treat available paid capacity as an operating asset. When authorized work is available, idle capacity is waste. When no useful, safe, evidence-producing work is available, the coworker should record or surface the blocker rather than spend tokens to appear busy.
Useful capacity work includes:
- reducing human cognitive load
- advancing approved backlog work
- producing durable work products
- running verification and capturing evidence
- reviewing stale specs, plans, PRs, or runtime state
- identifying capability gaps
- converting repeated work into proceduralization candidates
Implementation
Capacity use should be driven by Standing Orders, calendar/availability state, safe work queues, and existing authority controls. Coworkers may continue low-risk governed work when humans are unavailable, but must stop at approval boundaries for consequential actions.
Rationale
A salaried employee who does nothing while valuable work exists wastes organizational capacity. Fixed-price or subscription AI capacity has the same economic shape. The goal is not to burn tokens. The goal is to convert available capacity into reviewed work, evidence, learning, and platform improvement.
Application
These principles apply to:
- All new agent development in the Build Studio pipeline
- AI Coworker conversations across all platform pages
- Agent tool registration and schema design
- Memory and context management
- Multi-agent orchestration and handoff
- External coding agents working on DPF, including Codex and Claude, through the canonical project rulebook
When these principles conflict with expediency, the principles win. A well-structured agent that works reliably is worth more than a quick hack that fails unpredictably.