12-Factor Agents
Production-Ready AI Systems
Great AI agents achieve production reliability through modular architecture,
observable execution loops, and systematic evaluation frameworks.
What You'll Learn
- ✓ Architecture patterns that separate reliable automation from expensive failures
- ✓ Observability infrastructure that transforms debugging from impossible to tractable
- ✓ Evaluation frameworks that enable data-driven iteration at 5× velocity
- ✓ Production deployment patterns from enterprises running 245M+ agent interactions
Introduction: The Production AI Crisis
TL;DR
- • 40% of AI agent projects fail to reach production—the gap isn't your LLM choice or prompt engineering, it's architectural
- • Great AI agents are engineered systems requiring proper architecture, observability infrastructure, and systematic evaluation—not magic
- • The 12-Factor methodology adapted for AI agents provides a proven framework for production reliability
Your AI agent demos perfectly. It books appointments, answers questions, and coordinates tasks like magic. Then you push to production and it falls apart—hallucinating data, making wrong API calls, or getting stuck in loops you can't debug.
You're not alone. Industry data suggests 40% of AI agent projects fail to reach production. The gap isn't your LLM choice or prompt engineering—it's architectural.
The False Promise of "Just Better Prompts"
Most teams approaching agent failures ask: "How do I write better prompts?" or "Should I try a different model?" These questions miss the fundamental issue. Great AI agents aren't LLMs with tools. They're engineered systems requiring proper architecture, observability infrastructure, and systematic evaluation.
"Production AI systems require systematic engineering: proper architecture, context optimization, robust evaluation, human oversight patterns, and obsessive focus on UX and observability—not just prompt engineering."— From "AI That Works" Podcast, synthesizing production learnings
Why Agents Fail in Production
Research and production experience reveal consistent patterns:
❌ Observability Blind Spots
Agent failures are impossible to debug without execution traces. Teams waste engineering time manually testing prompts to find failures, then risk breaking other functionality when making changes.
Impact: Weeks to months debugging production issues that should take minutes
❌ No Evaluation Framework
Manual testing doesn't scale. Non-determinism makes gut-feel unreliable. Agents work Monday, fail Friday, and teams have no systematic way to detect regressions.
Impact: Quality becomes invisible; teams can't tell if changes improve or degrade performance
❌ Architectural Over-Engineering
Teams build complex multi-agent architectures chasing sophistication, sacrificing maintainability and cost efficiency. Simple ReAct agents can match complex systems at 50% lower cost.
Impact: Higher costs, slower iteration, more failure modes, harder debugging
What Production Actually Looks Like
Before diving into solutions, let's ground ourselves in what successful production deployment actually achieves. Wells Fargo runs 600+ AI use cases across 245M+ agent interactions with 3-10× engagement increases; Rely Health cut debugging time by 100× and reduced doctors' follow-up times by 50%. We'll return to these deployments throughout the book.
Enter the 12-Factor Methodology
The original 12-Factor App methodology emerged from Heroku's experience building thousands of production applications. It codifies best practices for building software-as-a-service apps that:
- Use declarative formats for setup automation
- Have a clean contract with the underlying operating system
- Are suitable for deployment on modern cloud platforms
- Minimize divergence between development and production
- Can scale up without significant changes to tooling, architecture, or development practices
These principles are directly applicable to AI agent systems, with agent-specific adaptations addressing the unique challenges of:
Traditional Software
- • Deterministic execution
- • Clear input/output relationships
- • Stack traces for debugging
- • Static code analysis
AI Agent Systems
- • Non-deterministic reasoning
- • Complex tool orchestration
- • Context-dependent behavior
- • Dynamic decision-making
The 12 Factors for AI Agents
Here's the framework we'll explore in depth:
Foundation
- 1. Codebase: Version control for agent logic
- 2. Dependencies: Models, prompts, tools as dependencies
- 3. Config: Environment-based configuration
- 4. Backing Services: Vector DBs, model APIs, tools
Operations
- 6. Processes: Stateless execution
- 7. Port Binding: Agent APIs
- 8. Concurrency: Multi-agent orchestration
- 9. Disposability: Graceful shutdown & recovery
- 10. Dev/Prod Parity: Testing across environments
Measurement
- 5. Build, Release, Run: Evaluation & deployment pipelines
- 11. Logs: Observability & tracing
- 12. Admin Processes: Fine-tuning, evals, optimization
Three Core Insights
The framework resolves three fundamental contradictions in agent development:
1. Observable Autonomy: Measurement Enables Freedom
Traditional approaches sacrifice reliability for autonomy or autonomy for reliability. The resolution: agents can explore freely when every decision is traced, measured, and recoverable.
Example: Wells Fargo's 245M interactions achieved through privacy-first architecture with complete observability—autonomy bounded by systematic measurement.
2. Earned Complexity: Prove Value Through Evaluation
Simple ReAct agents can match complex multi-agent systems at 50% lower cost. The resolution: start simple, add complexity only when measured evaluation proves the value.
Example: Each architectural layer justifies itself through improvement in production metrics, not theoretical sophistication.
3. Quality Velocity: Automated Measurement Accelerates Iteration
Fast iteration without evaluation leads to quality regressions. The resolution: automated evaluation and continuous monitoring create a quality feedback loop that accelerates iteration.
Example: Rely Health's 100× debugging improvement through observability infrastructure—speed became a quality feature when measurement became continuous.
Who This Ebook Is For
This is written for engineering leaders and senior engineers building production AI systems—CTOs, VPs of Engineering, Staff Engineers, and AI/ML Engineers at startups to mid-market companies deploying AI agents.
You'll Get the Most Value If You:
- ✓ Have invested in agent development but struggle with reliability and production readiness
- ✓ Need frameworks for systematic architecture, not more prompt engineering tips
- ✓ Want evaluation strategies that make quality measurable, not just vibes
- ✓ Require production patterns with evidence from enterprises running agents at scale
- ✓ Value technical credibility backed by research, benchmarks, and case studies
What's Ahead
Each chapter explores one factor in depth, with:
- Research-backed explanations citing academic papers, industry standards, and production case studies
- Code examples showing concrete implementation patterns
- Architecture diagrams illustrating system design
- Expert commentary synthesized from "AI That Works" podcast episodes and industry practitioners
- Production patterns with specific guidance from Wells Fargo, Rely Health, and other deployments
- Anti-pattern alerts highlighting common failures to avoid
"Read AI code. Don't blindly trust—validate outputs. Separation of concerns: extract → polish for quality. Build for production: focus on accuracy, observability, error handling."— Key insights from "AI That Works" Episode #27: No Vibes Allowed
Let's begin with the foundation: treating agent logic as code requiring proper version control and deployment practices.
Ready to Build Production-Grade Agents?
The next 12 chapters will transform how you think about agent development—from prompt engineering to systematic engineering, from demos to production reliability, from guesswork to measurement.
Factor 1: Codebase
One codebase tracked in version control, many deploys
TL;DR
- • Agent logic—prompts, tool definitions, orchestration code—must be versioned alongside application code, not managed in spreadsheets or databases
- • One codebase with multiple deployments (dev, staging, production) prevents configuration drift and enables rollback
- • Treating prompts as code enables code review, diff tracking, and systematic testing—critical for production reliability
The Problem: Prompts in Production Chaos
Most teams start agent development with prompts in code as string literals, then graduate to storing them in databases for "easier editing," then fragment them across Google Docs, Notion pages, and production databases. This path leads to chaos:
❌ Anti-Pattern: Database-Stored Prompts
The pitch: "Let's store prompts in the database so non-engineers can edit them without deployments."
The reality: No version control, no code review, no rollback capability. Production prompts diverge from development. Changes are untested and untracked.
The failure mode: Someone "improves" a prompt on Friday afternoon. It breaks production. Nobody knows what changed or how to revert.
The 12-Factor principle is clear: one codebase, many deploys. For AI agents, this means:
- Agent orchestration code lives in version control (Git)
- Prompt templates live in version control (as code files or config files)
- Tool definitions live in version control (JSON Schema, Python decorators, etc.)
- Agent configurations live in version control (YAML, TOML, etc.)
- The same codebase deploys to dev, staging, and production with environment-specific config
What Belongs in the Codebase
Agent Codebase Inventory
Core Logic
- • Agent orchestration code (Python, TypeScript, etc.)
- • Prompt templates with variable interpolation
- • Tool/function definitions and schemas
- • Workflow graphs and state machines
- • Memory management strategies
Configuration & Tests
- • Agent configuration files (model selection, parameters)
- • Evaluation datasets and test cases
- • CI/CD pipeline definitions
- • Deployment manifests (Docker, K8s)
- • Documentation and architecture diagrams
Implementation Pattern: Prompts as Code
Here's how to structure prompt templates in your codebase:
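A minimal sketch of the pattern, assuming a prompts/ directory of plain-text templates and Python's built-in string.Template for variable interpolation (any templating engine works the same way):
from pathlib import Path
from string import Template
PROMPT_DIR = Path(__file__).parent / "prompts"   # templates live in Git next to the code
def load_prompt(name: str, **variables: str) -> str:
    """Load a version-controlled prompt template and interpolate variables."""
    template = Template((PROMPT_DIR / f"{name}.txt").read_text())
    return template.substitute(**variables)
# prompts/support_agent_system.txt might contain:
#   "You are a support agent for $product. Today's date is $today."
system_prompt = load_prompt("support_agent_system", product="Acme CRM", today="2025-01-15")
Because templates are files in Git, prompt changes show up as diffs and flow through the same review and CI gates as any other code.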
Version Control Workflow
Wells Fargo's 600+ AI use cases in production rely on systematic version control. Here's the recommended workflow:
1. Feature Branch Development
Create feature branches for agent changes. Prompt updates, tool additions, and orchestration logic all follow standard Git workflows.
# Edit prompts/tools/orchestration
git commit -m "Add semantic search tool to knowledge agent"
2. Code Review for Prompts
Treat prompt changes like code changes. Review for clarity, correctness, potential failure modes, and alignment with evaluation criteria.
Example: A reviewer catches that a new tool description is ambiguous, leading to incorrect tool selection in 15% of test cases.
3. Automated Testing in CI
Run evaluation datasets against prompt changes. Quality gates prevent regressions from reaching production.
CI pipeline: lint prompts → run eval suite → check pass@1 threshold → deploy to staging
4. Deployment with Rollback
Deploy the same Git SHA to dev → staging → production. Rollback is a Git revert, not database archaeology.
When issues arise: git revert abc123 restores previous working state in minutes, not hours.
"Context engineering means managing research, specs, and planning for coding agents. Systematic workflow: spec → research → plan → execute."— From "AI That Works" Episode #27: No Vibes Allowed - Live Coding
Tool Definitions as Code
Function calling definitions should live in code, not databases. Here's the pattern:
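A hedged sketch using Pydantic models as the single source of truth for tool parameters, with a small helper that emits an OpenAI-style function-calling schema (the helper and tool names are illustrative, not a specific framework's API):
from pydantic import BaseModel, Field
class SearchKnowledgeBase(BaseModel):
    """Search the internal knowledge base for relevant documents."""
    query: str = Field(description="Natural-language search query")
    top_k: int = Field(default=5, ge=1, le=20, description="Number of results to return")
def tool_schema(model: type[BaseModel]) -> dict:
    """Convert a Pydantic model into a function-calling tool definition."""
    return {
        "type": "function",
        "function": {
            "name": model.__name__,
            "description": model.__doc__ or "",
            "parameters": model.model_json_schema(),
        },
    }
TOOLS = [tool_schema(SearchKnowledgeBase)]   # versioned, reviewable, type-checked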
Monorepo vs Polyrepo for Agents
For agent systems, the question arises: should agents live in the same repository as the application they serve, or in separate repositories?
Repository Strategy Decision Matrix
Monorepo (Recommended for Most)
- ✓ Pro: Agent changes deploy with app changes atomically
- ✓ Pro: Shared types and contracts stay in sync
- ✓ Pro: Single CI/CD pipeline tests integration
- ✓ Pro: Simpler dependency management
- ⚠ Con: Requires good monorepo tooling (nx, turborepo)
Polyrepo (For Large Orgs)
- ✓ Pro: Team autonomy for agent development
- ✓ Pro: Independent deployment cadence
- ⚠ Con: Contract drift between agent and app
- ⚠ Con: Complex versioning and compatibility testing
- ⚠ Con: Slower iteration on integrated features
Configuration Files: YAML vs TOML vs Code
Agent configuration should be declarative and version-controlled. Here's a comparison:
| Format | Pros | Cons | Best For |
|---|---|---|---|
| YAML | Human-readable, widely supported, supports comments | Indentation-sensitive, implicit type coercion can surprise | K8s deployments, CI/CD |
| TOML | Explicit structure, great for config | Less common, learning curve | Python projects (pyproject.toml) |
| Python/TS | Type safety, programmatic generation | Requires code execution to parse | Complex conditional config |
| JSON | Universal support, strict syntax | No comments, verbose for large configs | Tool schemas, API contracts |
Production Pattern: Semantic Versioning for Agents
Apply semantic versioning to agent releases to communicate impact clearly:
MAJOR version (v2.0.0)
Trigger: Breaking changes to agent behavior, tool interfaces, or output format
Example: Switching from ReAct to deliberative architecture, removing deprecated tools
MINOR version (v1.3.0)
Trigger: New capabilities, additional tools, improved prompts maintaining backward compatibility
Example: Adding new search tool while maintaining existing tool interfaces
PATCH version (v1.2.1)
Trigger: Bug fixes, prompt clarifications, performance improvements
Example: Fixing tool description ambiguity that caused selection errors
Takeaways
One Codebase, Many Deploys
- ✓ Store prompts, tools, and orchestration logic in version control, not databases
- ✓ Use template interpolation for prompts to enable testing and environment variation
- ✓ Apply standard Git workflows: feature branches, code review, automated testing, rollback capability
- ✓ Version agent releases semantically to communicate impact clearly
- ✓ Treat tool definitions as typed code with validation (Pydantic, TypeScript, etc.)
Factor 2: Dependencies
Explicitly declare and isolate dependencies
TL;DR
- • Model versions, prompt templates, and tool definitions are dependencies that must be explicitly declared and locked
- • Implicit dependencies on "latest" model versions or external APIs without contracts lead to silent production failures
- • Dependency isolation through virtual environments and lock files enables reproducible agent behavior
The Problem: "Latest" Model Drift
A common anti-pattern: calling gpt-5 without version pinning. OpenAI updates model weights periodically. Your agent works Monday, behaves differently Tuesday, and you have no idea what changed.
❌ Anti-Pattern: Implicit Model Dependencies
response = client.chat.completions.create(
model="gpt-5", # Which version? When did it change?
messages=messages
)
Failure mode: Model update changes behavior. Evaluation scores drop. Root cause is invisible because you don't know what version was running last week.
✓ Pattern: Explicit Model Pinning
response = client.chat.completions.create(
model="gpt-5-0613", # Explicit version, reproducible
messages=messages
)
Benefit: Deterministic model behavior. When you update the version, the change is explicit in version control and can be evaluated.
Agent Dependency Categories
AI agents have unique dependencies beyond traditional software packages:
Complete Agent Dependency Inventory
1. Model Dependencies
What: LLM versions, embedding models, fine-tuned models
Declaration: gpt-5-0613, text-embedding-3-small, explicit version hashes for custom models
2. Prompt Template Dependencies
What: System prompts, few-shot examples, instruction templates
Declaration: Version-controlled files, Git commit SHAs referencing specific prompt versions
3. Tool/API Dependencies
What: External APIs, vector databases, search engines, internal services
Declaration: API contract versions, OpenAPI specs, SDK versions with lock files
4. Framework Dependencies
What: LangChain, LlamaIndex, CrewAI, agent orchestration libraries
Declaration: langchain==0.1.0 in requirements.txt (Python) or pinned entries in package-lock.json (Node)
5. Data Dependencies
What: Evaluation datasets, knowledge base snapshots, few-shot example collections
Declaration: Dataset version hashes, data versioning systems (DVC, LakeFS)
Lock Files for Reproducibility
Lock files ensure every deployment uses identical dependency versions. For AI agents, this extends beyond Python packages:
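As a hedged illustration, an agent-level lock file can be generated by a small script that records content hashes for prompts plus pinned identifiers for models and datasets, so CI can detect drift. The agent.lock.json name and fields are assumptions, not a standard format:
import hashlib
import json
from pathlib import Path
def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()
lock = {
    "model": "gpt-5-0613",                                   # pinned model version
    "embedding_model": "text-embedding-3-small",
    "prompts": {p.name: sha256(p) for p in Path("prompts").glob("*.txt")},
    "eval_dataset": "data/evaluation_v2.jsonl.dvc",          # DVC pointer (covered below)
}
Path("agent.lock.json").write_text(json.dumps(lock, indent=2, sort_keys=True))
# CI regenerates the lock and fails the build if it differs from the committed version.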
Dependency Isolation Strategies
Isolation prevents dependency conflicts and enables parallel development:
| Isolation Method | Use Case | Tool |
|---|---|---|
| Virtual Environments | Python dependency isolation | venv, poetry, uv |
| Container Images | Complete runtime isolation | Docker, containerd |
| Monorepo Workspaces | Multi-agent dependency management | npm workspaces, pnpm, yarn |
| Model Registries | Fine-tuned model versioning | MLflow, Weights & Biases |
Managing External API Dependencies
Tools that agents call are dependencies with their own versioning and breaking changes. Protect against API drift:
1. API Contract Versioning
Store OpenAPI specs for external tools. Version them alongside your agent code. Automated tests detect breaking changes.
openapi: 3.0.0
info:
version: 2.1.0
paths:
/search:
post:
parameters: ...
2. SDK Version Locking
When tools provide SDKs, lock to specific versions. Test SDK upgrades in isolation before deploying.
sendgrid==6.9.7 # Email tool SDK
stripe==5.4.0 # Payment tool SDK
googlemaps==4.10.0 # Maps tool SDK
3. Tool Response Validation
Validate tool responses against expected schemas. Detect API changes that break assumptions.
from pydantic import BaseModel
class SearchResponse(BaseModel):
results: list[dict]
total_count: int
query_time_ms: float
# Runtime validation catches schema changes
validated = SearchResponse(**api_response)
"Dynamic schema generation and token-efficient tooling are critical. Context engineering means managing what goes into the prompt, including tool definitions."— From "AI That Works" Episode #25: Dynamic schema generation
Prompt Template Versioning
Prompt templates are code dependencies. Version them systematically:
Prompt Versioning Strategies
Git Tags for Major Versions
Tag prompt releases: prompts-v1.3.0
Lock file references tag. Rollback = checkout tag.
Content Hashing for Exact Matching
SHA256 hash of prompt content in lock file
Detects unintended prompt drift. Ensures exact reproduction.
Dataset Dependencies and DVC
Evaluation datasets are dependencies that need versioning. Large datasets don't belong in Git. Use Data Version Control (DVC):
# Track evaluation dataset
dvc add data/evaluation_v2.jsonl
# Commit .dvc file (small) to Git, store data remotely
git add data/evaluation_v2.jsonl.dvc .gitignore
git commit -m "Add evaluation dataset v2"
# Push data to remote storage (S3, GCS, Azure Blob)
dvc push
Detecting Dependency Drift
Even with locked dependencies, drift can occur. Automated detection is critical:
CI Pipeline Checks
- • Verify all lock file hashes match actual content
- • Test tool API contracts against live endpoints in staging
- • Compare model outputs to baseline evaluation results
- • Flag any dependency version mismatches between environments
Production Monitoring
- • Log actual model versions used for each request (not just requested version)
- • Monitor tool API response schemas for unexpected changes
- • Alert when evaluation metrics drift beyond thresholds
- • Track dependency resolution times (slow = potential service issues)
Upgrading Dependencies Safely
Dependencies must be upgraded eventually. Make it systematic:
Safe Dependency Upgrade Workflow
1. Isolated Testing
- • Create feature branch for dependency update
- • Update lock file with new version
- • Run full evaluation suite against new dependency
2. Regression Analysis
- • Compare evaluation metrics: old version vs new version
- • Investigate any significant changes in behavior
- • Document breaking changes and required adaptations
3. Staged Rollout
- • Deploy to dev → staging → canary (5% production) → full production
- • Monitor metrics at each stage before proceeding
- • Maintain rollback capability via previous lock file
Case Study: Model Version Drift at Scale
Takeaways
Explicitly Declare and Isolate Dependencies
- ✓ Pin model versions explicitly (gpt-5-0613, not gpt-5)
- ✓ Version prompt templates with Git tags or content hashes
- ✓ Lock tool API contracts and SDK versions in requirements files
- ✓ Track evaluation datasets and knowledge bases with DVC
- ✓ Validate tool responses against schemas to detect API drift
- ✓ Upgrade dependencies through isolated testing and staged rollouts
Factor 3: Config
Store config in the environment
TL;DR
- • Configuration that varies between environments (dev, staging, production) belongs in environment variables, not code
- • Model selection, API keys, temperature settings, timeout values should be configurable per deployment without code changes
- • Proper config management enables same codebase to run safely across development, staging, and production
The Problem: Config Sprawl
Teams often start with hardcoded config values in code. When they need environment-specific behavior, they add conditional logic: if env == "production". This spirals into unmaintainable config sprawl across multiple files and databases.
❌ Anti-Pattern: Config in Code
if os.getenv("ENV") == "production":
model = "gpt-5-0613"
temperature = 0.7
api_key = "sk-prod-..." # SECURITY RISK!
else:
model = "gpt-5-mini"
temperature = 1.0
api_key = "sk-dev-..." # LEAKED IN VERSION CONTROL!
Failures: (1) Secrets in code, (2) Adding staging requires code changes, (3) Config logic scattered across files
✓ Pattern: Config from Environment
from pydantic_settings import BaseSettings
class AgentConfig(BaseSettings):
model: str # From AGENT_MODEL env var
temperature: float = 0.7
openai_api_key: str # From OPENAI_API_KEY
class Config:
env_prefix = "AGENT_"
Benefits: (1) No secrets in code, (2) Same code runs everywhere, (3) New environments = new .env file
What Belongs in Config
Configuration is anything that varies between deployments. For AI agents, this includes:
Agent Configuration Taxonomy
Model Configuration
Model name/version, temperature, top_p, max_tokens, timeout, retry attempts
Example: Use faster/cheaper models in dev, production-grade models in production
API Keys & Credentials
OpenAI API key, database credentials, vector DB URL, external service tokens
Critical: Never commit secrets to version control. Use secret management systems.
Feature Flags
Enable/disable experimental features, A/B test variants, rollout percentages
Example: Enable new retrieval strategy for 10% of production traffic
Resource Limits
Rate limits, cost budgets, concurrency limits, memory constraints
Example: Dev: no rate limits. Production: 1000 requests/hour/user
Behavior Tuning
Escalation thresholds, confidence cutoffs, tool selection strategies
Example: Staging: escalate after 2 failures. Production: escalate after 5 failures
Environment-Based Config Pattern
Standard practice: .env files for local development, platform-provided env vars in production.
# .env.development
AGENT_TEMPERATURE=1.0 # Higher creativity in dev
AGENT_MAX_TOKENS=1024
AGENT_COST_LIMIT_USD=1.00 # Low limit in dev
OPENAI_API_KEY=sk-dev-...
VECTOR_DB_URL=http://localhost:6333 # Local Qdrant
ENABLE_EXPERIMENTAL_TOOLS=true
LOG_LEVEL=DEBUG
# .env.production
AGENT_TEMPERATURE=0.7 # Lower for consistency
AGENT_MAX_TOKENS=2048
AGENT_COST_LIMIT_USD=100.00 # Per-user daily limit
OPENAI_API_KEY=${SECRET_OPENAI_KEY} # From secret manager
VECTOR_DB_URL=https://prod.vectordb.company.com
ENABLE_EXPERIMENTAL_TOOLS=false
LOG_LEVEL=INFO
Type-Safe Config with Pydantic Settings
Using Pydantic Settings provides validation, type safety, and clear error messages:
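Expanding the snippet from the previous section into a fuller sketch, using the pydantic-settings 2.x style; the field names are illustrative:
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict
class AgentConfig(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="AGENT_", env_file=".env")
    model: str = Field(description="e.g. gpt-5-0613")        # AGENT_MODEL
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)  # AGENT_TEMPERATURE
    max_tokens: int = Field(default=2048, gt=0)              # AGENT_MAX_TOKENS
    cost_limit_usd: float = Field(default=1.00, gt=0)        # AGENT_COST_LIMIT_USD
    openai_api_key: str                                      # AGENT_OPENAI_API_KEY (or alias to OPENAI_API_KEY)
    log_level: str = "INFO"                                  # AGENT_LOG_LEVEL
config = AgentConfig()   # raises a readable ValidationError if anything is missing or mistyped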
Secret Management
API keys and credentials require special handling. Never commit them to version control:
| Environment | Secret Storage | Access Method |
|---|---|---|
| Local Dev | .env file (gitignored) | Loaded at app startup |
| CI/CD | GitHub Secrets, GitLab CI vars | Injected as env vars |
| Cloud Platforms | Platform secrets (Heroku Config, Vercel Env) | Runtime env vars |
| Kubernetes | K8s Secrets, Vault, AWS Secrets Manager | Mounted as volumes or env vars |
Feature Flags for Safe Rollouts
Feature flags enable decoupling deployment from release. Deploy code to production with features disabled, then enable via config:
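A hedged sketch of config-driven flags with percentage rollout, assuming flags arrive as environment variables (dedicated flag services follow the same shape); the FEATURE_* naming is an assumption:
import hashlib
import os
def flag_enabled(name: str, user_id: str, default: str = "0") -> bool:
    """Percentage rollout: FEATURE_NEW_RETRIEVAL=10 enables the flag for ~10% of users."""
    raw = os.getenv(f"FEATURE_{name.upper()}", default)
    try:
        percent = int(raw)
    except ValueError:
        return raw.lower() in ("true", "1", "yes")   # plain boolean flags also work
    # Stable hash so the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
if flag_enabled("NEW_RETRIEVAL", user_id="user-123"):
    retriever = "hybrid_v2"   # experimental strategy, enabled for a slice of traffic
else:
    retriever = "vector_v1"   # current default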
Environment-Specific Behavior Patterns
Some behaviors should differ by environment. Make them configurable, not hardcoded:
Development: Fail Fast, Log Everything
ENABLE_STRICT_VALIDATION=true
FAIL_ON_TOOL_ERRORS=true # Don't hide errors
RATE_LIMIT_ENABLED=false # No throttling in dev
Staging: Production-Like, Isolated
ENABLE_STRICT_VALIDATION=true
FAIL_ON_TOOL_ERRORS=false # Graceful degradation
RATE_LIMIT_ENABLED=true
USE_PRODUCTION_DATA_SNAPSHOT=true # Test with real-ish data
Production: Resilient, Observable, Controlled
ENABLE_STRICT_VALIDATION=false # Allow some flexibility
FAIL_ON_TOOL_ERRORS=false
RATE_LIMIT_ENABLED=true
ENABLE_HUMAN_IN_LOOP=true # Critical decisions escalate
COST_BUDGET_ENFORCEMENT=strict
The .env.example Pattern
Commit a template showing what config is needed, without actual values:
AGENT_MODEL=gpt-5-0613 # or gpt-5-mini-0125 for dev
AGENT_TEMPERATURE=0.7
AGENT_MAX_TOKENS=2048
# API Keys (get from platform dashboards)
OPENAI_API_KEY=sk-... # From https://platform.openai.com
VECTOR_DB_URL=http://localhost:6333 # Local Qdrant
# Feature Flags
ENABLE_EXPERIMENTAL_TOOLS=false
ENABLE_HUMAN_IN_LOOP=true
# Observability
LOG_LEVEL=INFO # DEBUG | INFO | WARNING | ERROR
LOGFIRE_TOKEN= # Optional: from logfire.pydantic.dev
New developers copy .env.example to .env, fill in their own API keys, and start development without guessing what config is needed.
Configuration Validation at Startup
Fail fast if configuration is invalid. Don't wait for runtime errors:
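A minimal sketch: load and validate once at startup, and exit with a readable error instead of failing mid-request. It assumes the AgentConfig class shown earlier in this chapter; the module path is hypothetical.
import sys
from pydantic import ValidationError
from agent.config import AgentConfig   # the Settings class defined above (module path assumed)
def load_config() -> AgentConfig:
    try:
        return AgentConfig()            # reads and validates environment variables
    except ValidationError as exc:
        print(f"Invalid agent configuration:\n{exc}", file=sys.stderr)
        sys.exit(1)                     # fail fast: never start serving with bad config
config = load_config()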
Takeaways
Store Config in the Environment
- ✓ Use environment variables for all config that varies between deployments
- ✓ Validate config at startup with type-safe libraries (Pydantic Settings)
- ✓ Never commit secrets—use .env (gitignored) locally, secret managers in production
- ✓ Commit .env.example as template showing required configuration
- ✓ Use feature flags to decouple deployment from release
- ✓ Environment-specific behavior should be configured, not hardcoded with conditionals
Factor 4: Backing Services
Treat backing services as attached resources
TL;DR
- • Vector databases, model APIs, and tool APIs are backing services—resources attached via config, not hardcoded
- • Services should be swappable without code changes: ChromaDB ↔ FAISS, OpenAI ↔ Anthropic, local ↔ cloud
- • Abstraction layers enable testing with local services, deploying with production-grade infrastructure
The Problem: Hardcoded Service Dependencies
Teams often tightly couple agents to specific services: "We use OpenAI" becomes scattered import openai calls throughout the codebase. When you need to test locally, switch models, or migrate providers, you're rewriting code instead of changing config.
❌ Anti-Pattern: Hardcoded Service Coupling
import openai
import chromadb
openai_client = openai.OpenAI(api_key="sk-...")
chroma_client = chromadb.Client()
response = openai_client.chat.completions.create(...)
results = chroma_client.query(...)
Problem: Want to switch to Anthropic? Local ChromaDB to hosted Qdrant? You're grepping the codebase and refactoring dozens of files.
✓ Pattern: Service Abstraction with Config-Based Attachment
llm = get_llm_client(config.llm_provider) # From config
vector_db = get_vector_db(config.vector_db_url) # Swappable
response = llm.complete(prompt) # Same interface
results = vector_db.query(embedding) # Same interface
Benefit: Switch providers by changing LLM_PROVIDER=anthropic in .env. Code unchanged.
Agent Backing Services Taxonomy
AI agents depend on multiple categories of backing services:
Backing Service Categories
1. LLM/Model APIs
Examples: OpenAI, Anthropic, Google Vertex AI, Azure OpenAI, AWS Bedrock, local LLMs
Why attached resource: Switch models based on cost, latency, capability requirements. Test locally with smaller models, deploy with production-grade models.
2. Vector Databases
Examples: ChromaDB, FAISS, Qdrant, Pinecone, Weaviate, Milvus
Why attached resource: Local ChromaDB for dev, hosted Pinecone for production. Choose based on scale, latency, cost. Migrate without code changes.
3. External Tool APIs
Examples: Search APIs, CRM systems, databases, email services, payment processors
Why attached resource: Mock tool APIs in tests, sandbox APIs in staging, production APIs in prod. All via config.
4. Observability Services
Examples: Langfuse, Maxim AI, Arize Phoenix, Azure AI Foundry, Logfire
Why attached resource: Optional in dev, required in production. Self-hosted vs cloud. Swap platforms without code changes.
5. Data Stores
Examples: PostgreSQL, Redis, MongoDB, S3, knowledge graphs (Neo4j, TOBUGraph)
Why attached resource: Local database for dev, cloud database for prod. Session storage, knowledge bases, memory systems all configurable.
Service Abstraction Pattern
Create interface abstractions for each service category:
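A hedged sketch using Python Protocols for the interfaces and a factory keyed off config; the adapter module names are illustrative:
from typing import Protocol
class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...
class VectorStore(Protocol):
    def query(self, embedding: list[float], top_k: int = 5) -> list[dict]: ...
def get_llm_client(provider: str) -> LLMClient:
    if provider == "openai":
        from adapters.openai_llm import OpenAILLM        # thin wrapper module, assumed
        return OpenAILLM()
    if provider == "anthropic":
        from adapters.anthropic_llm import AnthropicLLM  # thin wrapper module, assumed
        return AnthropicLLM()
    raise ValueError(f"Unknown LLM provider: {provider}")
# Agent code only ever sees the interface:
# llm = get_llm_client(config.llm_provider); response = llm.complete(prompt)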
Vector Database Abstraction
Research shows ChromaDB excels for local development, FAISS for massive scale. Production teams need to switch based on requirements:
Vector Database Comparison
Here's how to choose vector databases as attached resources:
| Database | Best For | Performance | Features |
|---|---|---|---|
| ChromaDB | Local dev, prototypes, complete DB features | In-memory, swift access | Persistence, metadata filtering, full-stack |
| FAISS | Massive scale, strict latency, local execution | Sub-millisecond, GPU acceleration 5-10× faster | No persistence/transactions (library, not DB) |
| Qdrant | Production, cloud, advanced filtering | High throughput, distributed | Full-text search, payload indexing, clustering |
| Pinecone | Managed cloud, zero ops | Scalable, managed infrastructure | Serverless, automatic scaling |
Configuration-Based Service Attachment
Services attach via environment config, not code:
# Development
VECTOR_DB_TYPE=chroma
VECTOR_DB_URL=http://localhost:6333
OBSERVABILITY_ENABLED=false # Optional in dev
# Production
VECTOR_DB_TYPE=qdrant
VECTOR_DB_URL=https://prod.qdrant.company.com
OBSERVABILITY_ENABLED=true
OBSERVABILITY_PROVIDER=langfuse
LANGFUSE_PUBLIC_KEY=${SECRET_LANGFUSE_PUBLIC_KEY}
LANGFUSE_SECRET_KEY=${SECRET_LANGFUSE_SECRET_KEY}
Same code, different backing services. Switch from local ChromaDB to production Qdrant by changing two environment variables.
Tool API Abstraction for Testing
External tool APIs should be mockable in tests, sandbox-able in staging, production-real in prod:
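One hedged way to express this: a tool Protocol with a mock implementation for tests and an HTTP implementation whose base URL (sandbox vs production) comes from config. The CRM example, endpoint path, and config fields are assumptions:
from typing import Protocol
import httpx
class CRMTool(Protocol):
    def lookup_customer(self, email: str) -> dict: ...
class MockCRM:
    """Used in unit tests and local dev: deterministic, no network."""
    def lookup_customer(self, email: str) -> dict:
        return {"email": email, "tier": "gold", "open_tickets": 2}
class HttpCRM:
    """Used in staging (sandbox URL) and production (real URL), selected via config."""
    def __init__(self, base_url: str, api_key: str):
        self.client = httpx.Client(base_url=base_url, headers={"Authorization": f"Bearer {api_key}"})
    def lookup_customer(self, email: str) -> dict:
        resp = self.client.get("/customers", params={"email": email})
        resp.raise_for_status()
        return resp.json()
def get_crm_tool(cfg) -> CRMTool:
    return MockCRM() if cfg.crm_mode == "mock" else HttpCRM(cfg.crm_url, cfg.crm_api_key)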
Multi-Provider Fallback Pattern
Production agents can attach multiple providers for resilience:
LLM_PROVIDER=openai
LLM_SECONDARY_PROVIDER=anthropic
LLM_ENABLE_FALLBACK=true
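A hedged sketch of the fallback wrapper itself, reusing the LLMClient interface and get_llm_client factory sketched earlier in this chapter; retry and error classification are simplified:
class FallbackLLM:
    """Try the primary provider first; on failure, fall back to the secondary."""
    def __init__(self, primary: LLMClient, secondary: LLMClient):
        self.primary, self.secondary = primary, secondary
    def complete(self, prompt: str) -> str:
        try:
            return self.primary.complete(prompt)
        except Exception as exc:   # in practice, catch rate-limit/timeout errors specifically
            # Record the failover so traces show which provider actually answered.
            print(f"primary LLM failed ({exc!r}); falling back to secondary")
            return self.secondary.complete(prompt)
llm = FallbackLLM(get_llm_client("openai"), get_llm_client("anthropic"))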
Knowledge Graph Memory as Backing Service
Knowledge graph memory systems like Graphiti provide superior relationship modeling. Treat them as attached resources:
"Graphiti achieves extremely low-latency retrieval with P95 latency of 300ms through hybrid search combining semantic embeddings, keyword (BM25) search, and direct graph traversal—avoiding LLM calls during retrieval."— Research from Neo4j: Graphiti Knowledge Graph Memory
MEMORY_TYPE=knowledge_graph # or vector_db
KNOWLEDGE_GRAPH_URL=bolt://localhost:7687
KNOWLEDGE_GRAPH_USER=neo4j
KNOWLEDGE_GRAPH_PASSWORD=${SECRET_NEO4J_PASSWORD}
# Agent code - same interface for different memory backends
memory = get_memory_service(config.memory_type)
Observability as Optional Backing Service
Observability platforms (Langfuse, Maxim AI, Arize Phoenix) should be optional in development, required in production:
Development: OBSERVABILITY_ENABLED=false — no tracing overhead.
Production: OBSERVABILITY_ENABLED=true — full distributed tracing.
Health Checks for Attached Services
Validate that backing services are reachable at startup:
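A minimal sketch of startup health checks; the smoke calls are placeholders since the cheapest "ping" differs per service:
import sys
def check_backing_services(config) -> None:
    failures = []
    try:
        vector_db = get_vector_db(config.vector_db_url)     # factory from this chapter
        vector_db.query(embedding=[0.0] * 8, top_k=1)       # cheap smoke query (placeholder)
    except Exception as exc:
        failures.append(f"vector db: {exc}")
    try:
        llm = get_llm_client(config.llm_provider)
        llm.complete("ping")                                 # tiny, low-cost request (placeholder)
    except Exception as exc:
        failures.append(f"llm provider: {exc}")
    if failures:
        print("Backing services unavailable:\n  " + "\n  ".join(failures), file=sys.stderr)
        sys.exit(1)                                          # fail fast at startup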
Service Migration Pattern
Migrating from one backing service to another requires a systematic approach:
Backing Service Migration Steps
1. Implement New Adapter
Create adapter for new service implementing same interface. Add to factory function.
2. Dual-Write Phase (Optional for Stateful Services)
For data stores, write to both old and new services. Read from old service.
3. Migrate Data (If Applicable)
For vector DBs, backfill embeddings. For databases, migrate historical data.
4. Switch Read Traffic
Change config to read from new service. Monitor metrics closely.
5. Remove Old Service
After validation period, remove old service adapter and config.
Case Study: Wells Fargo's Modular Architecture
Takeaways
Treat Backing Services as Attached Resources
- ✓ Abstract services behind interfaces—same code, swappable backends
- ✓ Attach services via environment config, not hardcoded imports
- ✓ Enable testing with mocks, development with local services, production with cloud infrastructure
- ✓ ChromaDB for local dev, FAISS for massive scale, Qdrant/Pinecone for production—choose via config
- ✓ Multi-provider fallback (OpenAI → Anthropic on rate limit) for resilience
- ✓ Health check backing services at startup—fail fast if unavailable
Factor 5: Build, Release, Run
Strictly separate build and run stages
TL;DR
- • Evaluation-driven development transforms agent iteration from gut-feel to data-driven, enabling 5× faster shipping
- • Quality gates in CI/CD pipelines prevent regressions—prompt changes must pass evaluation before reaching production
- • Offline + online + continuous evaluation provides comprehensive quality assessment throughout the agent lifecycle
The Problem: "It Works on My Machine"
Agents work brilliantly in local testing. You push to production. Users report failures. You can't reproduce. The culprit: no separation between build (evaluation, testing) and run (production execution).
❌ Anti-Pattern: Manual Testing Only
The workflow: Developer changes prompt. Tests manually with 3-5 examples. "Looks good!" Ships to production.
The failure: Agent breaks on edge cases not tested manually. Pass@8 consistency is ~25% (τ-bench data), but manual testing never caught it.
The impact: Production regressions discovered by users, not before deployment.
✓ Pattern: Evaluation-Driven Workflow
The workflow: Developer changes prompt → Automated eval suite runs 100+ test cases → Quality gates check pass rate → Only ships if threshold met.
The benefit: Regressions caught in CI, not production. Data-driven decisions on prompt quality.
The evidence: Teams report 5× faster shipping with evaluation frameworks (Maxim AI data).
The Three Stages
For agents, the 12-Factor "build, release, run" stages map to evaluation and deployment:
Agent Lifecycle Stages
BUILD: Prompt Compilation & Offline Evaluation
What happens: Prompt templates compiled with variables. Tool schemas generated. Eval suite runs against test datasets. Code linted. Type checking passes.
Output: Validated agent artifact with evaluation metrics attached (pass@1: 87%, latency: 1.2s avg, cost: $0.03/query)
RELEASE: Deployment with Metadata
What happens: BUILD artifact + environment config → release. Git SHA + eval metrics + deployment timestamp tagged together.
Output: Immutable release package with full traceability (which code version, which evaluation results, which config)
RUN: Production Execution with Monitoring
What happens: Agent serves user requests. Online evaluation monitors production quality. Continuous evaluation tracks drift.
Output: Observable production behavior with real-time metrics and drift detection
"Continuous evaluation transforms AI agents from static tools into learning systems that improve over time."— Microsoft Azure AI: Continuously Evaluate Your AI Agents
Offline Evaluation: Pre-Deployment Quality Assessment
Offline evaluators assess quality during development before deployment, using test datasets:
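A minimal sketch of an offline eval runner over a JSONL dataset. The run_agent entrypoint, its tools_called attribute, and the dataset fields are assumptions standing in for your own agent and graders:
import json
from pathlib import Path
def run_offline_eval(dataset_path: str) -> dict:
    cases = [json.loads(line) for line in Path(dataset_path).read_text().splitlines() if line.strip()]
    passed = 0
    for case in cases:
        output = run_agent(case["input"])                    # your agent entrypoint (assumed)
        if case["expected_tool"] in output.tools_called:     # example check; add more graders as needed
            passed += 1
    return {"total": len(cases), "pass_rate": passed / len(cases)}
metrics = run_offline_eval("data/eval_full.jsonl")
print(metrics)   # e.g. {"total": 120, "pass_rate": 0.87}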
LLM-as-a-Judge for Scalable Evaluation
Manual evaluation doesn't scale. LLM-as-judge enables automated quality assessment:
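A hedged sketch of an LLM-as-judge grader using the standard chat-completions API; the rubric, the 1-10 grounding scale, and the choice of judge model are illustrative:
from openai import OpenAI
judge = OpenAI()
JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Retrieved context: {context}
Agent answer: {answer}
Score 1-10 for how well the answer is grounded in the context.
Respond with only the integer score."""
def judge_grounding(question: str, context: str, answer: str) -> int:
    resp = judge.chat.completions.create(
        model="gpt-5-0613",   # pin the judge model too (Factor 2)
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())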
Quality Gates in CI/CD
Quality gates are checkpoints requiring minimum evaluation thresholds before proceeding:
PR cannot merge until evaluation passes quality gates. Regressions caught before merging.
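A hedged sketch of the gate itself: a script the CI pipeline runs after the eval suite, exiting non-zero (and therefore failing the check) when thresholds are not met. Threshold values mirror this chapter's Takeaways; the results-file field names are assumptions:
import json
import sys
THRESHOLDS = {"min_pass_rate": 0.85, "max_avg_latency_s": 2.0, "max_cost_per_query_usd": 0.05}
def check_quality_gates(results_path: str) -> None:
    r = json.load(open(results_path))
    failures = []
    if r["pass_rate"] < THRESHOLDS["min_pass_rate"]:
        failures.append(f"pass rate {r['pass_rate']:.1%} below {THRESHOLDS['min_pass_rate']:.0%}")
    if r["avg_latency_s"] > THRESHOLDS["max_avg_latency_s"]:
        failures.append(f"latency {r['avg_latency_s']}s over budget")
    if r["cost_per_query_usd"] > THRESHOLDS["max_cost_per_query_usd"]:
        failures.append(f"cost ${r['cost_per_query_usd']} over budget")
    if failures:
        print("Quality gate failed:\n  " + "\n  ".join(failures))
        sys.exit(1)   # non-zero exit blocks the merge
check_quality_gates("results/eval_latest.json")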
Online Evaluation: Production Monitoring
Online evaluation runs in live production, detecting drift and unexpected behaviors:
Real-Time Quality Checks
Sample production traffic (e.g., 10%) for automated evaluation. LLM-as-judge runs asynchronously on sampled requests.
Detect quality drift before it becomes a widespread user issue.
A/B Testing
Route 50% of traffic to prompt variant A, 50% to variant B. Online evaluation compares quality metrics across variants.
Data-driven decisions: which prompt performs better in production?
Human Feedback Collection
Thumbs up/down on agent responses. Feedback integrated into evaluation datasets for continuous improvement.
Production examples become test cases for future iterations.
Continuous Evaluation
Continuous evaluation transforms agents from static deployments into learning systems:
"After deploying applications to production with continuous evaluation setup, teams monitor quality and safety through unified dashboards providing real-time visibility into performance, quality, safety, and resource usage."— Azure AI Foundry: Observability in Generative AI
Continuous Evaluation Components
Scheduled Batch Evaluation
- • Nightly runs against full eval dataset
- • Track metrics over time (trend detection)
- • Alert on regression thresholds
Streaming Production Evaluation
- • Real-time sampling of production traffic
- • Immediate drift detection
- • Anomaly alerting (latency spikes, error rate increases)
RAGAS Framework for RAG Evaluation
For retrieval-augmented agents, RAGAS provides structured evaluation:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Faithfulness | How well answer is grounded in retrieved context | Prevents hallucination |
| Answer Relevancy | How directly answer addresses query | Measures usefulness |
| Context Precision | Quality of retrieved documents | Retrieval effectiveness |
| Context Recall | Completeness of retrieved information | Coverage assessment |
Case Study: Rely Health's Evaluation Infrastructure
Evaluation Dataset Management
Evaluation datasets are living artifacts requiring systematic management:
Dataset Curation
- • Start small: 20-50 diverse test cases covering main scenarios
- • Grow incrementally: Add production edge cases as failures occur
- • Category balance: Ensure coverage of tool selection, grounding, error handling, edge cases
- • Version with DVC: Track dataset changes alongside code changes
Production Examples → Test Cases
When production failures occur, systematically add them to eval datasets:
- 1. User reports incorrect agent behavior
- 2. Debug, identify root cause
- 3. Create test case from production example
- 4. Fix issue, verify test now passes
- 5. Test case prevents regression forever
Release Tagging with Evaluation Metadata
Every release should be tagged with evaluation results for traceability:
eval_results=$(python -m eval.run_offline_eval --output json)
# Create Git tag with eval metadata
git tag -a v1.4.0 -m "$(cat <<'EOF'
Release v1.4.0: Improved tool selection accuracy
Evaluation Results:
- Pass rate: 92.5% (up from 87.3%)
- Avg latency: 1.1s (down from 1.4s)
- Cost per query: $0.042 (down from $0.051)
- Tool selection accuracy: 96.2%
- Grounding score: 8.7/10
Changes:
- Refined system prompt for clearer tool descriptions
- Added few-shot examples for ambiguous queries
EOF
)"
Takeaways
Strictly Separate Build and Run Stages
- ✓ BUILD: Offline evaluation with test datasets, quality gates block bad code
- ✓ RELEASE: Tag deployments with evaluation metrics for traceability
- ✓ RUN: Online evaluation monitors production quality, continuous evaluation tracks drift
- ✓ Use LLM-as-judge for scalable automated evaluation
- ✓ Quality gates in CI/CD prevent regressions (min 85% pass rate, max latency, max cost)
- ✓ Transform production failures into test cases for continuous improvement
Factor 6: Processes
Execute the app as one or more stateless processes
TL;DR
- • Agent processes should be stateless—conversation state persists externally in databases, not in process memory
- • Tiered memory architecture (working/episodic/long-term) enables stateless agents with sophisticated context management
- • Stateless processes enable horizontal scaling, graceful restarts, and multi-instance deployments
Stateless Agent Execution
Agent processes should not store conversation state in memory. Load state from external storage at request start, persist changes at request end:
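A minimal sketch of a stateless request handler, assuming external session_store, memory, and run_agent helpers: state is loaded at the start of each request and persisted at the end, so any instance can serve any session:
def handle_request(session_id: str, user_message: str) -> str:
    # 1. Load state from external storage -- nothing lives in process memory between requests.
    history = session_store.load(session_id)                  # e.g. Redis working memory (assumed helper)
    long_term = memory.retrieve(user_message, top_k=5)        # vector / knowledge-graph recall (assumed helper)
    # 2. Run the agent with the reconstructed context.
    reply = run_agent(messages=history + [{"role": "user", "content": user_message}],
                      context=long_term)
    # 3. Persist changes before returning; the process can now be restarted or scaled freely.
    session_store.append(session_id, user_message, reply)
    return reply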
Tiered Memory Architecture
Based on research (§5.1), implement tiered memory for scalable state management:
Working Memory (Short-Term)
Rolling buffer of recent conversation (last 5-10 turns). Stored in Redis for fast access. Expires after session timeout.
Episodic Memory (Medium-Term)
Recent sessions (last 7-30 days). Stored in PostgreSQL. Used for context across sessions.
Long-Term Memory (Persistent)
Vector embeddings in ChromaDB/Qdrant or knowledge graphs (Graphiti). Semantic retrieval for relevant context.
Takeaways
Execute as Stateless Processes
- ✓ Agent processes stateless—load state from external storage, persist changes after execution
- ✓ Tiered memory (working/episodic/long-term) enables sophisticated context without process state
- ✓ Stateless design enables horizontal scaling and graceful restarts
Factor 7: Port Binding
Export services via port binding
TL;DR
- • Agents should expose APIs (REST, WebSocket) for interaction, not embed as libraries in larger applications
- • Self-contained agent services bind to ports, making them independently deployable and testable
Agent API Pattern
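A minimal FastAPI sketch of the pattern, reusing the stateless handle_request handler sketched in Factor 6; route names and request shape are illustrative, not a prescribed API:
import os
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="support-agent")
class AgentRequest(BaseModel):
    session_id: str
    message: str
@app.post("/v1/agent/chat")
def chat(req: AgentRequest) -> dict:
    reply = handle_request(req.session_id, req.message)   # stateless handler from Factor 6
    return {"session_id": req.session_id, "reply": reply}
@app.get("/healthz")
def health() -> dict:
    return {"status": "ok"}
if __name__ == "__main__":
    import uvicorn
    # Port comes from config, not hardcoded (Factor 3).
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "8000")))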
Takeaways
Export Services via Port Binding
- ✓ Agents expose REST/WebSocket APIs bound to configurable ports
- ✓ Self-contained services enable independent deployment and testing
- ✓ Port from environment config (PORT=8000), not hardcoded
Factor 8: Concurrency
Scale out via the process model
TL;DR
- • Run multiple agent instances for load distribution—parallel agents complete tasks 60% faster (12s vs 30s sequential)
- • Conditional parallel tool execution: read-only tools run concurrently, state-modifying tools run sequentially
Multi-Agent Orchestration Patterns
Based on research (§1.2), production systems combine orchestration patterns:
Orchestration Pattern Selection
Sequential Orchestration
Chain agents where each step builds on previous. Use for workflows with dependencies.
Example: Research → Analysis → Report generation
Parallel/Concurrent
Multiple agents work simultaneously. Performance: ~12s (6s concurrent + 6s synthesis) vs ~30s sequential.
Example: Parallel urgency/category/resolution analysis
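A hedged asyncio sketch of the parallel pattern above: three read-only analyses run concurrently, then one synthesis step combines them. The analyze_* and synthesize coroutines are assumed wrappers around individual agent calls:
import asyncio
async def triage_ticket(ticket_text: str) -> dict:
    # Read-only analyses are independent, so run them concurrently.
    urgency, category, resolution = await asyncio.gather(
        analyze_urgency(ticket_text),      # assumed agent call
        analyze_category(ticket_text),     # assumed agent call
        suggest_resolution(ticket_text),   # assumed agent call
    )
    # Synthesis depends on all three results, so it runs afterwards as a sequential step.
    return await synthesize(urgency, category, resolution)
result = asyncio.run(triage_ticket("VPN drops every 10 minutes since yesterday's update"))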
Framework Comparison
| Framework | Best For | Architecture |
|---|---|---|
| LangGraph | Structured multi-agent coordination | Graph-based workflows with state management |
| CrewAI | Role-based teams, repeatable processes | Hierarchical, role-based |
| AutoGen | Complex, exploratory problem-solving | Conversational, iterative |
Takeaways
Scale Out via Process Model
- ✓ Run multiple agent instances for horizontal scaling
- ✓ Parallel execution for independent tasks (60% faster)
- ✓ Conditional execution: read-only concurrent, state-modifying sequential
- ✓ Framework choice: LangGraph (structured), CrewAI (role-based), AutoGen (exploratory)
Factor 9: Disposability
Maximize robustness with fast startup and graceful shutdown
TL;DR
- • Agents should handle interruptions and timeout gracefully with multi-layer degradation patterns
- • Fast startup and shutdown enable rapid iteration and deployment
Graceful Degradation Patterns
Implement multi-layer degradation when components fail:
Layer 1: Premium AI Model
Try primary model (GPT-5) for complex requests
Layer 2: Smaller/Faster Model
On failure, fallback to GPT-3.5 or Claude Haiku
Layer 3: Rule-Based Backup
If all models fail, use deterministic rules for basic functionality
Layer 4: Cached Fallback
Emergency response with cached/static fallbacks
Circuit Breaker Pattern
When service failures reach threshold, circuit breaker "trips" and redirects calls to fallback operations until service recovers.
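A hedged sketch of a minimal circuit breaker around a flaky service call; the thresholds, timing, and use of a simple fallback callable are simplified for illustration:
import time
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failures = 0
        self.opened_at: float | None = None
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
    def call(self, primary, fallback, *args):
        if self.opened_at and time.time() - self.opened_at < self.reset_after_s:
            return fallback(*args)                     # circuit open: skip the failing service
        try:
            result = primary(*args)
            self.failures, self.opened_at = 0, None    # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()           # trip the breaker
            return fallback(*args)
breaker = CircuitBreaker()
# reply = breaker.call(premium_model_call, rule_based_fallback, user_message)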
Takeaways
Maximize Robustness with Disposability
- ✓ Multi-layer degradation: premium model → faster model → rules → cache
- ✓ Circuit breaker pattern for service failures
- ✓ Never store corrupted data—skip failed items, log, continue
- ✓ Fast startup/shutdown enables rapid iteration
Factor 10: Dev/Prod Parity
Keep development, staging, and production as similar as possible
TL;DR
- • Test with same models, prompts, and tools across environments to prevent production surprises
- • Canary deployments starting at 1-5% traffic enable safe rollouts with observability
Environment Parity Principles
Dev/staging/production should differ only in config, not architecture or dependencies:
| Aspect | Dev | Staging | Production |
|---|---|---|---|
| Model Version | Same (gpt-5-0613) | Same | Same |
| Prompts | Same Git version | Same Git version | Same Git version |
| Tools | Mock/sandbox APIs | Staging APIs | Production APIs |
| Vector DB | Local ChromaDB | Hosted Qdrant | Hosted Qdrant |
Canary Deployment Pattern
Stage the rollout with monitoring at each step:
Canary Rollout Steps
1. Deploy to 1-5% of traffic
Monitor decision-making quality, task completion, user satisfaction
2. Validate metrics stable
Compare canary metrics to baseline. Look for regressions.
3. Gradually increase (10% → 25% → 50% → 100%)
Expand percentage at each stage with monitoring
4. Automated rollback if metrics degrade
Feature flags enable instant rollback without redeployment
Takeaways
Keep Development and Production Similar
- ✓ Same models, prompts, tools across dev/staging/production (differ only in config)
- ✓ Test with production-like data in staging
- ✓ Canary deployments: 1% → 5% → 25% → 50% → 100% with monitoring
- ✓ Automated rollback via feature flags if metrics degrade
Factor 11: Logs
Treat logs as event streams
TL;DR
- • Distributed tracing with OpenTelemetry enables 100× faster debugging—transform agent failures from impossible to diagnose to 5-minute root cause analysis
- • Session-level and span-level tracing makes multi-step agent behavior visible, enabling systematic optimization
- • Observability is infrastructure, not afterthought—implement from day one for production readiness
The Problem: Black Box Failures
Agent fails in production. User reports "it gave me the wrong answer." You have no idea what happened: which tools were called, what the LLM reasoned, where the failure occurred. Without observability, agent debugging is archaeology through logs that don't exist.
❌ Anti-Pattern: Print Debugging at Scale
The approach: Scattered print statements and basic logging. "Let's add more logs to figure out what's happening."
The failure: Logs don't show reasoning flow. Multi-step agent execution is impossible to reconstruct. Tool call sequences are invisible.
The impact: Engineers manually test prompts for days trying to reproduce production issues.
✓ Pattern: Distributed Tracing from Day One
The approach: OpenTelemetry instrumentation capturing sessions, spans, LLM calls, tool executions. Visual trace analysis.
The benefit: Click on failed request → see complete execution trace → identify exact failure point in seconds.
The evidence: Rely Health achieved 100× faster debugging with observability infrastructure.
"Traditional observability relies on metrics, logs, and traces suitable for conventional software, but AI agents introduce non-determinism, autonomy, reasoning, and dynamic decision-making requiring advanced frameworks."— OpenTelemetry: AI Agent Observability - Evolving Standards
OpenTelemetry GenAI Semantic Conventions
OpenTelemetry has become the industry standard for distributed tracing. OpenLLMetry semantic conventions are now officially part of OpenTelemetry—a significant step in standardizing LLM application observability.
Session and Span Architecture
Agent observability uses hierarchical tracing:
Tracing Hierarchy
Session (Trace)
What: Complete agent task from user input to final response
Contains: All spans for single user request. Session ID links all related activity.
Metadata: User ID, session duration, total cost, final outcome (success/failure)
Span (Individual Step)
What: Individual operation within agent execution
Types: LLM call, tool execution, retrieval operation, reasoning step, validation check
Metadata: Span type, duration, input/output, tokens used, cost, errors
Instrumentation Pattern
Instrument agents with OpenTelemetry from first commit:
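A hedged sketch of span instrumentation around one agent request (tracer provider and exporter setup omitted); the attribute names loosely follow the OpenTelemetry GenAI semantic conventions, and llm/search_tool are assumed clients from earlier factors:
from opentelemetry import trace
tracer = trace.get_tracer("support-agent")
def run_agent(user_message: str) -> str:
    # Session-level span for the whole request; child spans for each step.
    with tracer.start_as_current_span("agent.session") as session:
        session.set_attribute("agent.input", user_message[:500])
        with tracer.start_as_current_span("llm.call") as span:
            span.set_attribute("gen_ai.request.model", "gpt-5-0613")
            answer = llm.complete(user_message)            # assumed client from Factor 4
            span.set_attribute("gen_ai.response.length", len(answer))
        with tracer.start_as_current_span("tool.search") as span:
            results = search_tool(answer)                  # assumed tool call
            span.set_attribute("tool.result_count", len(results))
        session.set_attribute("agent.outcome", "success")
        return answer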
Observability Platform Comparison
Production-ready observability platforms compared:
| Platform | Best For | Key Features |
|---|---|---|
| Langfuse | Self-hosted, full control | Agent graphs, session tracking, datasets, prompt management |
| Arize Phoenix | Open-source, hybrid ML/LLM | OTLP tracing, LLM evals, span replay, no vendor lock-in |
| Maxim AI | Cross-functional teams | End-to-end eval + observability, agent simulation, no-code UI |
| Azure AI Foundry | Enterprise deployments | Unified dashboard, lifecycle evaluation, compliance features |
What to Log: Agent-Specific Telemetry
Traditional logs capture code execution. Agent logs capture reasoning, decisions, and context:
LLM Call Metadata
- • Model name and version
- • Full prompt (system + user messages)
- • Model parameters (temperature, max_tokens, top_p)
- • Response text and finish_reason
- • Token usage (prompt_tokens, completion_tokens, total_tokens)
- • Latency and cost
Tool Call Metadata
- • Tool name and function signature
- • Input parameters (sanitized if sensitive)
- • Tool execution result
- • Success/failure status
- • Execution duration
- • Any errors or exceptions
Decision Points
- • Tool selection reasoning (if available)
- • Confidence scores
- • Escalation triggers (when/why human-in-loop activated)
- • Fallback paths taken
- • Validation failures
Agent Graph Visualization
Langfuse's agent graph visualization renders each session as a node graph, making complex multi-step workflows easy to follow at a glance.
Cost and Performance Dashboards
Real-time dashboards track critical production metrics:
Production Observability Metrics
Cost Metrics
- • Cost per session (avg, p50, p95, p99)
- • Cost by user, geography, model version
- • Daily/weekly/monthly burn rate
- • Cost anomaly detection
Performance Metrics
- • Latency (avg, p95, p99) by operation type
- • Success rate vs error rate
- • Tool call distribution
- • Token usage trends
Debugging Workflow with Traces
Observability transforms debugging from archaeology to systematic analysis:
Trace-Based Debugging Steps
1. Identify Failed Session
User report or monitoring alert → filter sessions by error status → click failed session
2. View Agent Graph
Visual representation shows exact failure point. Red node = failure. Timing shows bottlenecks.
3. Inspect Span Details
Click failing span → see exact LLM prompt, response, tool parameters, error message
4. Reproduce in Eval
Export failing case to evaluation dataset. Fix issue. Verify eval now passes.
5. Deploy with Confidence
Quality gates prevent regression. Observability confirms fix in production.
Sensitive Data Handling
Logging LLM prompts and responses requires careful handling of sensitive data:
Redaction Strategy
Automatically redact PII (emails, phone numbers, SSNs, credit cards) before logging. Use regex patterns or ML-based PII detection.
span.set_attribute("prompt_redacted", logged_prompt)
Sampling for Privacy
Log full prompts for a small percentage of traffic (e.g., 1-5%). Capture metadata (token counts, latency, errors) for 100%.
Balance observability needs with privacy requirements
Access Controls
Restrict trace access to authorized engineers. Implement audit logging for who viewed which traces. Retention policies (auto-delete after 30-90 days).
"Observability reveals failure patterns during limited rollout instead of at full scale. Better observability reduces the probability of cascading failures."— From production debugging best practices
Alerts and Anomaly Detection
Proactive monitoring prevents issues from becoming incidents:
Threshold-Based Alerts
- • Error rate > 5% for 5 minutes → alert
- • P99 latency > 5 seconds → alert
- • Cost per session > $1.00 → alert
- • Daily cost > $500 → alert
Anomaly Detection
- • Statistical anomalies in latency, cost, token usage
- • Unusual tool call patterns
- • Sudden changes in error types
- • Model response length shifts (potential prompt drift)
Case Study: Rely Health's Observability Impact
Takeaways
Treat Logs as Event Streams
- ✓ Implement OpenTelemetry distributed tracing from day one—not after production failures
- ✓ Capture session-level and span-level telemetry (LLM calls, tool executions, reasoning steps)
- ✓ Use agent graph visualization to understand complex multi-step execution flows
- ✓ Monitor cost, latency, error rates with real-time dashboards and alerts
- ✓ Redact sensitive data, implement sampling, enforce access controls for privacy
- ✓ Platform choice: Langfuse (self-hosted), Arize Phoenix (open-source), Maxim AI (managed), Azure AI (enterprise)
Factor 12: Admin Processes
Run admin/management tasks as one-off processes
TL;DR
- • Model fine-tuning, prompt optimization, and evaluation runs should execute as separate one-off processes, not within serving infrastructure
- • Admin tasks require different resources than production serving—isolate to prevent resource contention
- • Scheduled evaluations and prompt experiments run as jobs, not continuous services
What Qualifies as Admin Process
For AI agents, admin processes include management tasks separate from serving user requests:
Agent Admin Process Categories
Model Operations
Fine-tuning models on custom datasets, evaluating new model versions, A/B testing model variants
Prompt Engineering
Systematic prompt optimization, few-shot example selection, prompt variant testing
Evaluation Runs
Batch evaluation against full test datasets, benchmark comparisons, regression testing
Data Management
Knowledge base updates, embedding regeneration, vector database maintenance, dataset curation
Berkeley Function Calling Leaderboard
The Berkeley Function Calling Leaderboard (BFCL) is the de facto standard for evaluating function calling—and benchmarking against it is a perfect example of an admin process for agent development.
Scheduled Evaluation Jobs
# crontab: run the evaluation suite nightly at 02:00
0 2 * * * /app/bin/run-nightly-eval.sh
# run-nightly-eval.sh
#!/bin/bash
python -m eval.run_offline_eval \
--dataset data/eval_full.jsonl \
--output results/eval_$(date +%Y%m%d).json
python -m eval.compare_to_baseline \
--current results/eval_$(date +%Y%m%d).json \
--baseline results/baseline.json \
--alert-on-regression
Takeaways
Run Admin Tasks as One-Off Processes
- ✓ Model fine-tuning, prompt optimization, evaluation runs execute as separate jobs
- ✓ Scheduled evaluations (nightly, weekly) track quality over time
- ✓ Benchmark against standards (BFCL, τ-bench) as periodic admin tasks
- ✓ Separate admin process resources from production serving infrastructure
Conclusion: Building Reliable Agent Systems
The Path Forward
Great AI agents are not LLMs with tools. They're engineered systems requiring proper architecture, observability infrastructure, and systematic evaluation. The 12-Factor methodology adapted for agents provides a proven framework for production reliability.
The Three Core Principles
1. Observable Autonomy
Agents can explore freely when every decision is traced, measured, and recoverable. Autonomy becomes reliable when it's observable—the measurement infrastructure enables the freedom.
Example: Wells Fargo's 245M interactions with complete observability—autonomy bounded by systematic measurement
2. Earned Complexity
Start simple (ReAct), add complexity only when measured evaluation proves the value. Each architectural layer justifies itself through improvement in production metrics, not theoretical sophistication.
Evidence: Simple ReAct agents match complex systems at 50% lower cost (research data)
3. Quality Velocity
Automated evaluation and continuous monitoring create a quality feedback loop that accelerates iteration. Speed becomes a quality feature when measurement is continuous.
Impact: Rely Health's 100× debugging improvement—measurement infrastructure enables velocity
Your Implementation Checklist
Week 1: Foundation
- ☐ Move prompts and tool definitions to version control
- ☐ Implement environment-based config with Pydantic Settings
- ☐ Create .env.example template for team
- ☐ Abstract backing services (LLM provider, vector DB) behind interfaces
Week 2: Observability
- ☐ Implement OpenTelemetry instrumentation
- ☐ Choose observability platform (Langfuse, Arize Phoenix, Maxim AI, Azure AI)
- ☐ Add session and span tracking to agent execution
- ☐ Create cost and latency dashboards
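A minimal sketch of the instrumentation item, using the OpenTelemetry Python SDK with a console exporter for local inspection (swap in an OTLP exporter pointed at your chosen platform). The attribute names follow the spirit of the GenAI semantic conventions but should be treated as assumptions here, and the LLM call itself is a stand-in.
# tracing.py — illustrative sketch: one traced agent step with session attributes
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # use OTLP in prod
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def run_agent_step(session_id: str, prompt: str) -> str:
    with tracer.start_as_current_span("agent.llm_call") as span:
        span.set_attribute("session.id", session_id)                 # assumed attribute name
        span.set_attribute("gen_ai.request.model", "gpt-4o-mini")    # assumed model name
        response = f"stub response to: {prompt}"  # stand-in for the real LLM client call
        span.set_attribute("gen_ai.usage.output_tokens", 0)          # record real usage here
        return response

if __name__ == "__main__":
    print(run_agent_step("session-123", "What is my account balance?"))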
Week 3: Evaluation
- ☐ Create initial evaluation dataset (20-50 diverse test cases)
- ☐ Implement offline evaluation with LLM-as-judge (sketched after this list)
- ☐ Add quality gates to CI/CD pipeline
- ☐ Set up continuous evaluation (nightly runs)
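A minimal sketch of the offline-evaluation item. run_agent() and judge() are hypothetical stand-ins; in practice the judge call sends a rubric prompt to a strong model (or delegates to a platform such as Langfuse, Phoenix, or RAGAS) and returns a numeric score.
# offline_eval.py — illustrative sketch of an LLM-as-judge offline evaluation loop
import json
from pathlib import Path

def run_agent(question: str) -> str:
    return f"stub answer to: {question}"  # stand-in: call your real agent here

def judge(question: str, answer: str, expected: str) -> float:
    # Stand-in: in practice, prompt a strong model with a rubric and parse a 0-1 score.
    return 1.0 if expected.lower() in answer.lower() else 0.0

def run_offline_eval(dataset_path: str = "data/eval_cases.jsonl") -> float:
    cases = [json.loads(l) for l in Path(dataset_path).read_text().splitlines() if l.strip()]
    scores = [judge(c["question"], run_agent(c["question"]), c["expected"]) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"{len(cases)} cases, mean judge score: {mean:.2f}")
    return mean

if __name__ == "__main__":
    run_offline_eval()
Gating CI on the returned mean, compared against a stored baseline, is what turns this script into the Week 3 quality gate.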
Week 4: Production Readiness
- ☐ Implement tiered memory architecture
- ☐ Add graceful degradation and error recovery (sketched after this list)
- ☐ Set up canary deployment pipeline
- ☐ Configure alerts and anomaly detection
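A minimal sketch of the degradation item: try the primary model, fall back to a cheaper model, and finally return a safe canned response with escalation. The model calls are simulated stand-ins; in a real system, log each failure to the active trace before degrading.
# degradation.py — illustrative fallback chain for graceful degradation (sketch only)
def call_primary(prompt: str) -> str:
    raise TimeoutError("primary model timed out")  # simulated failure for the sketch

def call_fallback(prompt: str) -> str:
    return f"[fallback model] {prompt[:80]}"  # stand-in for a cheaper/smaller model

def answer_with_degradation(prompt: str) -> str:
    for attempt in (call_primary, call_fallback):
        try:
            return attempt(prompt)
        except Exception:
            continue  # in production: record the failure on the trace before degrading
    return "Sorry, I can't complete that request right now; it has been escalated for review."

if __name__ == "__main__":
    print(answer_with_degradation("Summarize today's open support tickets"))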
"Production AI systems require systematic engineering: proper architecture, context optimization, robust evaluation, human oversight patterns, and obsessive focus on UX and observability—not just prompt engineering."— Key insight from production deployments
The Reality Check
40% of AI agent projects fail to reach production. The gap isn't your LLM choice or prompt engineering—it's architectural. Teams that succeed treat agents as engineering systems requiring measurement, iteration, and governance.
The evidence is clear:
- Wells Fargo: 600+ AI use cases in production, 245M interactions, 3-10× engagement increases
- Rely Health: 100× faster debugging, 50% reduction in doctors' follow-up times
- Industry data: 5× faster shipping with evaluation frameworks; pass@8 consistency can drop to ~25% (τ-bench) without systematic testing
Where to Go Next
Join the Community
Workshop materials and community discussions:
- • 12-Factor Agents: github.com/humanlayer/12-factor-agents
- • AI That Works Podcast: boundaryml.com/podcast (Tuesdays 10 AM PST)
Essential Resources
Key frameworks and tools:
- • OpenTelemetry GenAI: opentelemetry.io/blog
- • Berkeley Function Calling Leaderboard: gorilla.cs.berkeley.edu
- • RAGAS Evaluation Framework: docs.ragas.io
Final Thoughts
Building production-ready AI agents is an engineering challenge, not a prompt engineering challenge. The teams succeeding at scale have internalized this truth: architecture matters, observability is non-negotiable, and evaluation transforms velocity.
Start simple. Measure everything. Add complexity only when evaluation proves the value. Treat agents as systems requiring systematic engineering, not magic requiring better prompts.
Your Next Steps
- Audit your current agent architecture against the 12 factors
- Implement observability this week—OpenTelemetry from day one
- Build your first evaluation dataset (20 test cases minimum)
- Join the community discussions and share your learnings
Production-ready agents are built, not prompted.
References & Sources
This ebook synthesizes research from academic papers, industry standards, production case studies, and open-source frameworks. All sources were validated for credibility and relevance to production agent deployment as of January 2025.
Agent Architecture Research
ReAct: Synergizing Reasoning and Acting in Language Models
Foundational pattern for agent reasoning and action loops.
URL: https://arxiv.org/abs/2210.03629
τ-Bench: Benchmarking AI Agents for Real-World Domains
Reveals pass@1 rates of ~61% (retail) and ~35% (airline), with pass@8 consistency dropping to ~25%.
URL: https://arxiv.org/pdf/2406.12045
AgentArch: Comprehensive Benchmark for Agent Architectures
Performance analysis of reactive, deliberative, and hybrid architectures.
URL: https://arxiv.org/html/2509.10769
12-Factor Agents Framework
Adaptation of 12-factor app methodology for AI agent production deployment.
URL: https://github.com/humanlayer/12-factor-agents
Observability & Monitoring
OpenTelemetry for Generative AI
Official GenAI semantic conventions for standardized LLM observability.
URL: https://opentelemetry.io/blog/2024/otel-generative-ai/
AI Agent Observability - Evolving Standards
W3C Trace Context, agent-specific telemetry, evaluation and governance beyond traditional observability.
URL: https://opentelemetry.io/blog/2025/ai-agent-observability/
Azure AI Foundry Observability
Enterprise observability with unified dashboards, lifecycle evaluation, continuous monitoring.
URL: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability
Langfuse: AI Agent Observability
Open-source distributed tracing, agent graphs, session tracking, dataset management.
URL: https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse
Arize Phoenix Documentation
OTLP tracing, LLM evaluations, span replay, no vendor lock-in.
URL: https://arize.com/docs/phoenix
Maxim AI: Agent Observability
End-to-end evaluation + observability, agent simulation, real-time dashboards.
URL: https://www.getmaxim.ai/products/agent-observability
Evaluation Frameworks
LLM Evaluation 101: Best Practices
Comprehensive guide to offline vs online evaluation, LLM-as-judge, continuous monitoring.
URL: https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges
LLM-as-a-Judge: Complete Guide
Scalable automated evaluation using LLMs to assess output quality.
URL: https://www.evidentlyai.com/llm-guide/llm-as-a-judge
RAGAS Documentation
RAG evaluation framework with faithfulness, relevancy, precision, recall metrics.
URL: https://docs.ragas.io/en/stable/
Azure AI: Continuously Evaluate Your AI Agents
Production continuous evaluation setup and quality monitoring.
URL: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/continuous-evaluation-agents
Berkeley Function Calling Leaderboard
De facto standard for evaluating function calling across AST, enterprise scenarios, multi-turn, agentic settings.
URL: https://gorilla.cs.berkeley.edu/leaderboard.html
Tool Use & Function Calling
Function Calling with LLMs
Best practices for tool definitions, JSON Schema, parameter validation.
URL: https://www.promptingguide.ai/applications/function_calling
MCPVerse: Real-World Benchmark for Agentic Tool Use
Performance degradation beyond 8-10 tools, model comparisons across tool counts.
URL: https://arxiv.org/html/2508.16260
JSON Schema for LLM Tools & Structured Outputs
Tool description optimization, runtime validation, dependency specification.
URL: https://blog.promptlayer.com/how-json-schema-works-for-structured-outputs-and-tool-integration/
Error Recovery and Fallback Strategies
Multi-layer degradation, circuit breakers, retry logic, graceful degradation patterns.
URL: https://www.gocodeo.com/post/error-recovery-and-fallback-strategies-in-ai-agent-development
Memory & Context Management
Build Smarter AI Agents: Manage Memory with Redis
Tiered memory architecture: working memory, episodic memory, long-term storage.
URL: https://redis.io/blog/build-smarter-ai-agents-manage-short-term-and-long-term-memory-with-redis/
Graphiti: Knowledge Graph Memory for Agentic World
P95 latency of 300ms through hybrid search, superior relationship modeling vs RAG.
URL: https://neo4j.com/blog/developer/graphiti-knowledge-graph-memory/
FAISS vs Chroma: Vector Storage Battle
Performance comparison: FAISS for massive scale/speed, ChromaDB for features/local dev.
URL: https://www.myscale.com/blog/faiss-vs-chroma-vector-storage-battle/
Top Techniques to Manage Context Lengths in LLMs
RAG, truncation, sliding window, compression, hybrid approaches for token optimization.
URL: https://agenta.ai/blog/top-6-techniques-to-manage-context-length-in-llms
Production Deployment Patterns
7 Best Practices for Deploying AI Agents in Production
Canary deployments, feature flags, automated rollback, monitoring strategies.
URL: https://ardor.cloud/blog/7-best-practices-for-deploying-ai-agents-in-production
Human-in-the-Loop for AI Agents: Best Practices
Approve/reject, edit state, review tool calls, failure escalation patterns.
URL: https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
How to Prevent Excessive Costs for Your AI Agents
Multi-level warnings, real-time enforcement, token optimization, caching strategies.
URL: https://dpericich.medium.com/how-to-prevent-excessive-costs-for-your-ai-agents-4f9623caf296
Microsoft: Taxonomy of Failure Modes in AI Agents
14 failure modes: system design flaws, inter-agent misalignment, task verification issues.
URL: https://www.microsoft.com/en-us/security/blog/2025/04/24/new-whitepaper-outlines-the-taxonomy-of-failure-modes-in-ai-agents/
Multi-Agent Frameworks
LangGraph Multi-Agent Systems
Graph-based workflows, state management, flexible control flows, human-in-loop integration.
URL: https://langchain-ai.github.io/langgraph/concepts/multi_agent/
CrewAI vs AutoGen: Framework Comparison
CrewAI for structured processes, AutoGen for exploratory problem-solving.
URL: https://www.helicone.ai/blog/crewai-vs-autogen
LangChain vs LlamaIndex: Detailed Comparison
LangChain for versatile AI pipelines, LlamaIndex for RAG and data retrieval.
URL: https://www.datacamp.com/blog/langchain-vs-llamaindex
Agent Orchestration Patterns in Multi-Agent Systems
Sequential, parallel, conditional patterns with performance comparisons.
URL: https://www.getdynamiq.ai/post/agent-orchestration-patterns
Governance & Compliance
NIST AI Risk Management Framework
Govern, Map, Measure, Manage functions for trustworthy AI development.
URL: https://www.nist.gov/itl/ai-risk-management-framework
ISO/IEC 42001: AI Management Systems
International standard for AI management with 38 controls, PDCA approach.
URL: https://www.iso.org/standard/42001
OWASP Top 10 2025 for LLM Applications
Security vulnerabilities including excessive autonomy, vector DB risks, prompt leakage.
URL: https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/
Production Case Studies
Wells Fargo: 245 Million Agent Interactions
Privacy-first architecture with 600+ AI use cases, zero PII exposure to LLM.
URL: https://venturebeat.com/ai/wells-fargos-ai-assistant-just-crossed-245-million-interactions-with-zero-humans-in-the-loop-and-zero-pii-to-the-llm
Wells Fargo Brings Agentic Era to Financial Services
Google Cloud Agentspace deployment across contract management, FX operations, customer service.
URL: https://cloud.google.com/blog/topics/financial-services/wells-fargo-agentic-ai-agentspace-empowering-workers
How Rely Health Deploys Healthcare AI Solutions 100× Faster
100× debugging improvement, doctors' follow-up times cut by 50%, care navigator expansion.
URL: https://www.vellum.ai/blog/how-relyhealth-deploys-healthcare-ai-solutions-faster-with-vellum
Note on Research Methodology
Sources were selected based on: (1) technical credibility (peer-reviewed papers, established frameworks, production deployments), (2) recency (2024-2025 research prioritized for current best practices), (3) practical applicability (production-proven patterns over theoretical approaches), and (4) empirical evidence (benchmarks, case studies, measured outcomes). All URLs were validated as accessible as of January 2025. Research conducted between October 2024 and January 2025.