Observability for Agentic Systems: What to Log, How to Redact, How to Debug
Created on 2025-09-30 08:31
Published on 2025-10-02 01:30
It’s 3am. Your AI agent just approved a $50,000 refund to the wrong customer. You get paged. You open the logs. There are none.
No record of what prompt the agent saw. No trace of which tool it called. No explanation for why it chose that action. Just a catastrophic outcome and zero forensics.
This is the observability crisis in agentic AI.
Agents aren’t like traditional software. They don’t follow fixed code paths. They reason, they decide, they adapt. Every decision is a black box unless you explicitly log it. And when things go wrong—and they will—you need to know exactly what the agent was thinking, what it tried, and why it failed.
In 2025, observability isn’t optional for AI agents. It’s the difference between a system you can trust and a liability you can’t explain. IBM reports that AI agent observability is now “essential” for reliability, and Gartner projects that by 2028, 90% of production AI agents will require dedicated observability platforms (https://www.ibm.com/think/insights/ai-agent-observability).
Let me show you what to log, how to protect sensitive data, and how to debug when your agent inevitably does something unexpected.
The Problem: Agents Are Non-Deterministic Black Boxes
Traditional software is predictable: same input → same code path → same output. You log the input, the output, and maybe some checkpoints in between. Debugging is tracing execution through known logic.
AI agents are different:
- Non-deterministic: Same prompt can produce different decisions (model sampling, context changes, tool availability).
- Reasoning-based: The agent doesn’t execute pre-written logic—it generates a plan on the fly.
- Tool-using: Calls external APIs, databases, browsers. Each tool call adds uncertainty.
- Multi-agent: Supervisor delegates to workers. Decisions cascade across agents. One agent’s output is another’s input.
- Context-dependent: What the agent “knows” depends on what’s in its context window (conversation history, retrieved docs, tool results).
You can’t debug this with print("got here"). You need structured, comprehensive logging that captures the agent’s reasoning process, not just its outputs.
According to OpenTelemetry’s 2025 AI agent observability standards, “without proper monitoring, tracing, and logging mechanisms, diagnosing issues, improving efficiency, and ensuring reliability in AI agent-driven applications will be challenging” (https://opentelemetry.io/blog/2025/ai-agent-observability/).
What to Log: The 7 Critical Dimensions
Here’s what you need to capture for every agent interaction:
1. Prompts (What the Agent Saw)
Log the full system prompt and user input that the agent received. Not just “user asked to deploy”—the entire prompt, including:
- System instructions
- Tool descriptions
- Conversation history (if any)
- Retrieved context (RAG results, docs)
- User’s exact message
Why: You need to know what the agent “saw” to understand why it made a certain decision. If the prompt was incomplete or misleading, that’s your root cause.
Example log entry:
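A hypothetical NDJSON line (IDs and values are illustrative; adapt field names to your own schema):

```json
{"event": "prompt", "run_id": "r-117", "agent_id": "supervisor", "timestamp": "2025-09-30T14:02:11Z", "system_prompt": "You are a deployment supervisor...", "tools": ["run_tests", "build_image", "deploy_k8s"], "history_turns": 3, "retrieved_context": ["docs/deploy-runbook.md#rollbacks"], "user_message": "Deploy the app"}
```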
2. Decisions & Reasoning (Why It Did What It Did)
Log the agent’s chain-of-thought or reasoning trace. Modern models can emit reasoning steps (like OpenAI’s o1 reasoning tokens). Capture them.
If the agent says “I need to run tests before deploying,” log that. If it says “User didn’t specify environment, assuming production,” log that. This is the decision narrative.
Why: When the agent makes a bad call, you need to know why it thought that was the right move. Was the reasoning flawed? Did it misunderstand the prompt? Did it lack information?
Example:
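A hypothetical decision entry (values illustrative):

```json
{"event": "decision", "run_id": "r-117", "agent_id": "supervisor", "timestamp": "2025-09-30T14:02:14Z", "reasoning": "User did not specify an environment; deploy.yaml defaults to staging. Running tests before any deploy.", "chosen_action": "run_tests", "alternatives_considered": ["ask_user_for_environment", "deploy_directly"]}
```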
3. Tool Calls (What Actions It Took)
Log every tool invocation:
- Tool name
- Arguments (what the agent passed to the tool)
- Timestamp
- Trace ID (to correlate with tool results)
Why: Tools are where agents interact with the real world. If the agent called the wrong API, passed bad args, or skipped a critical tool, this is where you’ll see it.
Example:
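A hypothetical tool-call entry (values illustrative):

```json
{"event": "tool_call", "run_id": "r-117", "agent_id": "test_worker", "trace_id": "t-0042", "timestamp": "2025-09-30T14:02:20Z", "tool": "run_tests", "arguments": {"suite": "integration", "timeout_s": 600}}
```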
4. Tool Results (What Happened)
Log the outcome of every tool call:
- Status (success / failure)
- Output (summary or full result, depending on verbosity)
- Latency (how long it took)
- Error message (if failed)
Why: The agent’s next decision depends on tool results. If a tool returned an error but the agent didn’t handle it, that’s your bug.
Example:
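A hypothetical tool-result entry, correlated to the call above by trace_id (values illustrative):

```json
{"event": "tool_result", "run_id": "r-117", "agent_id": "test_worker", "trace_id": "t-0042", "timestamp": "2025-09-30T14:03:05Z", "tool": "run_tests", "status": "success", "latency_ms": 45000, "output_summary": "128 passed, 0 failed", "error": null}
```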
5. Agent State (What It Knew)
Log snapshots of the agent’s internal state at key points:
- Active variables (e.g., environment=production)
- Memory contents (conversation history, facts accumulated)
- Context window size
- Flags (e.g., tests_passed=true)
Why: State bugs are common. The agent thinks it’s in dev but it’s actually in prod. It “forgot” a critical piece of info because context got truncated. Logging state lets you catch these issues.
Example:
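A hypothetical state snapshot (values illustrative):

```json
{"event": "state_snapshot", "run_id": "r-117", "agent_id": "supervisor", "timestamp": "2025-09-30T14:03:06Z", "state": {"environment": "staging", "tests_passed": true, "build_id": null}, "context_window_tokens": 9412, "memory_items": 14}
```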
6. Errors & Failures (What Went Wrong)
Log all errors, retries, and failure modes:
- Exception type and message
- Stack trace
- Retry attempts
- Fallback actions (if agent recovered)
Why: Errors are where debugging starts. You need to see what failed, how many times the agent retried, and whether it handled the error gracefully or spiraled.
Example:
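A hypothetical error entry (values illustrative):

```json
{"event": "error", "run_id": "r-117", "agent_id": "deploy_worker", "trace_id": "t-0044", "timestamp": "2025-09-30T14:05:41Z", "exception": "K8sApiTimeout", "message": "Connection to Kubernetes API timed out after 30s", "retry_attempt": 2, "max_retries": 3, "fallback": null}
```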
7. Outcomes (What the User Got)
Log the final result returned to the user:
- Response text
- Status (success / failure)
- Session summary
- Metrics (total tokens, total cost, total latency)
Why: This is your success/failure signal. Did the user get what they asked for? If not, you trace backward from here through the decision chain.
Example:
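A hypothetical session-end entry (values illustrative):

```json
{"event": "session_end", "run_id": "r-117", "timestamp": "2025-09-30T14:05:55Z", "status": "failure", "response_text": "Deployment failed: the Kubernetes API was unreachable after 3 attempts.", "summary": "tests passed, build succeeded, deploy failed", "total_tokens": 18230, "total_cost_usd": 0.042, "total_latency_ms": 224000}
```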
Metadata: The Glue That Makes Logs Queryable
Every log entry needs these fields:
- run_id: Unique ID for this agent invocation (links all events in one session)
- session_id: If multi-turn conversation, links across runs
- agent_id: Which agent (supervisor, test_worker, deploy_worker)
- user_id: Who triggered this (for access control, audit)
- trace_id: Links distributed calls (agent A calls agent B)
- timestamp: ISO 8601 UTC (critical for timeline reconstruction)
- environment: dev / staging / production
Without these, your logs are unsearchable. With them, you can query: “Show me all production deployments where test_agent failed in the last 24 hours.”
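With those fields in place, that question becomes a one-line query (a jq sketch over NDJSON files; field names follow the examples above):

```bash
# "Production runs where test_agent failed in the last 24 hours" ~ today's log files
jq -c 'select(.environment == "production" and .agent_id == "test_agent"
              and .event == "tool_result" and .status == "failure")' logs/2025-09-30*.ndjson
```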
The PII Challenge: How to Redact Sensitive Data
AI agents process real user data: names, emails, credit cards, health records. You need to log decisions, but you can’t log PII.
According to Kong’s 2025 guide on PII sanitization for agentic AI, “LLMs can memorize and regurgitate this data in unrelated contexts, especially if that data appears frequently in your prompts or agent memory” (https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai).
Redaction Strategy
1. Detect PII Before Logging
Use regex patterns and NER (named entity recognition) models to identify:
- Emails: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
- Phone: \b\d{3}[-.]?\d{3}[-.]?\d{4}\b
- SSN: \b\d{3}-\d{2}-\d{4}\b
- Credit cards: \b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b
- Names, addresses (use NER: spaCy, AWS Comprehend, Azure Text Analytics)
2. Replace with Tokens
Don’t just strip PII—replace it with pseudonymous tokens so logs are still readable:
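For example (hypothetical values):

```text
Before: "Refund $50 to jane.doe@example.com (phone 555-867-5309)"
After:  "Refund $50 to EMAIL_42 (phone PHONE_7)"
```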
Store a mapping (EMAIL_42 → jane.doe@example.com) in a separate, encrypted PII vault with strict access controls. Most logs never need the mapping. Only incident investigations require it.
3. Separate PII Vault
The token→value mapping should live outside the log pipeline entirely: a dedicated, encrypted store with its own access policy, its own (short) retention window, and its own audit log. Application logs carry only tokens; resolving a token back to a real value is a privileged, logged operation.
4. Log Access to Logs
For compliance (GDPR, HIPAA), you need to track who viewed which logs. If an engineer pulls logs containing PII tokens, log that access event.
Example redacted log:
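A hypothetical redacted entry (tokens illustrative):

```json
{"event": "prompt", "run_id": "r-118", "agent_id": "support_agent", "timestamp": "2025-09-30T15:10:02Z", "user_message": "Please update the shipping address for NAME_3 to ADDRESS_9 and send confirmation to EMAIL_42", "pii_tokens": ["NAME_3", "ADDRESS_9", "EMAIL_42"]}
```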
How to Debug: The 5-Step Investigation Flow
When an agent misbehaves, here’s how you trace it:
Step 1: Alert Triggers
Set up monitors for:
- Error rate: >5% of runs fail
- Latency: P95 response time >10s
- Cost spike: Token usage 2× above baseline
- Policy violations: Agent called a forbidden tool, accessed restricted data
- Outcome failures: User goal not achieved (e.g., deployment didn’t complete)
When the alert fires, you get: run_id, agent name, timestamp, metric value.
Step 2: Find the Session
Query logs for that run_id:
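With file-based NDJSON, grep plus jq is enough (a sketch; substitute the run_id from your alert):

```bash
# Pull every event for the run and order it by time
grep '"run_id": "r-347"' logs/*.ndjson | jq -s 'sort_by(.timestamp)'
```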
Or use your observability platform (Datadog, Langfuse, Azure AI Foundry). According to Datadog’s 2025 announcement, their “AI Agent Monitoring instantly maps each agent’s decision path–inputs, tool invocations, calls to other agents and outputs–in an interactive graph” (https://www.datadoghq.com/about/latest-news/press-releases/datadog-expands-llm-observability-with-new-capabilities-to-monitor-agentic-ai-accelerate-development-and-improve-model-performance/).
Step 3: Trace the Decision Path
Reconstruct the timeline:
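For the alert above, the sorted events might read like this (condensed; values illustrative):

```text
14:02:11 prompt        supervisor      "Deploy the app to production"
14:02:20 tool_call     test_worker     run_tests
14:03:05 tool_result   test_worker     success (128 passed)
14:03:12 tool_call     build_worker    build_image
14:04:40 tool_result   build_worker    success (image v1.4.2)
14:04:45 tool_call     deploy_worker   deploy_k8s
14:05:15 error         deploy_worker   K8sApiTimeout (retry 1/3)
14:05:45 error         deploy_worker   K8sApiTimeout (retry 2/3)
14:06:15 error         deploy_worker   K8sApiTimeout (retry 3/3)
14:06:16 session_end   supervisor      failure: "Kubernetes API unreachable"
```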
Now you can see: tests passed, build succeeded, but K8s API was down. The agent retried 3 times (correct behavior). Root cause: infrastructure, not agent logic.
Step 4: Compare to Baseline
Pull logs from successful runs with the same prompt:
- Did successful runs use a different tool?
- Was the prompt subtly different?
- Did they have more context (longer history)?
- Was the timing different (K8s down during this window)?
Anomaly detection tools (Arize Phoenix, Fiddler) can automate this: “This run’s tool call pattern differs from 98% of past runs.”
Step 5: Reproduce & Fix
The goal is deterministic replay: any loop should be replayable from journal + artifacts + snapshots.
Store enough detail that you can replay the run locally:
- Full prompt
- Tool outputs (not just summaries)
- Agent state at each step
- Artifacts (screenshots, HAR files, console logs)
Then:
- Replay the run with the same inputs
- See if it fails the same way (confirms root cause)
- Test your fix (e.g., better retry logic, fallback to secondary K8s cluster)
- Replay again to verify the fix works
- Deploy the fix
Structured Logging: The NDJSON Standard
Your logs should be newline-delimited JSON (NDJSON), not plain text or CSV. Here’s why:
- One event per line: Easy to append (no array closing bracket)
- Streamable: Process logs as they’re written, no buffering
- Tool-friendly: jq, grep, awk work natively
- Structured: Every field is typed (strings, numbers, booleans, objects)
- Queryable: Easy to filter by field (jq 'select(.status == "error")')
OpenTelemetry’s 2025 AI agent semantic conventions standardize this format, ensuring interoperability across tools (https://opentelemetry.io/blog/2025/ai-agent-observability/).
Sample Log Schema
Here’s a minimal NDJSON schema for agent logs:
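A sketch of the shared envelope (field names are suggestions, not a standard; in real logs each event is a single line):

```json
{
  "event": "prompt | decision | tool_call | tool_result | state_snapshot | error | session_end",
  "run_id": "r-117",
  "session_id": "s-88",
  "agent_id": "supervisor",
  "user_id": "u-204",
  "trace_id": "t-0042",
  "timestamp": "2025-09-30T14:02:11Z",
  "environment": "production"
}
```

Event-specific fields (user_message, reasoning, tool, arguments, status, latency_ms, total_cost_usd, etc.) ride alongside this envelope, as in the examples above.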
Real-World Example: Debugging a Deployment Failure
Let’s walk through a real debugging session using structured logs.
Scenario
User reports: “I asked the agent to deploy to production, but it deployed to staging instead.”
Investigation
Step 1: Find the session
User provides timestamp: 2025-09-30 14:32 UTC. Query logs:
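For example (a jq sketch over that day's NDJSON file):

```bash
# Find prompt events in the 14:32 window; the matching line carries the run_id
jq -c 'select(.event == "prompt" and (.timestamp | startswith("2025-09-30T14:32")))' logs/2025-09-30.ndjson
```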
Found run_id: r-501.
Step 2: Pull all events for that run
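For example (a sketch; the projection gives a skimmable timeline):

```bash
grep '"run_id": "r-501"' logs/2025-09-30.ndjson \
  | jq -s 'sort_by(.timestamp) | .[] | {timestamp, event, agent_id, tool, status}'
```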
Step 3: Reconstruct timeline
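Sorted by timestamp, the run's events look roughly like this (condensed; values illustrative, matching the root cause below):

```text
14:32:05 prompt        supervisor      "deploy the app"
14:32:07 decision      supervisor      "no environment specified; checking deploy.yaml for the default"
14:32:08 tool_call     supervisor      read_file("deploy.yaml")
14:32:08 tool_result   supervisor      success: default_env: staging
14:32:09 decision      supervisor      "using default_env=staging"
14:32:12 tool_call     deploy_worker   deploy_k8s(env="staging")
14:33:40 tool_result   deploy_worker   success
14:33:41 session_end   supervisor      success: "Deployed to staging"
```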
Root cause found: User said “deploy the app” without specifying environment. Agent checked deploy.yaml, which had default_env: staging. Agent deployed to staging (correctly following the config, but not user intent).
Fix: Update agent prompt to explicitly ask: “Which environment? (production/staging)” if user doesn’t specify. Or change deploy.yaml to default to production (risky!).
This took 5 minutes because we had structured logs. Without them, it would be “the agent is broken, no idea why.”
Sample Log Timeline: Normal Deployment
Here’s a complete NDJSON log sequence for a successful deployment (15 events):
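A condensed reconstruction of what such a sequence can look like (run IDs, timestamps, and intermediate values are illustrative; the totals match the summary below):

```json
{"event": "prompt", "run_id": "r-512", "agent_id": "router", "timestamp": "2025-09-30T14:00:00Z", "user_message": "Deploy the app to production"}
{"event": "decision", "run_id": "r-512", "agent_id": "router", "timestamp": "2025-09-30T14:00:02Z", "reasoning": "Deployment request", "chosen_action": "route_to_supervisor"}
{"event": "prompt", "run_id": "r-512", "agent_id": "supervisor", "timestamp": "2025-09-30T14:00:03Z", "user_message": "Deploy the app to production"}
{"event": "decision", "run_id": "r-512", "agent_id": "supervisor", "timestamp": "2025-09-30T14:00:06Z", "reasoning": "Plan: test -> build -> deploy -> verify", "chosen_action": "run_tests"}
{"event": "tool_call", "run_id": "r-512", "agent_id": "test_worker", "trace_id": "t-01", "timestamp": "2025-09-30T14:00:10Z", "tool": "run_tests", "arguments": {"suite": "integration"}}
{"event": "tool_result", "run_id": "r-512", "agent_id": "test_worker", "trace_id": "t-01", "timestamp": "2025-09-30T14:01:05Z", "tool": "run_tests", "status": "success", "latency_ms": 55000, "output_summary": "128 passed, 0 failed"}
{"event": "decision", "run_id": "r-512", "agent_id": "supervisor", "timestamp": "2025-09-30T14:01:06Z", "reasoning": "Tests passed", "chosen_action": "build_image"}
{"event": "tool_call", "run_id": "r-512", "agent_id": "build_worker", "trace_id": "t-02", "timestamp": "2025-09-30T14:01:10Z", "tool": "build_image", "arguments": {"tag": "v1.4.2"}}
{"event": "tool_result", "run_id": "r-512", "agent_id": "build_worker", "trace_id": "t-02", "timestamp": "2025-09-30T14:02:15Z", "tool": "build_image", "status": "success", "latency_ms": 65000}
{"event": "tool_call", "run_id": "r-512", "agent_id": "deploy_worker", "trace_id": "t-03", "timestamp": "2025-09-30T14:02:20Z", "tool": "deploy_k8s", "arguments": {"env": "production", "image": "v1.4.2"}}
{"event": "tool_result", "run_id": "r-512", "agent_id": "deploy_worker", "trace_id": "t-03", "timestamp": "2025-09-30T14:02:55Z", "tool": "deploy_k8s", "status": "success", "latency_ms": 35000}
{"event": "tool_call", "run_id": "r-512", "agent_id": "deploy_worker", "trace_id": "t-04", "timestamp": "2025-09-30T14:02:58Z", "tool": "verify_health", "arguments": {"env": "production"}}
{"event": "tool_result", "run_id": "r-512", "agent_id": "deploy_worker", "trace_id": "t-04", "timestamp": "2025-09-30T14:03:08Z", "tool": "verify_health", "status": "success", "output_summary": "200 OK from /healthz"}
{"event": "decision", "run_id": "r-512", "agent_id": "supervisor", "timestamp": "2025-09-30T14:03:10Z", "reasoning": "All steps complete", "chosen_action": "respond_to_user"}
{"event": "session_end", "run_id": "r-512", "timestamp": "2025-09-30T14:03:13Z", "status": "success", "total_tokens": 18230, "total_cost_usd": 0.042, "total_latency_ms": 193000}
```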
What this shows: Complete decision path from user request → routing → supervisor planning → test → build → deploy → verification → response. Every agent call, every tool invocation, every decision. Total cost: $0.042. Total time: 3 minutes 13 seconds.
Tools Landscape: What’s Available in 2025
You don’t have to build logging infrastructure from scratch. Here are the leading platforms:
Enterprise Solutions
Azure AI Foundry (https://azure.microsoft.com/en-us/blog/agent-factory-top-5-agent-observability-best-practices-for-reliable-ai/) Unified solution for evaluating, monitoring, tracing, and governing AI systems end-to-end. Integrates with Azure OpenAI, Azure ML, and Bedrock Agents.
Datadog AI Agent Monitoring (https://www.datadoghq.com/about/latest-news/press-releases/datadog-expands-llm-observability-with-new-capabilities-to-monitor-agentic-ai-accelerate-development-and-improve-model-performance/) Interactive decision graph showing inputs, tool invocations, agent-to-agent calls, outputs. Real-time metrics (cost, latency, error rate) + alerting.
Dynatrace AI Observability (https://www.dynatrace.com/news/blog/ai-agent-observability-amazon-bedrock-agents-monitoring/) Specializes in Amazon Bedrock Agents monitoring. Traces agentic workflows, detects drift, optimizes at scale.
IBM Watson AI Observability (https://www.ibm.com/think/insights/ai-agent-observability) Continuous monitoring, tracing, and logging across the AI agent lifecycle—development, testing, deployment, operation.
Open Source / Developer-Focused
Langfuse (https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse) Open-source, self-hostable. Detailed tracing of agent interactions, tool calls, and LLM outputs. Analytics and evaluation built-in. Great for teams prioritizing data control.
AgentOps (https://www.akira.ai/blog/langsmith-and-agentops-with-ai-agents) Visual timeline of agent events. Multi-agent workflow visualization. SDK instruments agents to log every event (prompts, LLM calls, tool invocations, errors).
Arize Phoenix (https://www.getmaxim.ai/articles/top-5-tools-to-monitor-ai-agents-in-2025/) Open-source. Drift detection, explainable AI, root-cause analysis. Ideal for technical teams with hybrid deployments.
OpenTelemetry for AI Agents (https://opentelemetry.io/blog/2025/ai-agent-observability/) Emerging standard for AI agent observability. Semantic conventions for prompts, tool calls, agent decisions. Works with any observability backend (Prometheus, Grafana, Jaeger, etc.).
Specialized Tools
Fiddler AI (https://www.fiddler.ai/blog/agentic-observability-development) Focus: monitoring and controlling agentic applications. Structured log ingestion from AWS Bedrock, LangGraph, custom agents. Real-time quality checks (hallucinations, PII leaks, policy violations).
LangSmith (https://www.akira.ai/blog/langsmith-and-agentops-with-ai-agents) Built by LangChain team. Native integration with LangChain/LangGraph agents. Tracing, prompt versioning, evaluation datasets.
Implementation Guide: Adding Observability to Your Agent
Here’s a practical 5-step plan to instrument your agent system:
Step 1: Choose Your Log Format & Storage
- Format: NDJSON (one event per line)
- Storage: File system (for dev), S3 / Azure Blob (for production), or a log aggregator (Datadog, Splunk, Elasticsearch)
- Retention: 30 days for hot logs (fast queries), 1 year for cold logs (compliance)
Step 2: Instrument Your Agent Code
Add logging calls at every decision point. Example (Python):
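A minimal sketch, assuming a simple file-based NDJSON sink and a hypothetical log_event helper (not tied to any particular agent framework):

```python
import json
import time
import uuid
from datetime import datetime, timezone

RUN_ID = f"r-{uuid.uuid4().hex[:8]}"
LOG_PATH = "agent.ndjson"

def log_event(event: str, **fields):
    """Append one NDJSON line carrying the shared metadata every entry needs."""
    entry = {
        "event": event,
        "run_id": RUN_ID,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "environment": "production",
        **fields,  # event-specific fields: prompt text, reasoning, tool, status, ...
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Instrument each decision point in the agent loop:
log_event("prompt", agent_id="supervisor", user_message="Deploy the app")
log_event("decision", agent_id="supervisor",
          reasoning="Run tests before deploying", chosen_action="run_tests")

start = time.time()
log_event("tool_call", agent_id="test_worker", trace_id="t-0001",
          tool="run_tests", arguments={"suite": "integration"})
# ... invoke the actual tool here ...
log_event("tool_result", agent_id="test_worker", trace_id="t-0001",
          tool="run_tests", status="success",
          latency_ms=int((time.time() - start) * 1000),
          output_summary="128 passed, 0 failed")
```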
Step 3: Add PII Redaction
Intercept logs before writing. Apply regex patterns:
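A regex-only sketch (names and addresses would still need an NER pass; the patterns mirror the ones listed earlier):

```python
import json
import re

# Order matters: match the most specific patterns first.
PII_PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}

def redact(text: str, vault: dict) -> str:
    """Replace PII with stable tokens; keep the token -> value mapping in `vault`."""
    for label, pattern in PII_PATTERNS.items():
        def _sub(match):
            token = f"{label}_{len(vault) + 1}"
            vault[token] = match.group(0)  # in production: write to the encrypted PII vault
            return token
        text = pattern.sub(_sub, text)
    return text

vault = {}  # in production: a separate, encrypted store, not an in-memory dict
entry = {"event": "prompt", "user_message": "Refund jane.doe@example.com, card 4111 1111 1111 1111"}
entry["user_message"] = redact(entry["user_message"], vault)
print(json.dumps(entry))
# {"event": "prompt", "user_message": "Refund EMAIL_2, card CARD_1"}
```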
Step 4: Set Up Monitoring & Alerts
Query your logs to generate metrics:
- Error rate: Count of event=error / total runs
- P95 latency: 95th percentile of total_latency_ms from session_end events
- Cost per run: total_cost_usd from session_end
- Tool success rate: tool_result.status=success / total tool_call events
Pipe these into Prometheus, Grafana, or your observability platform. Set thresholds and create alerts.
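A couple of jq sketches over a day's NDJSON file show the idea (field names follow the schema above):

```bash
# Error rate: failed sessions / total sessions
jq -s '[.[] | select(.event == "session_end")]
       | (map(select(.status == "failure")) | length) / length' logs/today.ndjson

# Cost per run
jq -c 'select(.event == "session_end") | {run_id, total_cost_usd}' logs/today.ndjson
```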
Step 5: Build a Debug Dashboard
Create a simple web UI that:
- Lists recent runs (run_id, timestamp, status, latency, cost)
- Drills down into a run (shows full event timeline)
- Filters by agent, status, user, date range
- Highlights errors in red
Tools like Langfuse and AgentOps provide this out-of-the-box. Or build your own with a static HTML page + JavaScript that reads NDJSON.
Anti-Patterns to Avoid
- Logging only errors: You need successful runs for baseline comparison. Log everything.
- Unstructured logs: Plain text (“Agent called test agent”) is unsearchable. Use NDJSON.
- Missing trace IDs: Multi-agent systems are impossible to debug without linking events across agents.
- No PII redaction: You’ll fail compliance audits. Redact at ingestion time, not retroactively.
- Logging raw tool outputs: A 500-line API response bloats your logs. Log summaries. Store full output in artifacts if needed for replay.
- No retention policy: Logs grow forever. Archive old logs (S3 Glacier, Azure Archive). Hot logs: 30 days. Cold logs: 1 year.
- Ignoring cost: If you log every token, log storage costs can exceed your LLM costs. Be strategic: full prompt for errors, summary for successes.
The Business Case for Observability
Observability isn’t just a debugging tool. It’s a business enabler:
- Trust & compliance: Regulated industries (finance, healthcare) require audit trails showing why AI made each decision. According to a Fiddler AI guide, “This level of traceability is only possible when logs are enriched with high-quality, structured metadata” (https://www.fiddler.ai/blog/monitoring-controlling-agentic-applications).
- Cost control: Without logging, you can’t see which prompts or tools are expensive. Logs let you optimize (cache frequent queries, use cheaper models for simple tasks).
- Quality improvement: Logs become training data for evals. “This prompt caused 20 failures” → update prompt, add guardrails.
- Incident response: MTTR (mean time to recovery) drops from hours to minutes when you have full traces.
- User trust: When users report “the agent did something weird,” you can show them exactly what happened and why. Transparency builds trust.
Research from 2025 shows that multi-agent systems require 26× more monitoring resources than single-agent applications (https://www.getmaxim.ai/articles/top-5-tools-to-monitor-ai-agents-in-2025/). This isn’t optional complexity—it’s necessary infrastructure for production AI.
The Bigger Picture
Observability for agentic systems is where DevOps was 15 years ago: transitioning from “we’ll figure it out when it breaks” to “we instrument everything and catch issues before users do.”
The difference: agents fail in new ways. They don’t crash—they hallucinate. They don’t throw exceptions—they misinterpret prompts. They don’t have stack traces—they have decision chains.
Traditional logging (input → output) isn’t enough. You need to log reasoning, state, and intent. You need to redact PII while preserving debuggability. You need to trace decisions across multiple agents. You need structured logs that are queryable, replayable, and compliant.
The teams that master this will ship reliable, auditable, trustworthy AI agents. The teams that skip it will spend months debugging black-box failures.
Observability isn’t optional. It’s the foundation of production agentic AI.
Ready to instrument your agents? Start with the NDJSON schema above. Add logging to one agent. Capture prompts, decisions, tool calls, and results. Redact PII. Query the logs when something breaks. You’ll never go back to blind debugging.
References:
- Azure AI Foundry Observability: https://azure.microsoft.com/en-us/blog/agent-factory-top-5-agent-observability-best-practices-for-reliable-ai/
- OpenTelemetry AI Agent Standards: https://opentelemetry.io/blog/2025/ai-agent-observability/
- IBM AI Agent Observability: https://www.ibm.com/think/insights/ai-agent-observability
- Datadog AI Agent Monitoring: https://www.datadoghq.com/about/latest-news/press-releases/datadog-expands-llm-observability-with-new-capabilities-to-monitor-agentic-ai-accelerate-development-and-improve-model-performance/
- Kong PII Sanitization for Agentic AI: https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai
- Fiddler AI Monitoring Guide: https://www.fiddler.ai/blog/monitoring-controlling-agentic-applications
- Dynatrace Bedrock Agents Monitoring: https://www.dynatrace.com/news/blog/ai-agent-observability-amazon-bedrock-agents-monitoring/
- Langfuse AI Agent Observability: https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse
- Top 5 AI Agent Monitoring Tools 2025: https://www.getmaxim.ai/articles/top-5-tools-to-monitor-ai-agents-in-2025/