Agentic Coding, Plain and Spicy
Agents don’t hallucinate less. They loop better.
The difference between a chatbot that writes code once and an agentic system that ships working code is four verbs: plan, act, reflect, recover.
The Problem: One-Shot LLMs Hit Walls
You’ve seen it. You ask Claude or GPT to build a feature. It writes 200 lines. You run it. Three import errors, one logic bug, and a silent failure in production. The model can’t see its own output, can’t run the tests, can’t read the logs. It guessed and walked away.
Non-agentic coding is like hiring a contractor who emails you blueprints, then ghosts. Agentic coding is hiring someone who shows up, measures twice, checks the pipes, and stays until the lights turn on.
The Idea: Plan → Act → Reflect → Recover
An agentic system doesn’t write code and vanish. It closes a loop—and here’s the crucial part: the loop isn’t code infrastructure, it’s prompting architecture.
Each phase is an instruction to the LLM, telling it what to observe, what to write, and how to respond:
- Plan: The agent writes its own PRD, architecture, and task breakdown. “I need to add a webhook route, wire the DB session, add validation middleware, write tests, and verify with Playwright.” This isn’t a human writing requirements—the agent is authoring its own plan in markdown.
- Act: Execute one task. Write code. Push a commit. Run pytest. Deploy. The agent doesn’t just suggest—it does.
- Reflect: Observe everything. Read test output, parse exceptions, interrogate the running app with MCP Playwright (click buttons, fill forms, take screenshots), check the JavaScript console, read server logs, grep for errors. If clarity is low, add more debug code as a temporary test harness. The agent creates its own instrumentation (a concrete sketch follows below).
- Recover: Diagnose what failed and generate the next action. “The DB port is wrong in .env—fix it.” “Import missing—add it.” “Button isn’t appearing—screenshot shows layout issue, adjust CSS.” The agent doesn’t escalate to a human—it fixes and retries.
Then loop. Repeat until all checks pass, or until the budget or iteration cap is hit.
This isn’t AGI magic. It’s structured persistence. The model gets to see its own work, learn from failures in real-time, and improve its approach—just like a developer would, but faster and without fatigue.
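To make the Reflect step concrete: here’s roughly what “interrogate the running app” can look like when the agent drives the browser itself. This sketch uses Playwright’s Python API directly (an MCP Playwright server exposes similar actions as tools); the URL and screenshot path are placeholders.

```python
# A minimal Reflect-step sketch: load the app, capture a screenshot,
# and collect JavaScript console errors for the agent to read.
from playwright.sync_api import sync_playwright


def reflect_on_ui(url: str = "http://localhost:8000") -> list[str]:
    """Screenshot the current UI state and return any console errors."""
    console_errors: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record anything the app logs to the browser console as an error.
        page.on(
            "console",
            lambda msg: console_errors.append(msg.text) if msg.type == "error" else None,
        )
        page.goto(url)
        page.screenshot(path="reflect_state.png", full_page=True)
        browser.close()
    return console_errors


if __name__ == "__main__":
    errors = reflect_on_ui()
    print("console errors:", errors or "none")
```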
Why Prompting-as-Architecture Is the Breakthrough
Here’s what makes this so different from traditional automation:
The agent writes its own spec. You don’t hand it a 47-point PRD. You give it a goal: “Build a webhook endpoint with validation and deploy it.” The agent breaks that down into subtasks, writes them into a task.md file, and ticks them off one by one. As it works, it updates the plan. It discovers edge cases you didn’t think of. It refactors its own approach mid-flight.
The agent instruments its own debugging. When a test fails with a cryptic error, the agent doesn’t throw up its hands. It adds console.log statements, injects debug routes into FastAPI, writes a temporary Python script to probe the database, or uses MCP Playwright to drive the browser and screenshot every state transition. It creates the observability it needs, uses it, then cleans it up.
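For a flavor of that injected debug code, here’s the kind of throwaway FastAPI route an agent might add and later delete. The route path and the values it dumps are illustrative, not part of any real app:

```python
# A temporary debug route the agent injects while diagnosing a failing
# webhook, then removes once the bug is found.
import os

from fastapi import FastAPI

app = FastAPI()


@app.get("/_debug/config", include_in_schema=False)
def debug_config() -> dict:
    """Show the settings the webhook handler actually sees at runtime."""
    return {
        "db_host": os.getenv("DB_HOST"),
        "db_port": os.getenv("DB_PORT"),
        "webhook_secret_set": bool(os.getenv("WEBHOOK_SECRET")),
    }
```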
The agent composes its own test harness. I watched Claude Code build a parallel test version of a PHP API in FastAPI—just so it could test components in isolation. It wrote a mock client, fired requests at both the real and test endpoints, compared outputs, identified the discrepancy, then threw the test harness away. It invented its own QA infrastructure on the fly, used it for ten minutes, and deleted it when done.
The loop is self-correcting, not pre-scripted. Traditional CI/CD fails fast and stops. Agentic loops fail fast and iterate. The agent sees “connection refused,” checks the .env file, fixes the port, reruns, sees “auth failed,” realizes the secret is base64-encoded, decodes it, reruns, sees green. Each failure is a clue, not a dead end.
What This Looks Like: Instructions, Not Infrastructure
You don’t need a framework. You need a prompt template that tells the agent how to loop. The “code” is just markdown the agent reads; a minimal skeleton might look like this (adapt the wording to your stack and tools):
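```markdown
# AGENTIC_LOOP.md (minimal skeleton)

## Goal
One sentence from the human, e.g. "Build a webhook endpoint with validation and deploy it."

## Loop: repeat until every check passes or the budget is spent
1. Plan: break the goal into subtasks in task.md and update the plan as you learn.
2. Act: do exactly one subtask: write the code, run the command, push the commit.
3. Reflect: run the tests, read the logs, drive the app with Playwright, and add
   temporary debug code if you cannot see what failed.
4. Recover: diagnose the failure, apply the smallest fix, and return to step 2.

## Guardrails
- Stop after 12 iterations and report what you tried and where you are blocked.
- Log every plan, action, and observation to the NDJSON journal.
- Only touch files in /workspace and only use the tools you have been given.
- When everything is green, delete any temporary debug code and open a summary PR.
```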
That’s it. You paste this into your agent’s system instructions, give it a goal, and let it run. The “code” is the prompt. The infrastructure is the LLM + tool access. The magic is that the agent authors its own process as it goes.
The Proof: Real Logs from Claude Code
I’ve run this pattern with Claude Code. Here’s a compressed timeline from a real session:
- 10:07 — Agent writes FastAPI route. Pushes code.
- 10:08 — Runs pytest. Three failures: missing import, wrong status code, DB connection error.
- 10:09 — Agent adds from app.db import get_session, changes 200 to 201.
- 10:10 — Reruns pytest. Two failures. DB still failing.
- 10:11 — Agent adds debug log to connection code. Reruns.
- 10:12 — Sees “connection refused” in logs. Agent checks .env file, finds DB_PORT is 5433 not 5432.
- 10:13 — Fixes port. Reruns pytest.
- 10:14 — All green. Agent commits: “Add webhook endpoint with validation.”
Seven minutes. Zero human intervention. The agent saw its own failures and recovered. That’s the loop in action.
Why This Is So Powerful
Agentic loops aren’t just faster automation. They’re a qualitative shift in how AI systems work. Here’s why this matters:
1. Agents Learn from Their Own Mistakes in Real-Time
Traditional systems fail once and stop. Agentic systems fail, analyze the failure, hypothesize a fix, try it, and repeat. Each iteration narrows the solution space. A human developer does this naturally—read the error, google it, try a fix, read the next error. Agents do it in seconds, not minutes, and without cognitive load or fatigue.
According to 2025 research, self-reflective agentic patterns outperform single-shot GPT-4 calls by 20–40% on complex coding tasks, precisely because they get multiple attempts to self-correct.
2. Agents Scale Through Composition, Not Brute Force
You don’t need one giant superintelligent model. You need small, composable loops. One agent writes code. Another runs tests. A third uses Playwright to verify the UI. A fourth checks performance. Each is simple—a 10-line prompt template. But chained together, they handle end-to-end delivery.
This is why Gartner predicts 33% of enterprise software will depend on agentic AI by 2028. It’s not because models got smarter—it’s because loops got composable.
3. Agents Create Their Own Observability
Here’s the part that blew my mind: the agent doesn’t wait for you to add logging. When it can’t diagnose a bug, it writes the debug code itself. It injects print statements, adds trace middleware, writes a test script that dumps internal state, runs it, reads the output, and then deletes the debug code when it’s done.
This is self-healing infrastructure. The agent instruments, observes, fixes, and cleans up—autonomously. Traditional monitoring tools are reactive (you configure them in advance). Agentic systems are generative (they create the observability they need, when they need it).
4. Agents Turn Failures into Training Data
When an agent solves a problem, it writes the solution into a test.md or playbook.md file. The next time it encounters a similar task, it reads the playbook first. It’s not learning in the ML sense—it’s accumulating institutional knowledge in markdown.
Over time, your agent builds a library of solved problems. “Last time I deployed to this server, I had to whitelist the IP first—let me check that now.” It becomes faster and more reliable with each loop, without retraining the model.
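For a sense of scale, a playbook entry is just a few lines of markdown. The specifics below are illustrative, borrowed from the session described in this post:

```markdown
## Deploying to the staging server
- Whitelist the runner's IP in the firewall before the first deploy.
- DB_PORT is 5433 on this box, not the default 5432.
- Rerun pytest after fixing the port; the connection error masks other failures.
```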
5. The Business Impact Is Compounding
Agentic loops turn LLMs from code suggesters into code shippers. The business impact is immediate and compounding:
- 10x faster iteration: An agent can do 5–10 test-fix cycles in the time it takes a human to do one. That’s not “10% faster”—it’s an order of magnitude.
- Lower cost per feature: One agent loop replaces 3–5 back-and-forth messages with a developer, plus the context-switching cost of stopping to debug, then resuming.
- Consistent quality: The agent can’t forget to run the linter, skip the tests, or deploy without checking logs. The loop enforces discipline.
- 24/7 availability: Agents don’t sleep, don’t take weekends, don’t burn out. You can queue 50 tasks Friday night and review 50 PRs Monday morning.
- Leverage for small teams: A 3-person startup can ship like a 15-person team. You’re not replacing developers—you’re giving each developer an autonomous assistant that handles the grunt work.
You still need humans for architecture, product decisions, edge cases, and judgment calls. But the agent handles the repetitive read-error-fix-rerun grind that burns 40% of a developer’s day. It’s like having a junior dev who never gets tired, never gets distracted, and learns your codebase with every loop.
6. Agentic Systems Are Auditable and Replayable
Because the loop is structured (plan → act → reflect → recover) and logged (every action, every outcome), you can replay any session. The agent writes an NDJSON journal with timestamps, tool calls, outputs, and decisions. If something goes wrong in production, you can trace exactly what the agent did, why it did it, and what it observed.
This is critical for regulated industries (finance, healthcare, government) where “the AI did it” isn’t an acceptable explanation. With agentic loops, you have a full audit trail—readable by humans, replayable for compliance.
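The journal itself can be tiny. Here’s a minimal sketch of a writer plus the kind of entries one reflect/recover cycle might produce; the file name and field names are assumptions, not a fixed schema:

```python
# Minimal NDJSON journal: one timestamped JSON object per line.
import json
from datetime import datetime, timezone

JOURNAL_PATH = "agent_journal.ndjson"  # assumed location


def journal(event: str, **fields) -> None:
    """Append one timestamped entry to the NDJSON journal."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    with open(JOURNAL_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# One reflect/recover cycle might journal something like:
journal("act", tool="pytest", command="pytest -q")
journal("reflect", outcome="2 failed", detail="connection refused")
journal("recover", hypothesis="DB_PORT mismatch in .env", next_action="fix port and rerun")
```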
Pitfalls to Watch
Agentic loops are powerful, but they’re not magic. Here’s what can go wrong and how to guard against it:
- Infinite loops: Always set a max-iteration budget (10–15 loops). If the agent hasn’t recovered by then, it’s stuck—escalate to a human with a summary of what it tried and where it’s blocked. A minimal sketch of this guardrail follows after this list.
- No observability: Log every plan, action, and outcome to NDJSON. You will need to debug why the agent made a weird decision at 3am. Timestamps, tool calls, and outcomes are non-negotiable.
- Tool sprawl and security: Limit the agent’s tools. If it can run rm -rf, curl arbitrary domains, or SSH into production, you’re one bad prompt away from disaster. Use MCP or similar to scope permissions (e.g., “you can read/write files in /workspace, run pytest, and call Playwright—nothing else”).
- Batch anti-pattern: Don’t make the agent do 100 things in parallel. Serial is better—do one task, reflect, do the next. Micro-batches of 3–10 tasks are a practical upper limit: observability drops and recovery paths explode when you parallelize too much.
- Overfitting to one codebase: If the agent writes too many playbooks specific to your current project, it won’t generalize. Balance project-specific knowledge (test.md) with reusable patterns (how to deploy with Playwright, how to debug CORS issues).
- Lack of human review: Agents are great at implementation, but they’re not architects. Review the plan before the agent spends 50 API calls on the wrong approach. Review the PR before it hits production. Agentic doesn’t mean unattended.
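As promised in the first bullet, here’s a minimal sketch of the iteration-budget guardrail. The step, check, and escalation hooks are plain callables you wire to your own agent runtime; none of the names here are a real API:

```python
# Wrap one agentic loop in a hard iteration budget so a stuck agent cannot loop forever.
from typing import Callable

MAX_ITERATIONS = 12  # hard cap


def run_with_budget(
    step: Callable[[], None],         # act: the agent takes its next action
    checks_pass: Callable[[], bool],  # reflect: are tests, logs, and UI checks green?
    escalate: Callable[[int], None],  # recover failed: hand off to a human with context
) -> bool:
    """Run the loop until the checks pass or the budget is spent."""
    for _ in range(MAX_ITERATIONS):
        step()
        if checks_pass():
            return True  # all green: stop looping, hand the PR to a human reviewer
    escalate(MAX_ITERATIONS)  # budget spent: summarize what was tried and stop
    return False
```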
Where to Go Next
Start small. Pick one repetitive, well-defined task: “run tests and fix import errors” or “generate a Playwright test for the login flow and verify it passes.” Write the prompt template (plan → act → reflect → recover). Give the agent access to the tools it needs (shell, code editor, Playwright). Set a loop budget. Watch it work (or fail). Read the logs. Tighten the prompt. Add guardrails.
Then compose. Once you trust one loop, chain multiple loops together:
- Agent A: Writes code and unit tests
- Agent B: Runs integration tests with Playwright
- Agent C: Checks performance (load tests, memory profiling)
- Agent D: Updates docs and writes changelog
- Agent E: Reviews all outputs and creates a summary PR
Each loop is simple. The composition is what makes it powerful. You’re not building one super-agent. You’re orchestrating a team of specialists, each with a narrow, clear job.
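The orchestration layer for that team can be very small. The sketch below runs the specialists in sequence, handing each one the goal plus everything the previous agents produced; the prompt file names and the run_agent hook are assumptions, not a real API:

```python
# A sequential composition of specialist loops, each defined by its own prompt file.
from typing import Callable

PIPELINE = [
    ("implementer", "prompts/write_code_and_unit_tests.md"),
    ("integration", "prompts/playwright_integration_tests.md"),
    ("performance", "prompts/load_and_memory_checks.md"),
    ("docs", "prompts/update_docs_and_changelog.md"),
    ("reviewer", "prompts/review_and_open_summary_pr.md"),
]


def run_pipeline(goal: str, run_agent: Callable[[str, str], str]) -> str:
    """Run each specialist loop in order, accumulating context as it goes."""
    context = goal
    for name, prompt_path in PIPELINE:
        result = run_agent(prompt_path, context)  # one bounded loop per specialist
        context += f"\n\n## {name} result\n{result}"
    return context
```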
Then scale. Once you’ve proven the pattern on one team or one project, replicate it. New project? Copy the prompt templates. New developer? They inherit 6 months of institutional knowledge encoded in playbooks. New use case? Adapt the reflect step (maybe you’re not testing code—you’re validating marketing copy or analyzing customer support tickets). The loop pattern is universal.
The Bigger Picture
Agentic coding isn’t a new paradigm. It’s an old idea—test-driven development, CI/CD, self-healing systems, observability-driven debugging—repackaged with LLMs that can read, write, reason, and persist through failure.
The breakthrough is simple but profound: give the model a chance to see its own work and iterate. That’s it. That’s the game.
One-shot LLMs are impressive demos. Agentic loops are production systems.
Plain and spicy: agents loop until they’re done. Everything else is just configuration.
Ready to build your first loop? Start with the AGENTIC_LOOP.md template above. Give your agent a small, bounded task. Let it run. Read the logs. Adjust the prompts. Ship it. Then tell me what happened—I want to hear what worked, what broke, and what surprised you.