Your AI Can Code. You Just Don’t Know How to Drive It.

A two-level field guide to AI coding agents — from your first session to running 5 in parallel.

Scott Farrell
AI Strategy & Architecture | leverageai.com.au

TL;DR

The problem isn’t the AI — a controlled study found experienced developers were 19% slower with AI tools, largely because they skipped planning and verification.¹
The fix is structured driving — plan first, define tests, loop until green. The Claude Code creator reports 2-3x quality improvement from verification loops alone.²
Fat AGENTS.md files are now an evidence-backed anti-pattern — peer-reviewed research shows they reduce success rates while increasing costs 20%+.³

The Uncomfortable Truth

84% of developers now use or plan to use AI coding tools.⁴ But only 33% fully trust the code that comes out.⁵ Positive sentiment for AI tools dropped from 70%+ to 60% in the latest Stack Overflow survey.⁶

Something isn’t working.

Here’s what is working: at Anthropic, 90% of Claude Code’s own codebase is now written by Claude Code.⁷ Theo Browne runs 6 parallel instances and hasn’t opened an IDE in days.⁸ Addy Osmani on Google’s Chrome team writes specs before a single line of code is generated.⁹

The gap between “AI can’t code” and “AI just built my app” isn’t the model. It’s how you drive it.

19% slower

Experienced developers with AI tools — when they skip structured workflows¹

METR’s randomised controlled trial found experienced open-source developers were 19% slower with AI tools — yet believed they were 20% faster.¹ The study identified the cause: “imperfect use of tools, notably overly simple prompts” and “limited familiarity with AI interfaces.”¹⁰

That’s not a tool problem. That’s a driving problem.

What follows is the fix. Two levels: the essentials to get started tonight, and the advanced moves for when you’re ready to run multiple instances in parallel.

Level 1: The Essentials

Seven steps that make AI coding actually work. Do these before anything else.

Set Up the Cockpit

A coding agent without tools is an eager intern sealed inside a glass box. Before you open Claude Code or Codex, make sure the agent can actually do things:

Install these:

brew — package management (macOS) or your platform equivalent
python + miniconda — scripting, test harnesses, automation
playwright — browser automation for visual testing
gh — GitHub CLI for PRs, issues, branch management
node + npm — if you’re building anything web
A linter and formatter for your language (ruff, eslint, prettier)

The agent needs to run builds, execute tests, inspect browser output, and commit code. If it can’t do those things, you’re asking it to code with its hands tied.

Always Start in Plan Mode

This is the single most impactful habit. Never let the agent just start coding.

In Claude Code: Press Shift+Tab twice to enter Plan Mode. The agent will analyse your codebase with read-only operations, ask clarifying questions, and propose a plan — before touching any code.¹¹

Boris Cherny, the creator of Claude Code, confirms: “Most sessions start in Plan mode. I go back and forth with Claude until I like its plan. From there, I switch into auto-accept edits mode and Claude can usually one-shot it. A good plan is really important.”²

Don’t: Type “build me a login page” and let it rip. That’s how you get three hours of debugging for ten minutes of work.

Make the Plan Vivid

Vague plans produce vague code. Your plan should include:

What you want done — be specific. Include examples of expected input/output.
What not to change — don’t introduce new frameworks, don’t refactor unrelated files, don’t add abstractions unless duplication is proven.
Edge cases — what happens with empty input? Duplicate entries? Network errors?

Example plan for a Tic Tac Toe game:

“Build Tic Tac Toe in plain JavaScript. No framework. Single HTML file. Include: reset button, score tracking across games, winner detection (rows, columns, diagonals), draw detection. X always goes first. The board is a 3×3 grid using CSS grid. Keep it simple — no animations, no AI opponent.”

Cole Medin puts it well: “One line of bad code is just one line of bad code. One line of a bad plan is maybe 100 lines of bad code.”¹²

Tell It to Search First

Don’t let the agent code from stale memory. Tell it to look things up:

Add to your plan: “Before implementing, search the current official docs for [API/library]. Find current examples of how others are doing this. Compare 2-3 patterns, recommend one, explain why.”

API details drift constantly. The agent’s training data is months old. A 30-second web search prevents an hour of debugging against a deprecated API.

Define How to Test — This Is the Most Important Step

This is where most people fail. Not “add tests.” Not “it should work.” Name the exact commands that prove the thing is alive.

For the Tic Tac Toe example:

“Write a Playwright test script that proves: (1) X can win via a row, (2) O can win via a diagonal, (3) draws are detected when the board fills, (4) the reset button clears the board and resets scores. Run with npx playwright test.”

The Claude Code creator is unequivocal: “Probably the most important thing to get great results out of Claude Code — give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result.”²

You don’t trust the AI. You trust the tests. The test harness is what converts “generating code” into “shipping working software.”

Loop Until All Tests Pass

The loop is the magic, not the model. The pattern is:

Plan
Implement
Run tests
See failures
Repair
Run tests again
Repeat until green

This is identical to OpenAI’s official Codex architecture: “Plan, edit code, run tools (tests/build/lint), observe results, repair failures, repeat.”¹³ In one documented run, Codex ran for 25 hours uninterrupted, generated 30,000 lines of code, and continuously verified and repaired as it went.¹³

Each iteration gets closer because context accumulates. The agent sees its own failures. That’s the mechanism. Don’t interrupt the loop by taking over and manually fixing — let the agent repair its own work.

Review the Plan Before It Codes

Before the agent writes a single line, read the plan it produced in Step 2. Tell it what’s wrong.

Ask yourself:

Does this plan cover the edge cases I care about?
Is there an assumption that will break?
Are the test criteria specific enough?
Is the scope too broad? (If yes, narrow it.)

If the agent codes wrong later, your plan was poor. Fix the plan, not the code.

Level 2: Advanced Moves

Workflow controls for when you’re past the basics. These stop the agent from doing clown cartwheels through your repo.

Inspect Before Planning

Before the agent proposes anything, have it explore the repo first. Entry points, dependencies, config files, test commands, build commands, existing conventions.

Prompt: “Before proposing a plan, explore this repo. Identify: the build command, test command, entry points, existing patterns for [feature type], and any relevant config. Then propose your approach.”

Anthropic’s engineering team recommends this exact pattern: “explore-plan-code-commit.”¹⁴

Constraints, Not Just Goals

Tell the agent what it must not do. This prevents “AI architecture astronaut syndrome.”

Add to every plan:

Don’t introduce new frameworks or libraries
Don’t refactor unrelated files
Don’t change public APIs unless required
Don’t add abstractions unless duplication is proven
Don’t write mocks when real execution is available

Name the Exact Verification Commands

Not “it should work.” Not “add tests.” The exact commands:

Verification:
  - Build: npm run build (must exit 0)
  - Lint: npx eslint src/ (zero warnings)
  - Typecheck: npx tsc --noEmit (must pass)
  - Tests: npx playwright test (all green)
  - Manual: open index.html, click through a game

Cursor’s official best practices say it plainly: “Use typed languages, configure linters, and write tests. Give the agent clear signals for whether changes are correct.”¹⁵

Build a Disposable Test Harness First

Before the main implementation, have the agent build a tiny exerciser: a script, a local page, a seeded database runner, a Playwright probe, a curl-based smoke script.

Prompt: “Before implementing the feature, create a minimal test harness that can exercise the target independently. I want to see it fail first, then pass after implementation.”

This turns debugging from interpretive dance into a measurable system. The Playwright MCP server now lets coding agents use the browser directly — write tests, run them, see them fail, fix the code, run again.¹⁶

Force Checkpointing

After each meaningful step, require a summary: what changed, what passed, what failed, what the next repair step is.

Add to plan: “After completing each milestone, write a checkpoint: (1) files changed, (2) tests run and results, (3) issues found, (4) next step. Commit the checkpoint.”

OpenAI’s Codex long-horizon pattern uses exactly this: “Milestones small enough to complete in one loop. Acceptance criteria + validation commands per milestone. Stop-and-fix rule: if validation fails, repair before moving on.”¹³

Separate Research from Edit Mode

Split the work into phases:

Research current APIs, patterns, and examples
Propose approach
Implement
Verify
Repair

This prevents the common failure mode where the agent half-researches while half-editing and produces an elegant pile of nonsense.

Prove the Bug Exists Before Fixing It

Before coding a fix, require a reproduction: a failing script, failing browser flow, failing request, failing sample input. Then the finish line is not “looks fine” but “the reproduction now passes.”

Prompt: “Before fixing this bug, first write a minimal reproduction that demonstrates the failure. Then fix it. The fix is complete when the reproduction passes.”

Use Parallelism Deliberately

Boris Cherny runs 5 Claude Code instances in his terminal, numbered 1-5, plus 5-10 more in the browser.¹⁷ But not for the same task — each instance works on a separate concern:

Tab 1: feature implementation
Tab 2: test writing
Tab 3: bug fixing
Tab 4: code review
Tab 5: documentation

Multiple simple sessions with focused scope beat one overloaded session trying to do everything.

Prefer Logs Over More Prompting

When the agent produces wrong output, don’t write a better prompt. Look at what actually happened.

Logs, screenshots, HTML dumps, API responses, DOM snapshots, SQL outputs, generated files — these beat arguing with the model. Your Playwright setup from Step 1 pays for itself here: take a screenshot, show it to the agent, say “this is what it looks like — fix it.”

Keep Tasks Binary

“Build the app” is not a task. “Create a Tic Tac Toe page with reset button, winner detection, and a Playwright test that clicks through a winning sequence” is a task.

The difference: one has a finish line with teeth. The other is a wish. Keep tasks narrow enough that “done” is either true or false.

The Anti-Pattern: Fat AGENTS.md Files

If you’ve been stuffing your AGENTS.md or CLAUDE.md with pages of instructions, coding standards, framework preferences, and architectural doctrine — stop.

ETH Zurich’s peer-reviewed study (February 2026) tested this directly: LLM-generated context files reduce task success rates compared to providing no context at all, while increasing inference cost by over 20%.³

The findings are specific: “Agents that receive context files run more tests, search more files, traverse more of the repository, and generate more reasoning output. They explore more thoroughly. But thorough exploration isn’t the same as correct exploration.”³

Even human-written context files add 14-22% more reasoning tokens and 2-4 additional steps per task. Following instructions costs compute, regardless of whether those instructions help.³

What to do instead:

Keep CLAUDE.md slim — only non-obvious, non-redundant information. Tool choices that diverge from defaults. Test configurations that aren’t apparent from the code. That’s it.
Put instructions in the plan — task-local constraints stay fresh and relevant. Global doctrine gets stale and contradictory.
A CLAUDE.md that restates the README is probably hurting more than helping.³

Boris Cherny does use a shared CLAUDE.md — but it’s compact, team-curated, and checked into git. When Claude does something wrong, he adds a specific note. That’s institutional learning. A 500-line auto-generated doctrine file is not.²

The live loop beats static scripture. Put instructions where they’re used — in the plan.

Try It Tonight: The Tic Tac Toe Challenge

Open Claude Code (or your agent of choice). Use the Level 1 workflow:

Enter Plan Mode (Shift+Tab twice)
Give it this plan:

"Build Tic Tac Toe in plain JavaScript, no framework.
Single HTML file with embedded CSS and JS.

Requirements:
- 3x3 grid using CSS grid
- X always goes first
- Winner detection: rows, columns, diagonals
- Draw detection when board fills with no winner
- Reset button that clears the board
- Score tracking across games (X wins, O wins, draws)

Constraints:
- No external dependencies
- No animations
- No AI opponent
- Keep it under 200 lines total

Testing:
Write a Playwright test (test.spec.js) that proves:
1. X can win via a row
2. O can win via a diagonal
3. Draws are detected when board fills
4. Reset button clears board and resets scores

Run: npx playwright test
All 4 tests must pass."

Review the plan. Tell it what’s wrong.
Let it implement. Watch the verification loop.
When all 4 tests pass, you’re done.

That’s it. You’ll see the difference between “describe and hope” and “plan, test, and loop” in about 20 minutes.

The Bottom Line

AI coding agents are real. 90% of Claude Code’s codebase is written by Claude Code.⁷ But using them well is “difficult and unintuitive” — even Anthropic’s engineers say so.¹⁸

The fix isn’t better prompts. It’s a better workflow:

Set up the cockpit — give the agent tools it can actually use
Plan first — always, every time, no exceptions
Define tests — the exact commands that prove success
Loop until green — the loop is the mechanism
Keep AGENTS.md slim — put instructions in the plan, not in doctrine files

If this sounds like good engineering practice — congratulations, it is. The tools changed. The discipline didn’t.

Scott Farrell helps Australian mid-market leadership teams turn scattered AI experiments into governed portfolios that compound EBIT. More at leverageai.com.au

References

[1]METR. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ — “When developers are allowed to use AI tools, they take 19% longer to complete issues — a significant slowdown that goes against developer beliefs and expert forecasts.”
[2]Boris Cherny (Claude Code Creator). “Claude Code Creator Boris Shares His Setup with 13 Detailed Steps.” reddit.com/r/ClaudeAI/comments/1q2c0ne/ — “Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result.”
[3]ETH Zurich SRI Lab. “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” arxiv.org/abs/2602.11988 via academy.dair.ai/blog/agents-md-evaluation — “LLM-generated context files reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.”
[4]Stack Overflow 2025 Developer Survey, via Keyhole Software. keyholesoftware.com/software-development-statistics-2026-market-size-developer-trends-technology-adoption/ — “84% of developers now use or plan to use AI tools in their development process.”
[5]GetPanto. “AI Coding Assistant Statistics 2026.” getpanto.ai/blog/ai-coding-assistant-statistics — “Only ~33% fully trust AI-generated code.”
[6]Stack Overflow Developer Survey 2025, via GetMocha. getmocha.com/blog/ai-app-builder-statistics/ — “Positive sentiment for AI tools: 60%, down from 70%+ in prior years.”
[7]Fortune. “Top engineers at Anthropic, OpenAI say AI now writes 100% of their code.” fortune.com/2026/01/29/100-percent-of-code-at-anthropic-and-openai-is-now-ai-written-boris-cherny-roon/ — “For Claude Code, about 90% of its code is written by Claude Code itself.”
[8]Theo Browne. “You’re Falling Behind. It’s Time to Catch Up.” youtube.com/watch?v=Z9UxjmNF7b0 — “I’m running six Claude Code instances in parallel. I haven’t opened an IDE in days.”
[9]Addy Osmani. “My LLM Coding Workflow Going Into 2026.” addyosmani.com/blog/ai-coding-workflow/ — “Don’t just throw wishes at the LLM — begin by defining the problem and planning a solution.”
[10]ACTUIA. “A METR Study Reveals that AI Slows Down Experienced Developers.” actuia.com/en/news/a-metr-study-reveals-that-ai-slows-down-experienced-developers/ — “Imperfect use of tools, notably overly simple prompts; limited familiarity with AI interfaces.”
[11]Claude Code Official Documentation. “Common Workflows.” code.claude.com/docs/en/common-workflows — “Plan Mode instructs Claude to create a plan by analyzing the codebase with read-only operations.”
[12]Cole Medin. “My Complete Agentic Coding Workflow to Build Anything.” youtube.com/watch?v=goOZSXmrYQ4 — “One line of bad code is just one line of bad code. One line of a bad plan is maybe 100 lines of bad code.”
[13]OpenAI for Developers. “Run Long Horizon Tasks with Codex.” developers.openai.com/blog/run-long-horizon-tasks-with-codex/ — “Codex ran for about 25 hours uninterrupted, used about 13M tokens, and generated about 30k lines of code.”
[14]Elegant Software Solutions. “Plan Mode and Thinking Strategies.” elegantsoftwaresolutions.com/blog/claude-code-mastery-plan-mode-thinking — “Anthropic’s engineering team recommends the ‘explore-plan-code-commit’ workflow.”
[15]Cursor. “Best Practices for Coding with Agents.” cursor.com/blog/agent-best-practices — “Use typed languages, configure linters, and write tests. Give the agent clear signals for whether changes are correct.”
[16]Shipyard. “Test-First Development with Agents and the Playwright MCP.” shipyard.build/blog/test-first-development-playwright-mcp/ — “The Playwright MCP server lets your coding agent use the browser — write tests, run them, see them fail, fix the code, and run them again.”
[17]InfoQ. “Inside the Development Workflow of Claude Code’s Creator.” infoq.com/news/2026/01/claude-code-creator-workflow/ — “Cherny runs many sessions in parallel, including five locally in his MacBook’s terminal and 5-10 on Anthropic’s website.”
[18]Addy Osmani via Fortune. “My LLM Coding Workflow Going Into 2026.” medium.com/@addyosmani/my-llm-coding-workflow-going-into-2026-52fe1681325e — “Using LLMs for programming is ‘difficult and unintuitive’ and getting great results requires learning new patterns.”

Discover more from Leverage AI for your business

Subscribe to get the latest posts sent to your email.

Your AI Can Code. You Just Don’t Know How to Drive It.

Your AI Can Code. You Just Don’t Know How to Drive It.

TL;DR

The Uncomfortable Truth

Level 1: The Essentials

Set Up the Cockpit

Always Start in Plan Mode

Make the Plan Vivid

Tell It to Search First

Define How to Test — This Is the Most Important Step

Loop Until All Tests Pass

Review the Plan Before It Codes

Level 2: Advanced Moves

Inspect Before Planning

Constraints, Not Just Goals

Name the Exact Verification Commands

Build a Disposable Test Harness First

Force Checkpointing

Separate Research from Edit Mode

Prove the Bug Exists Before Fixing It

Use Parallelism Deliberately

Prefer Logs Over More Prompting

Keep Tasks Binary

The Anti-Pattern: Fat AGENTS.md Files

Try It Tonight: The Tic Tac Toe Challenge

The Bottom Line

References

Related

Discover more from Leverage AI for your business

Leave a Reply Cancel reply