Designing Loops, Not Prompts: A Field Guide to Agentic Loops, and Who Holds the State Machine

Scott Farrell • June 19, 2026 • scott@leverageai.com.au

Field Guide · Agentic Engineering

Everyone is sorting agentic loops by what triggers them. That is the easy axis. The one that actually predicts whether a loop is worth running is who holds its state machine, and the loops that compound keep their state outside any single agent.

By Scott Farrell · LeverageAI · For builders of compounding-AI-research systems

TL;DR

  • The reframe is real. The unit of work has moved from the prompt to the loop. But sorting loops by their trigger (manual, cron, event, agent-initiated) is the surface taxonomy.
  • The second axis does the real work: who holds the state machine, whether that is your head, fixed code you wrote ahead of time, or a durable external medium any agent can read and write. Durability of external state, not the trigger, predicts robustness.
  • Type 4 is a phase transition, not “more advanced.” It is a loop that writes its own sub-loops at runtime, because you can never be more dynamic than code.
  • For building things that build things, apply the compounding test: a loop that accumulates findings is linear; a loop that becomes the substrate the next loop reads from is the one that compounds.

On June 7, 2026, Peter Steinberger, the creator of OpenClaw, posted two sentences that the entire AI-coding timeline then spent a week arguing about:

“Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.”
- Peter Steinberger (@steipete), June 2026 [1]

It landed at around six and a half million views with no diagram and no repo link, which is exactly why it ricocheted: it named a shift everyone could feel but nobody had crisply stated. The head of Claude Code at Anthropic, Boris Cherny, made the same point at Acquired Unplugged in June 2026: he no longer prompts the model directly; his job is to write the loops that prompt it. [2] Google’s Addy Osmani gave the wave a name: loop engineering. [3]

For two years the posture was simple. You wrote a good prompt, fed enough context, read what came back, and typed the next thing. The agent was a tool and you held it the entire time, one turn after the other. The prompt was the instruction; you were the loop. What Steinberger is naming is the moment that posture ends: the loop becomes a small system that finds the work, hands it out, checks it, records what was done, and decides the next thing, while you watch instead of type.

That much is now consensus. The interesting question starts one level down. Once you accept that you are designing loops, what kinds of loops are there, and which one should you build? The standard answer is a taxonomy by trigger. It is correct, and it is not enough. This piece gives you that taxonomy, then a second axis underneath it that actually predicts whether a loop will hold up, and a test you can run on a loop you operate today.

One scoping note, because it changes every recommendation that follows. This is written for the builder standing up loops for AI research, trial projects, and building things that build things: compounding the knowledge in a self-maintaining wiki, not shipping regulated production code. The reliability engineering you would do for a bank is a different essay. Here, the loop’s product is rarely merged code. It is knowledge that makes the next loop better.


The four loop types, by trigger

The cleanest field demonstration of the taxonomy in the wild comes from Theo Browne (t3.gg), who has been running these loops hard and filming the results. Sorted by what starts each loop, there are four.

1. The self-paced linear loop (the Ralph loop, /goal)

One thread, one task, a stopping condition, and an “are we done? no? keep going” check at the end of every turn. This is the Ralph loop, coined by Geoffrey Huntley in July 2025 and almost insultingly simple in its purest form [4]:

while :; do cat PROMPT.md | claude-code ; done

Huntley calls it “a bash loop that feeds an AI’s output (errors and all) back into itself until it dreams up the correct answer… brute force meets persistence,” [4] and, more memorably, “deterministically bad in an undeterministic world.” The non-obvious detail is that fresh context every iteration is the point, not a side effect: models degrade as their context fills, and a bash loop sidesteps the drift by starting clean each pass. His instruction to operators, “sit on the loop, not in it,” is the whole posture in seven words. Anthropic later shipped supported equivalents: /loop, /goal, the ralph-wiggum plugin.

The structure is fixed; only the content changes each iteration. Cheap to reason about, easy to steer. But it cannot restructure itself; it grinds toward a goal along a predefined path. The failure mode is exactly what you would predict: on a long run, the error rate climbs, because nothing external is checking the work mid-flight.
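The same shape, with the stop condition made explicit, can be sketched in Python. This is a minimal illustration rather than Huntley's or Anthropic's implementation: `run_agent`, `is_done`, `fake_agent`, and the iteration cap are all hypothetical names introduced here, and the agent is a stub so the loop can run anywhere.

```python
from typing import Callable


def run_linear_loop(
    run_agent: Callable[[str], str],
    is_done: Callable[[str], bool],
    prompt: str,
    max_iterations: int = 20,
) -> tuple[bool, int]:
    """Re-run the same prompt with a fresh context until a stop condition holds."""
    for iteration in range(1, max_iterations + 1):
        # Each call receives only the fixed prompt: nothing carries over in-context.
        output = run_agent(prompt)
        if is_done(output):
            return True, iteration
    # Hard cap so a loop that never converges still terminates.
    return False, max_iterations


# Hypothetical stand-in for an agent: it succeeds on its third invocation.
calls = {"n": 0}


def fake_agent(prompt: str) -> str:
    calls["n"] += 1
    return "ALL TESTS PASS" if calls["n"] >= 3 else "2 tests failing"


finished, turns = run_linear_loop(
    fake_agent, lambda out: "PASS" in out, "Fix the failing tests."
)
```

Note what the sketch does not do: the check sits at the end of each turn, and nothing inspects the work mid-flight, which is the failure mode described above.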

2. The scheduled / cron loop

Same machinery, but the trigger is a clock. The interesting property is not the timing; it is that the loop is decoupled from your attention. You wake up to a PR. Steinberger’s own “simple loop” is the canonical shape: “Tell codex to maintain your repos, wake up every 5 minutes and direct work to threads,” [1] routed through an orchestrator skill so some work lands autonomously.

The instructive degenerate case is a keep-alive cron whose entire job is to ping a rate limit so a precondition stays warm for other loops. That is worth internalizing: scheduled loops are frequently infrastructure for other loops, not value producers themselves. Do not mistake the heartbeat for the work.

3. The event-triggered loop

Now the trigger is external state changing in the world: a review comment lands, a CI check goes red, a new PR head appears. Theo’s sharpest pattern here is two agents unaware of each other on one PR: one produces, one reviews, and the PR itself is the shared blackboard between them. Neither holds the other’s state. This is more robust than the linear loop precisely because the coordination medium is durable and external. If either agent dies, the state survives on the PR. Hold onto that sentence; the next section is built on it.
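A toy version of the two-agents-on-one-PR pattern makes the point concrete. Everything here is an assumption for illustration: `PullRequest` is an in-memory stand-in rather than a real Git host API, and the reviewer's rule (request changes on the first two pushes) is arbitrary. What matters is that neither function holds a reference to the other; each reacts only to what it finds on the shared object.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical stand-in for a PR: the only place state lives."""
    head: int = 0                                   # increments on each push
    comments: list[str] = field(default_factory=list)
    reviewed_head: int = -1                         # last head the reviewer saw


def producer_step(pr: PullRequest) -> None:
    # Reacts to open review comments; knows nothing about who wrote them.
    if pr.head == 0 or pr.comments:
        pr.comments.clear()
        pr.head += 1


def reviewer_step(pr: PullRequest) -> None:
    # Reacts to a new head; knows nothing about who pushed it.
    if pr.head > pr.reviewed_head:
        pr.reviewed_head = pr.head
        if pr.head < 3:                             # toy rule for the sketch
            pr.comments.append(f"please fix head {pr.head}")


pr = PullRequest()
for _ in range(10):
    producer_step(pr)   # either step could be a fresh process every time
    reviewer_step(pr)
```

Because each step is a pure reaction to the blackboard, restarting either agent between iterations changes nothing about the outcome.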

4. The agentically-initiated dynamic loop

This is the frontier, and it deserves its own section, because it is not “type 1 but fancier.” It is a different kind of object. We will come back to it after we install the axis that explains why.


The axis that actually matters: who holds the state machine

The trigger taxonomy tells you when a loop fires. Starting is the cheap part. The axis that determines whether a loop is worth running at all is who holds its state machine, meaning where the “where are we, what’s done, what’s next” lives between turns. There are three answers, and they are a ladder of durability.

| Where the state lives | What that means | What happens when the agent dies mid-run |
| --- | --- | --- |
| In your head | You are the loop. You remember what’s done and decide the next turn. The pre-loop-engineering default. | Nothing survives but your memory. Doesn’t scale past your attention. |
| In fixed code you wrote ahead of time | The loop’s structure is hard-coded: phases, prompts, branches. Types 1–3 live here. | State the code persisted survives; structure you didn’t anticipate cannot appear. |
| In a durable external medium | The PR, the wiki, a state file: a blackboard any agent can read and write. | The state survives any single agent. The medium is the memory. |

This is not a new claim for us; it is the spine of the long-running-agents work. The counter-intuitive truth there was that agents which run for ten hours don’t stuff everything into context; they implement stateless workers plus stateful orchestration. The agent is stateless; the orchestration is stateful; state persists externally. [5] State inside the agent evaporates when the context resets. State outside it compounds. The whole “1-hour ceiling” is an architecture choice, not a model limitation. [5]

So restate the event-triggered loop’s robustness in these terms, because it is the proof of the axis. The two-agents-on-a-PR pattern is robust because the state machine is held by the PR, not by either agent. The producer doesn’t hold the reviewer’s state and the reviewer doesn’t hold the producer’s; both read and write a durable, external blackboard. Durability beats coordination. You do not make a multi-agent loop reliable by making the agents coordinate better. You make it reliable by putting the state somewhere that survives any one of them.
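The "state survives any one of them" claim can be checked in a few lines. The following is a sketch under stated assumptions: the state file name, the `crash_after` parameter, and the task names are invented for the example, and the "work" is a placeholder. The worker holds nothing between steps that is not also on disk, so a second worker can finish what the first one dropped.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_state(path: Path, tasks: list[str]) -> dict:
    # State lives in the file, so any new worker can resume from it.
    if path.exists():
        return json.loads(path.read_text())
    return {"todo": list(tasks), "done": []}


def run_worker(path: Path, tasks: list[str], crash_after: Optional[int] = None) -> dict:
    state = load_state(path, tasks)
    steps = 0
    while state["todo"]:
        if crash_after is not None and steps == crash_after:
            raise RuntimeError("worker died mid-run")
        task = state["todo"].pop(0)
        state["done"].append(task)            # placeholder for real work
        path.write_text(json.dumps(state))    # persist after every step
        steps += 1
    return state


state_file = Path(tempfile.mkdtemp()) / "loop_state.json"
tasks = ["survey", "draft", "verify"]
try:
    run_worker(state_file, tasks, crash_after=1)   # first worker dies after one task
except RuntimeError:
    pass
final = run_worker(state_file, tasks)              # a fresh worker resumes from the file
```

The second call never learns that a first worker existed. It only reads the medium.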

The trigger tells you when a loop fires. Who holds the state machine tells you whether it survives. Optimise the second one.

The same four types, now as one dial: Karpathy’s autonomy slider

Once you sort loops by how much of the state machine you’ve handed over, the flat list of four becomes a spectrum, and Andrej Karpathy already named the spectrum. In his June 2025 talk “Software Is Changing (Again),” he described an autonomy slider, drawn from his years on Tesla Autopilot:

“Depending on the complexity of the task at hand, you can tune the amount of autonomy that you’re willing to give up for that task.”
- Andrej Karpathy, “Software Is Changing (Again),” June 2025 [6]

His image for it is the Iron Man suit: “it’s both an augmentation and Tony Stark can drive it. And it’s also an agent… this is the autonomy slider. We can build augmentations or we can build agents.” [6] The prescription he closes on is not “max autonomy”; it is the opposite: “this is one way of keeping the AI on leash… and the AI is not getting lost in the woods.” [6] He warns builders are getting “way too excited” about fully autonomous agents. “It’s less Iron Man robots and more Iron Man suits that you want to build.” [6]

Map the four loop types onto that dial and the taxonomy lifts from a list into a setting you choose:

| Loop type | Trigger | Who holds the state machine | Slider position |
| --- | --- | --- | --- |
| 1. Linear (Ralph, /goal) | You / a stop condition | Fixed code; a state file | Low: supervised, you hold the leash |
| 2. Scheduled (cron) | A clock | Fixed code, decoupled from your attention | Low–mid: runs without you, structure still yours |
| 3. Event-triggered | External state change | A durable external blackboard (the PR) | Mid: agents act, the medium holds state |
| 4. Agentically-initiated dynamic | The agent itself | The agent: it writes the state machine at runtime | High: self-structuring; leash is shortest where it matters most |

This is the synthesis the three authorities give you together, and it is worth keeping their contributions distinct: Steinberger supplies the reframe (design loops, not prompts); Theo supplies the demonstrated taxonomy and the type-4 loop in the wild; Karpathy supplies the slider that turns the four types into a spectrum of handed-over control. Software 3.0, in his framing, is the world where “you program them in English” [7], and the loop is how you program in English at scale, without holding the tool every turn. The slider is the reminder that “at scale” is not the same as “all the way to the right.”


The actual phase transition: loops that write loops

Now we can say precisely what makes type 4 different. The first three loops are loops you designed. The structure is fixed; only the content changes each turn. The fourth is a loop that designs its own sub-loops at runtime. That is not a quantitative jump up the slider. It is a change of category: the point on the dial where the agent stops running your state machine and starts writing one.

Theo’s concrete instance is the one to hold in your head. In a PR-audit workflow, the agent did not call a workflow feature. It wrote roughly 240 lines of throwaway JavaScript that defined its own phases, schemas, prompts-as-functions, and a pipeline, and then that code orchestrated the sub-agents through it. His line is the one to internalize:

“You can never be more dynamic than code… when the agent can write code, it is effectively building its own custom feature every time.”
- Theo Browne (t3.gg) [8]

Read that twice, because it inverts how most people think about code in agent systems. Code is usually the agent’s output: the thing it produces and you ship. Here, code is the step between model runs. A hard-coded workflow primitive forces the agent into your pre-built shape: here are the phases you may use, here is how to structure a prompt. Letting it write the orchestration code means the shape of the loop matches the shape of the problem, including problems you never anticipated when you built the harness. A fixed feature can only ever be as dynamic as the cases its author imagined. Code the agent writes on the spot has no such ceiling.

We have argued the same move before, from the other direction: the most profound capability is not that agents write code, it is that they “write code that writes code, build tools that build better tools.” [9] Type 4 is that idea pointed at the loop itself: the agent building its own orchestration the way an engineer would reach for the right control structure, except it reaches at runtime, per problem.

The cost is real and worth stating plainly. It is non-deterministic: the model sometimes writes invalid code while pushing limits. It is more expensive. And you lose the legibility a fixed primitive gives you: you can read your own pipeline, but you cannot fully predict the one the agent will write tonight. For regulated production, that trade is usually wrong. For research and exploration, surfacing structure you didn’t know to look for, it is exactly right, because you are not optimising for repeatability. You are optimising for the agent discovering a shape you couldn’t have specified.
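To make "code is the step between model runs" tangible, here is a deliberately small sketch. It is in Python rather than the JavaScript of Theo's example, and nothing in it comes from his workflow: `GENERATED_ORCHESTRATION` is a hard-coded string standing in for model output, and `load_pipeline` and `fake_sub_agent` are hypothetical names. The harness does not know the pipeline's phases in advance; it only loads whatever was written and hands it a way to call sub-agents. The `compile()` step is also where invalid generated code gets rejected before anything runs.

```python
from typing import Callable

# Hypothetical planner output: in a real loop this string would come from a model.
GENERATED_ORCHESTRATION = '''
def pipeline(call_agent, items):
    findings = [call_agent("audit", item) for item in items]
    return [call_agent("verify", finding) for finding in findings]
'''


def load_pipeline(source: str) -> Callable:
    # compile() raises SyntaxError on malformed code before anything executes.
    code = compile(source, "<generated>", "exec")
    namespace: dict = {}
    exec(code, namespace)  # assumption: this runs inside a sandbox you trust
    return namespace["pipeline"]


def fake_sub_agent(phase: str, payload: str) -> str:
    # Stand-in for a sub-agent call; it just tags the payload with its phase.
    return f"{phase}({payload})"


pipeline = load_pipeline(GENERATED_ORCHESTRATION)
results = pipeline(fake_sub_agent, ["pr-1", "pr-2"])

try:
    load_pipeline("def pipeline(:")   # malformed model output
    rejected = False
except SyntaxError:
    rejected = True
```

Executing generated code is exactly the legibility trade described above, so treat the `exec` call as something to isolate, not something to copy into a production harness as-is.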

Where it lands on the dial
Type 4 sits highest on the autonomy slider, and that is precisely why Karpathy’s leash matters most here. A self-structuring loop with nothing external to check it is the failure mode at maximum amplitude. The next two sections are the leash: external durable state, and a closer that can say no.

Durability beats coordination, and a loop needs something that can say no

The single best external line in the whole loop-engineering discourse was a reply under Steinberger’s tweet, from @mosyaseen:

“Designing the loop is half of it. The other half is putting something in the loop that can say no: a test, a type check, a real error. A loop with nothing to push back is the agent agreeing with itself on repeat.”
- @mosyaseen [10]

This is the leash, stated operationally. The Ralph loop’s error rate climbs on long runs for exactly this reason: nothing external pushes back mid-flight. Anthropic’s supported /goal ships a closer as a feature: after each turn, a separate fast evaluator model judges whether the success condition holds; the model doing the work is not the model grading it. That separation is the whole point. A loop grading its own work is the agent agreeing with itself.
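The separation between worker and closer is easy to sketch. This is not how /goal is implemented; it is a minimal illustration with hypothetical names (`run_with_closer`, `toy_worker`, `toy_closer`) in which the closer is a plain function rather than an evaluator model. The structural point is that the check lives outside the worker and its objection becomes input to the next turn.

```python
from typing import Callable, Optional

Attempt = list[int]


def run_with_closer(
    worker: Callable[[list[str]], Attempt],
    closer: Callable[[Attempt], Optional[str]],
    max_turns: int = 5,
) -> tuple[Optional[Attempt], list[str]]:
    """Loop until the closer stops objecting or the turn budget runs out."""
    objections: list[str] = []
    for _ in range(max_turns):
        attempt = worker(objections)
        objection = closer(attempt)      # graded by something other than the worker
        if objection is None:
            return attempt, objections
        objections.append(objection)     # the "no" becomes input to the next turn
    return None, objections


def toy_worker(objections: list[str]) -> Attempt:
    # Hypothetical worker: it gets the answer wrong until something pushes back.
    return [1, 2, 3] if objections else [3, 1, 2]


def toy_closer(attempt: Attempt) -> Optional[str]:
    # A real check, not an opinion: is the list actually sorted?
    return None if attempt == sorted(attempt) else f"not sorted: {attempt}"


answer, pushback = run_with_closer(toy_worker, toy_closer)
```

Swap `toy_closer` for a test suite, a type checker, or a separate evaluator and the loop shape stays the same.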

Notice that the closer and the durable medium are usually the same object. The PR is both the shared blackboard and the place a red CI check can say no. The wiki is both the external state and the place a consolidation can be reverted if it merged two ideas it shouldn’t have. This is why “durability beats coordination” is not a slogan: the durable external medium is simultaneously what survives an agent dying and what gives a second, adversarial process something to push back against. State held inside one agent can do neither.

The minimum viable version of this, the leash you can rig tomorrow with no new tools, is four things: external state persistence (get progress out of the context window into a file), explicit completion criteria (verifiable conditions, not “feels done”), checkpoint discipline (compress and persist after each step), and context hygiene (evict cold data aggressively). [5] Those four turn a Ralph loop from a thing that drifts into a thing that holds. They are also, not coincidentally, the four moves that put the state machine somewhere durable.
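The four moves fit in one short function. This is a sketch under loud assumptions: the truncation that stands in for "compress", the `MAX_HOT_NOTES` limit, the file name, and the step names are all invented for illustration, and the agent output is a placeholder string.

```python
import json
import tempfile
from collections import deque
from pathlib import Path

MAX_HOT_NOTES = 2  # assumption: how many recent summaries stay in working context


def run_steps(state_path: Path, steps: list[str], required: set[str]) -> tuple[dict, list[str]]:
    progress = {"completed": [], "summaries": []}
    hot_context = deque(maxlen=MAX_HOT_NOTES)         # context hygiene: old notes fall out
    for step in steps:
        raw_output = f"{step}: " + "verbose detail " * 40   # placeholder agent output
        summary = raw_output[:30].strip()             # checkpoint discipline: compress first
        progress["completed"].append(step)
        progress["summaries"].append(summary)
        hot_context.append(summary)
        state_path.write_text(json.dumps(progress))   # external state persistence
        if required <= set(progress["completed"]):    # explicit completion criterion
            break
    return progress, list(hot_context)


state_path = Path(tempfile.mkdtemp()) / "progress.json"
progress, hot = run_steps(
    state_path,
    steps=["collect", "compare", "summarise", "polish"],
    required={"collect", "compare", "summarise"},
)
```

The loop stops at "summarise" because the completion criterion is a set comparison, not a feeling, and the working context never holds more than the two most recent summaries.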


The compounding test: accumulate, or become the substrate

Everything so far applies to any loop. Here is the part written specifically for the reader building things that build things, where the loop’s product is not merged code but knowledge. For that reader there is one test that sorts the loops worth building from the ones that only look busy.

The compounding test
A loop that merely accumulates findings (writes entries you later read) is linear. A loop whose output becomes the substrate the next loop reads from (structured entries a future agent loads as priors to decide what to investigate next) is the one that genuinely compounds. The loop’s real product is knowledge that makes the next loop better.

Most people building a knowledge loop build the accumulate version, because it is the obvious half. Loops write entries; you read them. It grows. It also gets slower to think with as it grows, because an append-only pile is just the swamp with extra steps. The compounding version requires two things the accumulate version skips.

First, the entries have to be structured enough to be read back as input: not prose, but claims and edges, so a future agent navigates the relationships instead of re-inferring them. That structure is the difference between a loop that runs and a loop that compounds; we made the full argument for it in The Index Is the Data [11], and Karpathy independently named the artefact the “LLM Wiki”: a persistent, interlinked set of markdown files where “the knowledge is compiled once and then kept current, not re-derived on every query.” [12] That wiki-graph is the durable external medium a compounding loop writes to. It is the third rung of the state-machine ladder, made into an asset.

Second, and this is the half almost everyone skips, the loop needs a consolidation step, not just an ingestion step. A loop that only writes is a knowledge graveyard. The fix is a second agent whose job is to subtract: combine redundant claims, fade stale ones, fold flat facts into edges, spin clusters into their own pages. We call it the Janitor, and it is reflection running on your corpus: the same mechanism that kept long-horizon agents coherent in the Generative Agents research, where removing the reflection step degraded them into repetition. [11] “The index didn’t just get smaller. It got smarter with every janitorial pass.” [11] If you build one loop for your wiki, build the consolidation loop, because the ingestion loop is the half that does not, by itself, compound.
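A consolidation pass can be sketched with two of the Janitor's four jobs: combining redundant claims and fading stale ones. This is not the Janitor from The Index Is the Data; the `Claim` fields, the `stale_after` threshold, and the whitespace-and-case `normalise` function (a crude stand-in for semantic matching) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    last_confirmed_run: int
    sources: int = 1


def normalise(text: str) -> str:
    # Crude stand-in for semantic matching: case and whitespace only.
    return " ".join(text.lower().split())


def janitor_pass(claims: list[Claim], current_run: int, stale_after: int = 5) -> list[Claim]:
    merged: dict[str, Claim] = {}
    for claim in claims:
        if current_run - claim.last_confirmed_run > stale_after:
            continue                                   # fade stale claims
        key = normalise(claim.text)
        if key in merged:                              # combine redundant claims
            kept = merged[key]
            kept.sources += claim.sources
            kept.last_confirmed_run = max(kept.last_confirmed_run, claim.last_confirmed_run)
        else:
            merged[key] = Claim(claim.text, claim.last_confirmed_run, claim.sources)
    return list(merged.values())


wiki = [
    Claim("Fresh context reduces drift", last_confirmed_run=9),
    Claim("fresh context  reduces drift", last_confirmed_run=10),
    Claim("Cron loops are always value producers", last_confirmed_run=1),
]
cleaned = janitor_pass(wiki, current_run=10)
```

Three entries go in and one comes out, carrying two sources and the most recent confirmation. The pass returns new objects rather than editing in place, which keeps the original list available if a merge needs to be reverted.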

The compounding test, then, comes down to a question about where your loop’s output goes. Does it land in a pile you read, or does it land in a structured, self-compacting map that the next loop consults as priors before it decides what to do? Theo’s better workflows already did the compounding version without naming it: his PR audit loaded priors and clusters from prior runs. The first kind of loop is a scheduled-sweep with a weak closer. The second is the machine you actually want: one that generates its own next questions and scores them against a documented North Star.
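What "consults as priors before it decides what to do" might look like, at its most reduced: the next run reads structured entries and picks its own next question. The scoring here is an assumption chosen for brevity (keyword overlap with a North Star reduced to three words); a real scorer would be richer, and the entry schema is hypothetical.

```python
from typing import Optional

# Assumption: the North Star is reduced to keywords; a real one would be a document.
NORTH_STAR = {"durable", "state", "compounding"}


def relevance(question: str, north_star: set[str]) -> int:
    words = {word.strip("?.,").lower() for word in question.split()}
    return len(words & north_star)


def next_question(priors: list[dict], north_star: set[str]) -> Optional[str]:
    # Only open questions compete; answered ones are priors, not work.
    open_questions = [p["question"] for p in priors if not p["answered"]]
    if not open_questions:
        return None
    return max(open_questions, key=lambda q: relevance(q, north_star))


priors = [
    {"question": "Does durable state survive worker death?", "answered": True},
    {"question": "Which closer catches compounding errors in durable state?", "answered": False},
    {"question": "What colour should the dashboard be?", "answered": False},
]
chosen = next_question(priors, NORTH_STAR)
```

An append-only pile cannot support this function, because nothing in it marks a question as answered or open.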


The honest caveat: extract the apparatus, not the theater

It would be easy to read Theo’s demos and conclude the lesson is scale: 55-agent tournaments, hundreds of dollars to pick between three PRs, “going hard” on subsidised inference. It is not. That intensity is a function of his temporary economics: he is optimising to burn subsidised tokens before they disappear, explicitly not for value-per-token. The 55-agent tournament to choose among three PRs is theater. For your context (your own hardware, your own budget, a wiki meant to compound over years), copying the intensity is copying the wrong variable.

The thing to extract is the architecture: dynamic, self-structuring, artifact-producing loops with durable external state and an adversarial closer. Theo’s PR audit was good not because it ran a tournament but because it had a verify phase: an adversarial checker per ruling that tried to refute it against the real repo. Strip that phase and the tournament is just expensive agreement. The structure (audit, rule, verify, with priors loaded from memory) works at a tenth the agent count. As we put it in The Cognition Dimension Ladder [13]: whoever spends the most tokens wins, provided they are spending inside an apparatus that knows what to reject. The load-bearing word is discipline, not burn. The token-maxing is the theater. The apparatus is the asset.


Run the field guide on a loop you operate today

This is a test you can apply in five minutes. Take one loop you already run.

  1. Name its trigger. Manual, clock, event, or agent-initiated? (The cheap axis: answer it and move on.)
  2. Name where its state lives. Your head, fixed code you wrote ahead of time, or a durable external medium? This is the axis that matters.
  3. Kill it mid-run, in your imagination. If the agent died right now, does the state survive? If the answer is “no,” the loop is held together by attention, not architecture.
  4. Find the thing that can say no. Is there an external check (a test, a separate evaluator, a red CI check, a revertible diff) or is the loop grading its own homework?
  5. Apply the compounding test. Does this loop’s output land in a pile you read, or does it become a structured prior the next loop reads from? If it only accumulates, you have built the obvious half.

If you cannot answer 2 through 5, the loop is not yet an asset. It is an activity. The whole shift Steinberger named, from prompting agents to designing the loops that prompt them, only pays off when the loop’s state machine lives somewhere durable, something external can say no, and the output compounds into the substrate the next run reads. Get those right and you can leave the leash long, because the loop is holding itself. Get them wrong and no amount of autonomy, or agents, or tokens, will save you from a loop agreeing with itself on repeat.
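If it helps to keep the audit honest, the five questions can be written down as a record and checked mechanically. The field names, the allowed values, and the wording of each gap are hypothetical; the example loop being audited is invented.

```python
from dataclasses import dataclass


@dataclass
class LoopAudit:
    trigger: str                      # step 1: "manual", "clock", "event", or "agent"
    state_location: str               # step 2: "head", "fixed_code", or "external_medium"
    state_survives_agent_death: bool  # step 3
    has_external_check: bool          # step 4
    output_feeds_next_loop: bool      # step 5


def gaps(audit: LoopAudit) -> list[str]:
    # The trigger is recorded but never scored: it is the cheap axis.
    found = []
    if audit.state_location == "head":
        found.append("state lives in your head")
    if not audit.state_survives_agent_death:
        found.append("state dies with the agent")
    if not audit.has_external_check:
        found.append("nothing can say no")
    if not audit.output_feeds_next_loop:
        found.append("output only accumulates")
    return found


nightly_sweep = LoopAudit(
    trigger="clock",
    state_location="fixed_code",
    state_survives_agent_death=True,
    has_external_check=False,
    output_feeds_next_loop=False,
)
report = gaps(nightly_sweep)
```

An empty list is the bar for calling a loop an asset. The invented nightly sweep above misses it twice.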

Design the loop. But spend most of your design budget on the part the discourse keeps skipping: not what starts it, but who holds its state machine, and whether the spend leaves an asset behind.

Building loops for a compounding wiki? The substrate these loops feed (the self-cleaning wiki-graph, the Janitor, claims-and-edges over chunks) is the subject of The Index Is the Data. Read it next, then come back and run the compounding test on your own loop.

References

  1. Peter Steinberger (@steipete). Tweet, June 2026 (~6.5M views): “Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Follow-up: “Tell codex to maintain your repos, wake up every 5 minutes and direct work to threads.” https://x.com/steipete/status/2063697162748260627
  2. Boris Cherny, Head of Claude Code at Anthropic. At Acquired Unplugged (presented by WorkOS), June 2, 2026: “I don’t prompt Claude anymore. I have loops that are running… they’re the ones prompting Claude and figuring out what to do. My job is to write loops.” https://www.youtube.com/watch?v=RkQQ7WEor7w · https://x.com/bcherny/status/2044632185354006713
  3. Addy Osmani. “Loop Engineering.” Google engineer credited with naming the wave. https://addyosmani.com/blog/loop-engineering/
  4. Geoffrey Huntley, via The Register (2026-01-27). Coined the “Ralph” loop, July 2025: “Ralph is a Bash loop”; while :; do cat PROMPT.md | claude-code ; done; “a bash loop that feeds an AI’s output (errors and all) back into itself until it dreams up the correct answer. It is brute force meets persistence.” Also “deterministically bad in an undeterministic world” and “sit on the loop, not in it.” https://www.theregister.com/2026/01/27/ralph_wiggum_claude_loops/
  5. Scott Farrell, LeverageAI. “Breaking the 1-Hour Barrier.” Stateless workers + stateful orchestration; state persists externally; “your 1-hour ceiling is an architecture choice, not a model limitation”; the minimum viable meta-loop (external state, completion criteria, checkpoints, context hygiene). https://leverageai.com.au/breaking-the-1-hour-barrier-ai-agents-that-build-understanding-over-10-hours/
  6. Andrej Karpathy. “Software Is Changing (Again),” Y Combinator AI Startup School, June 2025 (transcript): “Depending on the complexity of the task at hand, you can tune the amount of autonomy that you’re willing to give up for that task.” “this is the autonomy slider. We can build augmentations or we can build agents.” “this is one way of keeping the AI on leash… and the AI is not getting lost in the woods.” “It’s less Iron Man robots and more Iron Man suits that you want to build.” (Quotes from the talk recording.) https://www.youtube.com/watch?v=LCEmiRjPEtQ · https://www.ycombinator.com/library/MW-andrej-karpathy-software-is-changing-again
  7. Andrej Karpathy. Tweet, June 18, 2025: “LLMs are a new kind of computer, and you program them in English.” (Software 3.0.) The catchphrase “the hottest new programming language is English” is from his earlier tweet, January 24, 2023. https://x.com/karpathy/status/1935518272667217925 · https://x.com/karpathy/status/1617979122625712128
  8. Theo Browne (t3.gg). Video transcripts on agentic loops: “You can never be more dynamic than code… when the agent can write code, it is effectively building its own custom feature every time.” Also: the ~240-line throwaway-JS PR-audit workflow that defined its own phases/schemas/pipeline; “What if we put Claude Code in a while loop?”; two-agents-unaware-on-one-PR; the keep-alive cron. (Transcribed from Theo Browne’s videos; treat figures/phrasings as transcript-grade.) https://www.youtube.com/watch?v=iJVJwmCKW9o · https://www.youtube.com/watch?v=wwfJlSF34n8 · https://www.youtube.com/watch?v=FDxW2bfBOWE · https://www.youtube.com/watch?v=3sTu8sSLVfg
  9. Scott Farrell, LeverageAI. “The Agent Token Manifesto” (Self-Improving Loops): “write code that writes code, build tools that build better tools”; agents build their own capabilities; hypersprints. https://leverageai.com.au/
  10. @mosyaseen. Reply under Steinberger’s tweet: “designing the loop is half of it. the other half is putting something in the loop that can say no: a test, a type check, a real error. a loop with nothing to push back is the agent agreeing with itself on repeat.” https://x.com/steipete/status/2063697162748260627
  11. Scott Farrell, LeverageAI. “The Index Is the Data: How a Self-Cleaning Wiki-Graph Out-Thinks RAG.” Claims-and-edges over chunks; the Ingestion + Janitor dual-agent engine; “the index didn’t just get smaller. It got smarter with every janitorial pass”; the Janitor as reflection (per Generative Agents, Park et al., UIST 2023). https://leverageai.com.au/the-index-is-the-data-how-a-self-cleaning-wiki-graph-out-thinks-rag/
  12. Andrej Karpathy. “LLM Wiki” gist (April 2026): “The LLM incrementally builds and maintains a persistent wiki, a structured, interlinked collection of markdown files”; “the knowledge is compiled once and then kept current, not re-derived on every query.” https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
  13. Scott Farrell, LeverageAI. “The Cognition Dimension Ladder.” Disciplined cognition: “whoever spends the most tokens wins, provided they are spending them inside an apparatus that knows what to reject”; the chooser at Dimension 4. https://leverageai.com.au/the-cognition-dimension-ladder-why-your-ai-strategy-is-one-rung-too-low/

© 2026 Leverage AI, Scott Farrell. All rights reserved. This content is made available on a limited, revocable, read-only basis only. No licence or right is granted to copy, reproduce, republish, scrape, store, adapt, summarise, index, embed, or use this content to create derivative works, work product, deliverables, methodologies, training materials, prompts, templates, software, services, research, or commercial outputs, whether by humans or machines, without prior written permission. This restriction includes internal business use, client work, consulting, advisory, implementation, and any use in or for artificial intelligence, machine learning, data extraction, retrieval, evaluation, fine-tuning, or knowledge-base construction.