The Genius and the Form
You took your smartest tool, handed it a form and a ten-page manual on how to fill it out — then wondered where the genius went.
The argument in three lines
- The world changed under the prompt. The prescriptive style (tasks, formats, a wall of “don’t”s) was right for GPT-3 and early GPT-4. On models that can think, it now caps the output instead of raising it.
- A prompt used to be a specification. Now it’s an orientation. Give the model a clear north star, a short fence of genuinely load-bearing constraints, and the latitude to be good, then get out of the way.
- Loose is not vague, and rigour doesn’t vanish; it moves. A too-loose north star produces confident nonsense; the craft is tight intent, loose method. Governance shifts from the input leash to the audit.
Picture your brightest staff member. Genuinely sharp, creative, sees around corners. Now sit them down, hand them a form, and then hand them a ten-page document on how to fill out the form. Put the date here. Format is DD/MM/YYYY. Don’t put it in the wrong box. Here’s what each field means. Here are twenty things not to write.
And then you’re surprised when you get back a perfectly-filled-out form with no insight in it. No brilliance. No “hey, I noticed something weird in the Q3 numbers.” Of course there’s no insight — you didn’t ask for insight, you asked for a form. You took your smartest person, turned them into a data-entry clerk, and then wondered where the genius went.
The genius went into getting the date format right. Because that’s what you signalled mattered.
I’ve changed my mind about how you’re supposed to prompt these models, and this book is the write-up. The short version: I used to write prompts like a list of tasks. Do this, then that, here are the parameters, here’s the format, here are all the things you must never do. And it worked — back on GPT-3 and early GPT-4, you kind of had to. The model wasn’t that bright, so you thought for it. You handed it a very narrow lane and basically drove the car yourself, with the model as the steering wheel.
A concession, up front
But the models changed. GPT-5.5, Claude 4.7, 4.8 — these things are genuinely brilliant now. And my old prompt style is actively hurting the result. I’m hemming the model in. I’m taking something that could give me a surprising, insightful answer and forcing it to colour inside the lines of my crappy little form.
Name the shift: specification, then orientation
Here’s the shift, named plainly. The old model of prompting is specification: you, the human, hold the intelligence, and the prompt is the spec you hand to a fast-but-dim executor. Every rule, format and don’t is you doing the thinking up front because the model can’t.
The new model is orientation: the model holds the intelligence, and the prompt’s job is not to think for it but to aim it. You give it a north star — the goal, the why, who it’s for, what “good” and “wrong” mean — hand it the tools, and get out of the way.
A prompt used to be a specification. Now it’s an orientation. You used to write the instruction manual. Now you write the mission.
Two ways to hand work to a model
| | Specification (old) | Orientation (new) |
|---|---|---|
| Who holds the intelligence | The human, in the prompt | The model |
| The prompt’s job | Think for the model | Aim the model |
| Its shape | Tasks, formats, definitions, don’ts | Goal + a short fence + latitude |
| Right for | GPT-3 / early GPT-4 | GPT-5.5 / Claude 4.7–4.8 |
| Specification prompting on that era’s models | (was fine) | Caps a smart model’s output |
This table is the spine of the book. Every later chapter is one column or the other, aimed at a different job.
Does the posture really beat the model?
It can, and there’s a clean number for it. In Andrew Ng’s well-known result on the HumanEval coding benchmark, GPT-3.5 scored 48.1% on its own; GPT-4 on its own did better at 67.0%. But wrapped in an iterative agent workflow, given room to plan, draft, critique and revise toward a target, GPT-3.5 reached as high as 95.1%, sailing past raw GPT-4.¹
[Figure: “The posture, not the model, did the work.” Bar chart of benchmark scores: GPT-3.5 on its own (48.1%); GPT-4 on its own (67.0%); GPT-3.5 oriented in an agent loop (95.1%).]
Worth saying out loud because it’s mis-quoted everywhere: the famous 48→95 jump is GPT-3.5, not GPT-4.
The lesson isn’t “agents are magic.” It’s that how you orient a capable model toward an outcome can outweigh a whole generation of raw capability. Aim it, don’t cage it.
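The loop shape behind that result can be sketched in a few lines. This is a minimal illustration, not Ng's actual harness: `draft_fn`, `critique_fn` and `revise_fn` are hypothetical stand-ins for model calls, and the toy versions below are plain Python functions so the example runs without any API.

```python
from typing import Callable, Optional


def agent_loop(
    task: str,
    draft_fn: Callable[[str], str],
    critique_fn: Callable[[str, str], Optional[str]],
    revise_fn: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft once, then critique and revise until the critic has no objection."""
    attempt = draft_fn(task)
    for _ in range(max_rounds):
        objection = critique_fn(task, attempt)
        if objection is None:  # critic is satisfied: stop early
            break
        attempt = revise_fn(task, attempt, objection)
    return attempt


# Toy stand-ins (hypothetical): a real setup would call a model in each role.
def toy_draft(task: str) -> str:
    return "def add(a, b): return a - b"  # deliberately buggy first draft


def toy_critique(task: str, attempt: str) -> Optional[str]:
    namespace: dict = {}
    exec(attempt, namespace)  # run the candidate, as a unit test would
    return None if namespace["add"](2, 3) == 5 else "add(2, 3) should be 5"


def toy_revise(task: str, attempt: str, objection: str) -> str:
    return attempt.replace("a - b", "a + b")


result = agent_loop("write add(a, b)", toy_draft, toy_critique, toy_revise)
print(result)  # def add(a, b): return a + b
```

The prompt for each role can stay short: the loop supplies the target and the feedback, so the model is aimed at an outcome rather than walked through a procedure.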
Key Insight
The model got smart while your prompt stayed dumb. The fix isn’t a better specification — it’s a clearer mission.
Who this is for
This book is for two kinds of reader. The first writes prompts for a living — engineers, agent builders, anyone who spends their day getting work out of these models. The second is a technically-literate leader who hears “AI governance” and reaches for a longer rulebook. By the end, both should be able to do four things: tell which parts of a task to prescribe and which to hand over; cut the dead don’ts; write a tight-intent, loose-method north star; and move their rigour from input-constraints to output and path verification.
None of that is a style preference. On a model that can think, specification is a ceiling. The rest of Part I shows you exactly why — and it isn’t taste, it’s mechanism.
A Denial-of-Service on Intelligence
Why detail backfires now — and it isn’t taste, it’s the architecture.
Tell a smart AI to stay in a narrow lane and it will stay in the lane. It’s obedient; it’s a good employee. But it turns off most of its brain to do it. Or worse: most of its thinking gets consumed by the minute details of your prompt — parsing your ten rules, satisfying your format, remembering your seventeen don’ts — instead of thinking about the problem. You’ve spent its intelligence on compliance instead of on the work.
That sounds like a metaphor. It isn’t. There’s a mechanism underneath, and it’s precise.
Key Insight
A modern model has a roughly fixed cognitive budget per response. Every token of constraint, definition and prohibition is budget spent on parsing and complying instead of on the problem. Over-specification doesn’t make a smart model safer; it makes it dumber.
The metaphor is literally the architecture
You don’t have to take my word for the “budget” framing. The makers of these models describe it the same way. Anthropic’s engineering guidance states plainly that large language models have an “attention budget” they draw on when parsing context, and that context “must be treated as a finite resource with diminishing marginal returns.”²
The reason is architectural. A transformer lets every token attend to every other token, which is n² pairwise relationships for n tokens, so as the context grows “a model’s ability to capture these pairwise relationships gets stretched thin.” This is exactly what we mean in our own work on context engineering when we say context is attention, not just capacity — cluttered context diffuses focus and degrades output even when you’re nowhere near the token limit. Every “do not” and every redundant definition is drawn from the same finite pool the model needs for thinking.
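To make the n² point concrete, here is a rough back-of-envelope sketch. The token counts are invented for illustration, and counting pairs is only a proxy for how attention is spread, not a measurement of any real model.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Token pairs in the n-squared sense used above: n * n."""
    return n_tokens * n_tokens


# Hypothetical sizes: a 300-token task, with or without 700 tokens of rules.
task_only = pairwise_relationships(300)
task_plus_rules = pairwise_relationships(300 + 700)

# Share of all pairs that connect one task token to another task token.
task_share = task_only / task_plus_rules

print(task_only)        # 90000
print(task_plus_rules)  # 1000000
print(f"{task_share:.0%} of pairs are task-to-task")  # 9% of pairs are task-to-task
```

Under these made-up numbers, the task went from being all of the context to a small minority of the relationships the model has to weigh, without the task itself changing at all.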
Is this measurable, or just plausible?
Measurable. The peer-reviewed “Lost in the Middle” study found that models systematically fail to use information buried in the middle of a long context: performance is highest when the relevant material sits at the start or end, and “significantly degrades” when the model has to reach into the middle.³ Chroma’s 2025 “Context Rot” tests put eighteen frontier models (including GPT-4.1, Claude 4, Gemini 2.5 and Qwen3) through the same wringer and found every one of them grew “increasingly unreliable as input length grows,” even on simple tasks, well inside their advertised windows.⁴
A million-token window is not a licence to fill it.
Pile in seventeen don’ts and the one rule that actually matters is now buried in the middle, competing for attention with sixteen rules that don’t. You didn’t add safety. You added noise, and you hid your own signal inside it.
The same shape shows up in agents
It isn’t only a prose-prompt problem. Anthropic notes that when an agent is wired to thousands of tools, it “will need to process hundreds of thousands of tokens before reading a request.”⁵ Benchmarks like MCPVerse confirm the direction: agent performance degrades as the action space grows, because the model has to find the right tool in a bigger haystack.⁶ More descriptions of capability up front mean less budget for the actual job. More rules, same story.
Over-prompting a smart model is a denial-of-service attack on its intelligence.
Where the budget goes
| Spent on you | Spent on the work |
|---|---|
| Parsing your rules | Understanding the actual problem |
| Satisfying your format | Finding the non-obvious angle |
| Policing itself against your don’ts | Producing the insight you couldn’t pre-write |
| Re-reading definitions of things it already knows | Surfacing the answer you didn’t know to ask for |
Every line you add to the prompt moves spend from the right column to the left.
So when you tell a smart AI to stay in a narrow lane, you’re prompting GPT-5.5 like it’s GPT-3.5, and you get GPT-3.5 behaviour back, because you told it to think small.
Myth vs Reality
Myth: more context and more instruction make the output safer and better.
Reality: past a small, load-bearing core, more means diffused attention, which means worse — and that’s measured, not asserted.
One honest caveat before the next chapter: this isn’t an argument for empty prompts. Intent detail and genuinely load-bearing constraints earn their budget. Procedure, don’ts, and definitions of the obvious don’t. We’ll make that cut precise later. But if over-specification taxes the budget, one ingredient taxes it worst — and most counter-productively of all.
The Most Destructive Ingredient
Of everything in a modern prompt, the wall of “don’t” is the part actively working against you.
Here’s a small confession. I needed an agent to understand that a particular wiki page’s job is not to answer the user’s question — the page exists to surface the right source, not to resolve the query. My old instinct kicked in immediately: write “do NOT answer the question.”
And then I caught myself. What I actually wrote in the margin was: it’s definitely NOT answering the question — that’s the no-no — but I don’t like negative prompting, so keep it non-prescriptive, hence the north star. So instead of the prohibition I stated the positive purpose: “this page exists to make the source relevant to the query so the agent can gauge whether it’s appropriate.” And the don’t took care of itself.
Takeaway
Reframe the prohibition as a purpose. State what the work is for, and most of your don’ts dissolve — because a model pointed at the right goal doesn’t need to be forbidden the wrong ones.
Of everything in a modern prompt, the single most destructive ingredient is the wall of “stop / don’t / never.” More destructive than verbose definitions, more than rigid formatting — because it doesn’t just waste budget, it actively works against you. Three things go wrong at once.
Three ways a don’t-list backfires
1. You spotlight the forbidden thing
Naming X, Y and Z puts them front and centre in the model’s attention. You’ve made the thing you don’t want the loudest thing in the room.
2. You make it police itself
The model now spends budget checking its own output against your list instead of doing good work — the denial-of-service from Chapter 2, self-inflicted.
3. You’re solving an old model’s problem
Most don’ts are things a smart model wouldn’t have done anyway. You’re fixing GPT-3.5 behaviour on GPT-5.5 and, by telling it to think small, you summon exactly that behaviour.
Is this just my preference?
No. Even the model-makers’ own documentation says the same thing. OpenAI’s prompt-engineering best practices tell builders to do the opposite of a don’t-list: say what to do instead of what not to do. Their worked example replaces a stack of prohibitions with a single positive instruction.⁷ When the people who built the model treat the don’t-list as the weaker pattern, this isn’t a fringe take; it’s the house style of the people who know the model best.
The sibling sin: explaining the obvious
The don’t-list has a twin, and it’s just as expensive. You don’t need to tell the model what a customer is, or what a CSV looks like, or that an email has a subject line. It knows.
Every sentence you spend explaining the obvious is a sentence where the model does your remedial reading instead of being smart.
Forbidding the improbable and explaining the obvious are the same mistake wearing two coats. Both spend a finite budget on things a capable model has already handled, and both leave less for the part you actually hired it to do.
The don’ts are a fossil. They’re left over from when models were dumb and you had to fence them in. On a modern model, the fence is the problem.
Pitfall: negative prompting as a reflex
“But some don’ts are real”
They are — and they aren’t what this chapter is about. A genuinely safety-critical rule isn’t don’t-list filler; it’s a hard constraint, stated flatly, and there are only ever a few of them. The teardown here is aimed at the reflexive don’ts — the fifteen prohibitions you added because adding them felt responsible. Keep the two or three that would make the output wrong if broken. Delete the rest. (Part III gives you the test for telling them apart.)
That’s the negative case made — what to stop. Now the positive proof. In the next chapter, watch a single image prompt go from an unusable wall of text to a designed infographic, with no change to the model and no extra instruction — just a change in what we told it the work was for.
The Card That Designed Itself
One image prompt, three framings, three very different results. The evolution is the argument.
The clearest proof I have isn’t a theory — it’s an image generator. We make quote-card images for social posts, and the hardest case is a structured “trace” diagram: a side-by-side flow showing how plain retrieval and a wiki-graph each handle the same support ticket. Arrows, indentation, links, code-style comments. It’s easy to render this kind of thing dead; hard to render it well. So it’s the perfect test of specification versus orientation.
Same model throughout. Three prompts. Watch what changes.
Three framings of the same card
Attempt A — prescriptive: flatten the text
Sent the model the quote text, told it to typeset the words on a card.
Result: accurate, but an unusable wall of text. It flattened the two-column flow into one linear paragraph. Technically faithful, visually dead. We asked it to be a typist; it was a good one.
Attempt B — prescriptive: reproduce the HTML
Sent the source HTML: “render this faithfully, preserve the structure.”
Result: better — kept the diagram, the syntax colours, the arrows. But it was still transcribing, reproducing the source’s look including its flaws (wide lines clipped off the edge). A more careful typist. Still no design judgment, because we hadn’t asked for any.
Attempt C — the north star
Same HTML, reframed completely: “North star: make it visually interesting — something that earns a scroll-stop. The HTML is provided only to show the original structure and meaning. Understand it, then design a new, more interesting card that honours it. Hard constraints: keep the words exact; everything must fit (wrap, never clip); house look. Beyond those, be creative.”
Result: a genuinely designed infographic.
Attempt C came back as something neither A nor B could have produced: two columns with icons, a “VS.” badge, the retrieval path dead-ending in a red cross, and the wiki path flowing along a winding road to a flag. It invented a visual metaphor for “this one leads somewhere better — a better map.” Words exact. Nothing we asked for line-by-line; everything we actually wanted.
Key Insight
The unlock wasn’t a better description of the card. It was telling the model what the card was for — and showing it the structure, not the styling.
Was it a fluke for that one input?
No. We ran the same north-star prompt on a boring plain-paragraph quote — the kind of input that has no structure to honour — and it didn’t leave it boring. It rendered the four key verbs of the quote as a small knowledge-graph motif with the line in serif beside it. One prompt, two very different inputs, both lifted — because the instruction was “make it interesting,” not “draw a navy rectangle.”
A prescriptive prompt asks the model to be a good typist. A north-star prompt asks it to be a good designer. Same model — you choose which one shows up.
It isn’t just an image trick
If you don’t make images, the same move works on prose, and it’s where you’ll feel it. Take the over-specified analysis brief — the one with the headings pre-decided and the format locked — and replace it with a north star:
“Our company prides itself on customer satisfaction. The service department’s customer-sat is declining. Go look. Find the causes. I want insights and recommendations. What other data do you need? Here are the tools — ticketing, CRM, last year’s surveys. Write it up however makes the point best; free text is fine.”
That’s the whole prompt.
And you’ll get something a form could never produce — maybe “the problem isn’t the service department, it’s that sales is over-promising delivery dates and service eats the complaints.” A pre-structured report would never surface that, because you didn’t leave a box for it.
You have to leave room for the answer you didn’t know to ask for.
Attempt C didn’t win on a lucky sentence. It won on a shape — a way of arranging intent, context and constraint that you can lift off this card and drop onto anything. That shape is the next chapter.
The Reusable Shape
Strip the case study to its skeleton and you get a template that works on a card, a report, a summary, or a piece of code.
The card in Chapter 4 didn’t win on a clever sentence; it won on a structure. Pull that structure out and you have a template — the thing to screenshot, the thing the rest of this book aims at different jobs.
Read each slot for what it’s doing. The role orients without over-defining. The north star states the outcome and purpose, never the method, in one vivid sentence. The context is explicitly tagged as something to learn from, not to copy. The hard constraints are the two-to-five things that would make the output wrong — stated flatly (we’ll spend a whole chapter on getting this slot right). And the latitude is handed back in writing.
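The five slots can be sketched as a small builder. This is one possible reconstruction from the slot descriptions above, not the book's own template: the class name, field names, section labels, the two-to-five check and the placeholder HTML are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NorthStarPrompt:
    role: str        # orients without over-defining
    north_star: str  # outcome and purpose, never the method
    context: str     # source material to learn from, not to copy
    hard_constraints: List[str] = field(default_factory=list)
    latitude: str = "Beyond those constraints, be creative."

    def render(self) -> str:
        # Assumed guard: the text asks for two to five hard constraints.
        if not 2 <= len(self.hard_constraints) <= 5:
            raise ValueError("keep hard constraints to two-to-five items")
        constraints = "\n".join(f"- {c}" for c in self.hard_constraints)
        return (
            f"Role: {self.role}\n\n"
            f"North star: {self.north_star}\n\n"
            "Context (for structure reference only, not to be shown literally):\n"
            f"{self.context}\n\n"
            f"Hard constraints:\n{constraints}\n\n"
            f"Latitude: {self.latitude}"
        )


card = NorthStarPrompt(
    role="You are designing a quote card for a social feed.",
    north_star="Make it visually interesting: something that earns a scroll-stop.",
    context="<div class='trace'>...</div>",  # hypothetical placeholder source
    hard_constraints=[
        "Keep the words exact.",
        "Everything must fit (wrap, never clip).",
        "Use the house look.",
    ],
)
print(card.render())
```

The guard in `render` is the only rule the builder enforces; everything else is left to whoever fills the slots, which mirrors the split the template itself is making.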
The three clauses that did the work
1. “For structure reference only, not to be shown literally”
When you hand over source material, tell the model whether to transcribe it or learn from it. That one clause turned “reproduce the HTML” (Attempt B) into “understand the HTML” (Attempt C). Without it, models default to copying — and copying reproduces the flaws.
2. Name the goal’s purpose, not its properties
“Earns a scroll-stop in the feed” is optimisable — the model can plan toward it. “Colourful, navy background, serif quote” is just a checklist. Purpose is a target; properties are a cage.
3. Put the freedom in writing
“Be creative — deliver something more interesting than the plain original” actually changes behaviour. Models are conservative by default and will under-design unless you license them not to. Latitude unstated is latitude unused.
What does this look like on a real prompt?
Properties vs purpose
A properties prompt
“Navy background, serif quote in 48px, amber accent top-left, kicker, footer logo.”
→ a competent, average card. You described one card; you got it.
A purpose prompt
“Make it striking — something that earns a scroll-stop; words exact; must fit; house look.”
→ a layout you wouldn’t have thought to ask for.
The lesson under the lesson: a spec encodes the average; a north star unlocks the best. The model has seen millions of well-designed cards and you’ve described one. Your spec is a lossy compression of “good,” and the model can often beat your compression — if you let it.
Bottom Line
Name the purpose, not the properties. Tag your context as learn-from or copy. Keep the hard constraints to a handful. License the creativity out loud. Everything else, leave off.
There’s a slot in this template we waved past: Hard constraints. It looks small, but it’s where the whole thing lives or dies — because a north star with no real constraints isn’t freedom, it’s negligence. Part III starts exactly there.
A North Star Is Not “No Rules”
The dangerous misreading of this whole idea — and the structural answer to it.
Let me kill the naive version of this idea before it kills your output. The dumb reading is “stop giving instructions, let the AI cook.” That is not the lesson, and I want it dead on the page.
A loose north star isn’t freedom. It’s negligence.
I learned this building a Janitor agent for a wiki-graph — the component that compacts and merges claims over time so a knowledge base stays coherent instead of bloating into a swamp. Give the Janitor too loose a directive — “tidy up the wiki” — and it hits a failure mode I call hallucinated consolidation: it merges two genuinely different ideas into one false claim, quietly erasing a real distinction. That’s the whole risk of vagueness in a phrase. Not chaos. Confident, invisible erasure — the worst kind, because nothing looks broken.
So the north star isn’t the absence of direction. It’s a different kind. The structure that makes it work is a split — between the few things that must hold, and everything else.
The constraint / latitude split
Hard constraints
The few things that, if broken, make the output wrong — not just different. For the cards: the words must be exact; it must fit the frame; brand colours. Non-negotiable, stated flatly.
Creative latitude
Everything else: layout, metaphor, hierarchy, emphasis, flourish. Explicitly handed back — “beyond those constraints, be creative.”
The art is keeping the constraint list short and load-bearing. Every line you add is latitude you take away.
How do you know which rules to keep?
There’s one test, and it’s the most practical line in this book. Hold each rule up and ask:
“If the model ignored this, would the result be wrong, or just not how I pictured it?” If it’s the latter — cut it. That’s not your call to make; it’s the model’s room to be good.
“The words must stay exact” — break it and the card is wrong. Keep it. “Put the kicker top-left” — break it and the card is merely different, maybe better. Cut it. Run your whole prompt through that filter and most of it falls away. What’s left is the fence.
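The test can be run as a literal filter. In this sketch the `wrong_if_ignored` flag is an assumption standing in for your own judgment on each rule; the code only does the bookkeeping. The fourth rule is borrowed from the properties prompt in the previous chapter.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    text: str
    wrong_if_ignored: bool  # True: output is wrong. False: merely different.


def cut_or_keep(rules: List[Rule]) -> Tuple[List[str], List[str]]:
    """Split rules into the fence (keep) and the latitude handed back (cut)."""
    keep = [r.text for r in rules if r.wrong_if_ignored]
    cut = [r.text for r in rules if not r.wrong_if_ignored]
    return keep, cut


rules = [
    Rule("The words must stay exact", True),
    Rule("Everything must fit the frame", True),
    Rule("Put the kicker top-left", False),
    Rule("Serif quote in 48px", False),
]
fence, latitude = cut_or_keep(rules)
print(fence)     # ['The words must stay exact', 'Everything must fit the frame']
print(latitude)  # ['Put the kicker top-left', 'Serif quote in 48px']
```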
Watch me cut
You can see me applying the test to my own prompts, line by line:
• “‘translate every proposed change; drop nothing, add nothing’ → too prescriptive.”
• “don’t have rules like ‘chapters don’t support → drop’ — review appropriateness against the north star instead.”
• And the funniest, caught mid-prompt: “so much for non-prescriptive: ‘You may sharpen the wording of the headline. You may not soften the claim or swap the thesis.’ → remove.”
Every bullet is a small fence. The art now is taking fences away.
Doesn’t a short fence just mean a weaker prompt?
No, because the same north star re-aims with the job, and that’s where its power is. Point the wiki engine at a personal inbox and the directive means “build a consistent worldview.” Point it at an adversarial research corpus and the directive means “never silently delete a contradiction.” Same shape, different purpose, and the purpose is what makes the judgment correct. A rulebook can’t do that; it just sits there listing edges. (It’s telling that Andrej Karpathy, sketching his own “LLM Wiki,” reached for a periodic lint pass to catch contradictions and stale claims, near-identical in spirit to the Janitor. Independent minds, converging on the same shape.⁸)
North star = one goal + a short fence + freedom inside the fence. Not a spec, not a void.
Key Insight
The North Star gives the work its meaning. A rulebook just gives it edges to bump into.
This raises an obvious tension with something I’ve argued before — that you must specify precisely, because a model can only compile what you give it. If loose isn’t vague, how is it also not a spec? The next chapter resolves it in four words.
Tight Intent, Loose Method
“But you told me AI can only compile what I specify.” Both are true. Here’s why.
If you’ve followed my earlier arguments, this looks like a flat contradiction. I’ve said elsewhere that AI is an intention compiler — it can only compile what you specify, so vague intent compiles to generic output. Now I’m telling you to stop specifying. So which is it?
Both — because they’re about different things. The thing to be precise about is intent. The thing to stop over-specifying is procedure. A north star is high-information about purpose and low-information about method. The old prompt is the reverse — drowning in procedure, silent on purpose — which is exactly why it gets you procedure-following with no purpose-serving.
Tight intent, loose method.
Read the intention compiler correctly and there’s no contradiction at all: compile a sharp intent, not a sharp procedure. Same principle, applied to the right layer.
Where to spend your precision
| Be TIGHT on (intent) | Be LOOSE on (method) |
|---|---|
| Purpose — what “great” looks like | The step-by-step procedure |
| Audience — who it’s for | The exact format / structure |
| What success and failure look like | The list of don’ts |
| The 2–5 load-bearing constraints | Explaining what the model already knows |
So “loose” just means vague?
The opposite. Loose method demands tighter intent, and the model-makers’ own guidance proves it. OpenAI notes that its newer models “follow instructions more closely and more literally” than their predecessors, which used to “more liberally infer intent.”⁹ The flip side, in their words: “a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model.” That is the north star, validated from the other direction: one sharp sentence of intent beats a paragraph of hedged procedure. Orientation isn’t “fewer words for their own sake.” It’s precision relocated from procedure to purpose.
Most bad prompts get this backwards: vague on purpose, suffocating on procedure. Aim for the opposite.
When to stay prescriptive
North-star prompting is the right tool for tasks with a wide solution space and a taste-based target — design, writing, synthesis, “make this good.” It’s the wrong tool elsewhere, and pretending otherwise would make this whole book naive. Stay prescriptive when:
1. The output feeds a machine and must match an exact schema or format: specify it hard.
2. There’s a single correct answer, such as a calculation or a lookup. Orientation adds nothing.
3. A constraint is safety- or correctness-critical: then it’s a hard constraint, stated flatly, not latitude.
Key Insight
The skill is sorting which parts of a task are “must be exactly X” (prescribe) from “must be good” (north star) — and not accidentally prescribing the second kind.
Most over-prompting is precisely that accident: dictating the creative parts because dictating feels safer, when it’s what’s holding the quality down. The cut-or-keep test from Chapter 6 is how you sort them — apply it per requirement, not to the prompt as a whole.
“But I need valid JSON”
You hear this objection constantly, and it has a clean answer: don’t crush the thinking step into a rigid schema just because a downstream step wants structure. Separate them. Let the smart pass be smart and messy; let a dumb, cheap, deterministic pass tidy it up afterwards. Intelligence first as free text; structure second as a cheap transform. You do prescribe the JSON — in the second pass, where prescription belongs. That’s the heart of not making one prompt do two jobs at once.
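A minimal sketch of the two-pass split, assuming two hypothetical callables: `think` for the free-text pass and `structure` for the cheap tidy-up pass. The stubs below return canned strings so the example runs offline, and the field names are invented.

```python
import json
from typing import Callable, Dict, FrozenSet


def two_pass(
    brief: str,
    think: Callable[[str], str],
    structure: Callable[[str], str],
    required_keys: FrozenSet[str],
) -> Dict[str, object]:
    """Pass 1: free-text analysis. Pass 2: strict JSON, validated in code."""
    analysis = think(brief)    # no schema here: let it reason freely
    raw = structure(analysis)  # prescription lives in this pass only
    data = json.loads(raw)     # raises ValueError on invalid JSON
    missing = required_keys - set(data)
    if missing:
        raise KeyError(f"structured pass omitted: {sorted(missing)}")
    return data


# Offline stubs (hypothetical): swap in real model calls.
def stub_think(brief: str) -> str:
    return "Sales is over-promising delivery dates; service eats the complaints."


def stub_structure(analysis: str) -> str:
    return json.dumps({"root_cause": analysis, "owner": "sales"})


report = two_pass(
    "Why is customer-sat declining?",
    stub_think,
    stub_structure,
    frozenset({"root_cause", "owner"}),
)
print(report["owner"])  # sales
```

Because the schema check lives in ordinary code, a malformed second pass fails loudly and can be retried cheaply without rerunning the expensive thinking pass.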
The doctrine is now complete on paper. But does it survive contact with reality — with cheap models and ordinary jobs, not frontier showpieces? The next chapter is nothing but field receipts.
When Less Prompt Wins
Field notes from real runs — including the one where a cheap model got better the moment I stopped instructing it.
Here’s the objection this chapter exists to kill: “Sure, loosen the prompt for a giant frontier model — but my use case runs a small, cheap model, which surely needs more hand-holding, not less.” It’s the most reasonable-sounding objection in the book. It’s also backwards.
The Haiku reversal
I was getting bad output from a cheap model — Haiku — on a dental-market report. My instinct was the universal one: add more instructions. Wrong fix. What I actually wrote to course-correct was this:
“I’d like the prompt to be less prescriptive, allow the model more freedom to use its own ideas. Present the data and the benchmark, and ask it to comment. The over-prescription was tripping up Haiku — make the data more clear, as opposed to telling it what to do. Give it all the data, and let it write about it.”
The over-prescription was causing the small model to choke. The fix wasn’t more rules — it was clearer data and fewer instructions. That’s the counter-intuitive heart of the whole thing: when a weaker model struggles, the reflex is to constrain harder, and that’s often exactly backwards.
The receipt came the next session, in one line: “the less prescriptive run, on medium, is pure gold compared to what we had before.” Less prompt, smaller model, better output. Read that twice — it breaks the intuition that capability and instruction trade off in your favour.
The less prescriptive run, on medium, is pure gold compared to what we had before.
Three more from the logbook
Let it write its own rubric
Scoring some models blind, I told the agent to generate its own scoring criteria — only light prompting, let it do the work, not me. A frontier model often builds a better rubric than you would. This is orientation applied to judgment itself: you hand over not just the answer, but the standard for judging it.
Don’t pre-decide the findings
Writing a research-report prompt, I cut my own conclusions: “I’d rather remove prescriptive findings — that’s the agent’s job once it considers the data. We can have loose suggestions, ideas to consider, but the conclusions are the work.” The conclusions are the work — don’t steal them from the model. Suggestions, loose; conclusions, left to the model.
No fixed word counts
Generating one-line summaries for a knowledge graph: “give the model a north star, don’t ask for a specific number of words.” The model knows how long a good one-liner is. A word count just forces padding or truncation — it degrades the thing it was meant to control.
Four notes, one move: keep shifting work off brittle deterministic heuristics and onto the model’s judgment. It’s the same instinct behind nuking and regenerating outputs rather than nursing them — because AI commoditised thinking; it didn’t commoditise judgment, and “interesting,” “good,” and “the right finding” are all judgment.
Remember
These wins are on taste-based, wide-solution-space tasks — commentary, scoring, synthesis, summaries. For a calculation or a schema, the boundary from Chapter 7 still holds: prescribe.
The Haiku lesson isn’t “never instruct a small model.” It’s “don’t drown a struggling model in procedure when the real problem is murky data.” Same instinct — now point it the other way, at the context you feed the model rather than the instructions you give it.
Read Widely, Then Decide
The same instinct, pointed not at the instructions you give the model — but at the context you feed it.
Classic retrieval spends all its cleverness before the model sees anything: chunking, embeddings, re-rankers, endless agonising over “did I fetch exactly the right eight paragraphs?” Look at that instinct and you’ll recognise it — it’s the same control-freak reflex as the over-specified prompt. Don’t trust the model; pre-digest everything; hand it a tiny, pre-approved window. On the instruction side it’s a wall of don’ts. On the data side it’s a hand-tuned pipeline. Same distrust, different surface.
The move that beat it, for relationship-heavy corpora, was almost embarrassingly simple: stop minimising what the model reads. Load the whole territory and let the model’s attention do the selecting. The hard part — relevance judgment — happens inside one forward pass, instead of in a brittle pipeline you tuned by hand.
Reading widely, even back to the source, and then deciding isn’t laziness; it’s the correct method for a smart reader. The instinct to pre-chew the data was right when the reader couldn’t be trusted to chew for itself. It isn’t anymore. (It’s a sign of where this is going that Andrej Karpathy independently sketched an “LLM Wiki”, built from claims and edges with the model doing the bookkeeping and a periodic lint pass to keep it coherent, that is near-identical in shape to the self-cleaning wiki-graph we’d been running [8].)
There’s a way to do this that respects Chapter 2’s warning: you orient the agent with the terrain map — the root pages and one-line descriptions — as its first context, so it navigates rather than searches. You load the map, not the swamp. That’s a subject of its own; here it’s the proof that the same north-star instinct works on the input axis too.
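As a rough illustration of "load the map, not the swamp", here is a minimal sketch of building that first context. The `Page` shape, the `build_terrain_map` name and the example paths are all assumptions made for the example, not the schema of the wiki-graph described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One wiki page. Hypothetical shape, not the author's actual schema."""
    path: str
    summary: str  # the one-line description
    is_root: bool = False


def build_terrain_map(pages: list[Page]) -> str:
    """Render root pages and their one-liners as the agent's first context."""
    roots = sorted((p for p in pages if p.is_root), key=lambda p: p.path)
    lines = ["# Terrain map (read this first, then open pages as needed)"]
    lines += [f"- {p.path}: {p.summary}" for p in roots]
    return "\n".join(lines)


# Illustrative data only.
pages = [
    Page("clients/index.md", "Every client, one line each", is_root=True),
    Page("clients/acme/q3-notes.md", "Raw meeting notes for Q3"),
    Page("projects/index.md", "Active projects and their owners", is_root=True),
]
terrain_map = build_terrain_map(pages)
print(terrain_map)
```

The deep pages stay out of the map on purpose. The agent sees where things live and opens what it needs, which is navigation rather than search.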
Move the intelligence out of your brittle pre-processing and into the model, because the model is now the smartest thing in the loop. It didn’t use to be. That’s what changed.
Two axes, one move
| Axis | The old, distrustful move | The orientation move |
|---|---|---|
| Instructions | Pre-think the procedure (a full spec) | Give intent, let it reason |
| Context | Pre-digest the data (minimal retrieval) | Give it the territory, let attention select |
Both relocate the intelligence from your pipeline into the model. That’s the doctrine; this chapter is just the second place it shows up.
Don’t over-read this
If you trust the model with both the instructions and the context, one fair fear remains: how do you keep it honest? You can’t eyeball a rigid output checklist anymore. That’s the last chapter of the doctrine — and the one your most cautious stakeholder has been waiting for.
From the Leash to the Audit
If you let the model roam, how do you stay in control? You don’t lose control. You move it.
This is the chapter for the reader — often a leader — who hears “drop the don’ts, give it latitude” as “abandon control.” It isn’t. The obvious objection is real: if you let the model roam — loose method, broad context, room to surprise you — how do you know it did well? You can’t eyeball a rigid output checklist anymore. But the answer isn’t to put the leash back. It’s to put your rigour where it actually holds.
Test the path, not the answer
Testing the wiki agents, I found you stop testing the answer and start testing the path. What did it read? Which sources did it go to? Did it hit the canonical page or wander off to a tangent? How did it navigate? The trace of how it thought is more inspectable — and more stable — than the final text, because the text is partly a function of how sharp the model is on the day. The navigation is the reasoning made visible.
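The questions above translate fairly directly into checks on a trace. What follows is a minimal sketch, assuming the agent logs each page it opens; `Step`, `audit_path` and the `max_steps` budget are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry in a hypothetical agent trace: the page it opened."""
    page: str


def audit_path(trace: list[Step], canonical: str, max_steps: int = 10) -> dict:
    """Check how the agent navigated, not what it finally wrote."""
    visited = [s.page for s in trace]
    hit = canonical in visited
    return {
        "hit_canonical": hit,
        "steps_to_canonical": visited.index(canonical) + 1 if hit else None,
        "within_budget": len(visited) <= max_steps,
        "revisits": len(visited) - len(set(visited)),
    }


# Illustrative trace only.
trace = [Step("index.md"), Step("clients/index.md"), Step("clients/acme/contract.md")]
report = audit_path(trace, canonical="clients/acme/contract.md")
print(report)
```

None of these checks read the final text. They stay stable across reruns even when the wording of the answer changes.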
Key Insight
Governance doesn’t disappear when you stop micro-managing the prompt. It moves — from “did it obey my rules?” to “did it reason well, and can I see the reasoning?”
Why the input-leash was never going to hold
Here’s the uncomfortable part for anyone who thinks a rule in the prompt is a control: prompt-level rules are bypassable by design, and the numbers are brutal. Reinforcement-learning “investigator” agents jailbreak today’s frontier models (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) at success rates of 78 to 92 per cent on high-risk behaviours [10]. Cisco ran fifty standard jailbreak prompts at DeepSeek R1 and got a 100 per cent bypass rate: every safety rule in the prompt, gone [11]. Prompt injection sits at the very top of the OWASP risk list for LLM applications for exactly this reason [12].
A rule in the prompt is etiquette, not enforcement
78–92%: jailbreak success against GPT-5, Claude Sonnet 4, Gemini 2.5 Pro (Transluce)
100%: bypass of DeepSeek R1’s safety rules across 50 prompts (Cisco / HarmBench)
If a don’t-list can be talked around four times out of five, it was never a control. It was a polite request. The hard backstop sits outside the model: scope what it can reach, enforce policy in deterministic code, make the unwanted action structurally impossible rather than merely forbidden. Prompts are manners; architecture is physics — and can’t beats shouldn’t every time.
Manners vs physics
Manners (the leash)
- • Prompt rules, don’t-lists, “please don’t”
- • Probabilistic — bypassable (see the rates above)
- • Works when everything is fine; fails when it matters
Physics (the audit)
- • Scoped permissions, deterministic gates, path traces
- • Structural — enforced, not requested
- • The model proposes; a cheap, reliable layer checks
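To make the "physics" column concrete, here is a minimal sketch of a gate that sits outside the model and checks every proposed action in plain code. `Scope`, `gate`, `ActionDenied` and the tool and path names are assumptions for illustration, not any particular framework's API.

```python
from dataclasses import dataclass


class ActionDenied(Exception):
    """Raised when a proposed action falls outside the granted scope."""


@dataclass(frozen=True)
class Scope:
    """What this agent may touch. Anything not listed is unavailable."""
    allowed_tools: frozenset[str]
    readable_prefixes: tuple[str, ...]


def gate(scope: Scope, tool: str, target: str) -> None:
    """Deterministic check that runs in code, outside the model."""
    if tool not in scope.allowed_tools:
        raise ActionDenied(f"tool not granted: {tool}")
    if not target.startswith(scope.readable_prefixes):
        raise ActionDenied(f"target out of scope: {target}")


# Hypothetical scope: this agent can read wiki pages and nothing else.
scope = Scope(frozenset({"read_page"}), ("wiki/",))
gate(scope, "read_page", "wiki/clients/index.md")  # passes silently

try:
    gate(scope, "delete_page", "wiki/clients/index.md")
except ActionDenied as err:
    print(err)
```

No prompt wording can talk its way past `gate`, because the check never consults the model. A production version would also normalise paths before the prefix test.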
Where the rigour goes instead
You don’t make a big AI decision governable by cramming rules into the prompt. You make it governable by having it show its working — propose freely, emit its reasoning as small, checkable claims — and then checking that working with deterministic code. The model proposes, freely and smartly; a cheap, dumb, reliable layer checks. You don’t lose the audit when you drop the don’ts; you relocate it, from input-rule-obedience to path-and-output verification.
Don’t put the governance in the leash. Put it in the audit.
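One way to picture "the model proposes, a cheap layer checks" is a list of small claims verified against source data. This is a sketch under stated assumptions: the `Claim` shape, the three comparison operators and the revenue figures are invented for the example, and a real system would define its own claim vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One small, checkable statement the model emits alongside its answer."""
    field: str
    op: str  # one of ">", "<", "=="
    value: float


OPS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def check_claims(claims: list[Claim], data: dict[str, float]) -> list[tuple[Claim, bool]]:
    """Verify each claim against source data with plain, deterministic code."""
    results = []
    for c in claims:
        ok = c.field in data and c.op in OPS and OPS[c.op](data[c.field], c.value)
        results.append((c, ok))
    return results


# Illustrative data and claims only.
data = {"q3_revenue": 120.0, "q2_revenue": 100.0}
claims = [
    Claim("q3_revenue", ">", 100.0),  # supported by the data
    Claim("q2_revenue", ">", 110.0),  # contradicted by the data
    Claim("q4_revenue", ">", 0.0),    # refers to a field that does not exist
]
for claim, ok in check_claims(claims, data):
    print("PASS" if ok else "FAIL", claim)
```

The checker is deliberately dumb. It cannot judge whether the insight is good, only whether the facts the insight rests on hold up.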
When no human is in the loop
Push this far enough and the north star stops being a prompt and becomes a file. On one project the directive lives in a NORTH_STAR.md that an hourly autonomous loop reads first, then does one north-star-aligned “bite” of work and stops. State the destination clearly enough and a self-directing agent can navigate without a turn-by-turn route — which is the whole game once the unit of design shifts from the one-shot prompt to the loop, with its durable state living outside the agent where the path stays inspectable.
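A single tick of such a loop might look like the sketch below. Only the NORTH_STAR.md filename comes from the project described above; the `state.json` file, the `run_one_bite` function and the `do_bite` callback (a stand-in for the model call) are assumptions. Scheduling is left to whatever runs the script each hour.

```python
import json
import tempfile
from pathlib import Path


def run_one_bite(workdir: Path, do_bite) -> dict:
    """One tick of the loop: read the north star, do one bite, persist, stop."""
    north_star = (workdir / "NORTH_STAR.md").read_text(encoding="utf-8")
    state_file = workdir / "state.json"  # hypothetical name for the durable state
    if state_file.exists():
        state = json.loads(state_file.read_text(encoding="utf-8"))
    else:
        state = {"log": []}

    # do_bite stands in for the model call and returns a short note on what it did.
    note = do_bite(north_star, state)

    # State lives on disk, outside the agent, so the path stays inspectable.
    state["log"].append(note)
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


# Usage with a stub in place of the model, run twice to show state carrying over.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "NORTH_STAR.md").write_text("Keep the wiki coherent.", encoding="utf-8")
    stub = lambda north_star, state: f"bite {len(state['log']) + 1}: {north_star}"
    run_one_bite(workdir, stub)
    final = run_one_bite(workdir, stub)
    print(final["log"])
```

Each run starts cold and learns everything it needs from two files. That is what makes the loop auditable: the log is the path.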
The doctrine is complete now: aim it, fence the load-bearing few, and audit the path. What’s left is the mindset that makes all of it click into place — the difference between managing a clerk and leading an expert.
Lead the Expert
We spent five years learning to manage clerks. The clerks turned into experts.
Step back from all of it and the change is a mindset, not a technique. So let me put the uncomfortable version on the page.
Your prompt is a confession. A wall of don’ts confesses you don’t trust the model — or that you’re still prompting the model you used three years ago.
And its sibling: if your prompt would insult your smartest employee, it’s insulting your model too — and you’ll get insulted-employee work back. Hand someone a form and a list of twenty prohibitions and you’ve told them, in writing, that you expect very little. They’ll meet the expectation precisely.
The skill moved. The old skill was writing a complete specification. The new skill is writing a clear north star and tolerating a surprising answer. The work is now in the aiming, not the fencing.
The spec was never the quality. It was the ceiling.
The whole argument, in one breath
The model got smart while your prompt stayed dumb (Ch. 1). Every rule is a tax on a fixed attention budget (Ch. 2), and the don’t-list is the worst offender (Ch. 3). Orientation beats specification, provably (Ch. 4–5) — but loose isn’t vague (Ch. 6): be tight on intent, loose on method, and know when to prescribe (Ch. 7). It holds empirically, even on cheap models (Ch. 8). The same instinct fixes your context, not just your instructions (Ch. 9). And rigour doesn’t vanish — it moves to the audit (Ch. 10).
The message is the same whether you write prompts for a living or you’re a leader who thinks “AI governance” means a longer rulebook: the model got smart while your prompt stayed dumb. Catch up. Be tight about why, loose about how. And if the answer comes back as messy free text with a brilliant insight buried in it — that’s not a bug to constrain away, it’s the entire point. Have a cheap second pass clean it up.
Try it on your next prompt
Take one prompt you’re proud of — the long, careful, detailed one. Then:
- • Delete every “don’t.”
- • Collapse the procedure into a single north-star sentence of what great looks like.
- • Keep only the two-to-five constraints that would make the output wrong if broken.
- • Run the old version and the new version side by side on a frontier model — and read both.
The looser one is usually better. If it is, you’ve just found the ceiling you’d been writing into your own prompts. Tell me what you cut — leverageai.com.au.
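If you want a head start on the first step, a few lines of code can at least find the prohibitions for you. This is a crude sketch: the `PROHIBITION` pattern only catches lines that open with "don't", "do not" or "never", and `split_prohibitions` and `lean_prompt` are hypothetical helpers. Choosing the north star and the fence remains your judgment.

```python
import re

# Matches a line that opens with a prohibition, with or without a leading bullet.
PROHIBITION = re.compile(r"^\s*(?:[-*•]\s*)*(?:don['’]?t|do not|never)\b", re.IGNORECASE)


def split_prohibitions(prompt: str) -> tuple[list[str], list[str]]:
    """Separate a prompt's lines into (kept, cut), cutting lines that open with a "don't"."""
    kept, cut = [], []
    for line in prompt.splitlines():
        (cut if PROHIBITION.match(line) else kept).append(line)
    return kept, cut


def lean_prompt(north_star: str, constraints: list[str]) -> str:
    """Assemble the rewrite: one north-star sentence plus a short fence."""
    fence = "\n".join(f"- {c}" for c in constraints)
    return f"{north_star}\n\nConstraints:\n{fence}"


# Illustrative prompt only.
old_prompt = """Summarise the quarterly report for the board.
Don't use jargon.
- Never exceed one page.
Figures must match the source spreadsheet.
- do not speculate."""

kept, cut = split_prohibitions(old_prompt)
new_prompt = lean_prompt(
    "Write the summary a time-poor board member would thank you for.",
    ["Figures must match the source spreadsheet.", "Name the source of every number."],
)
print(f"cut {len(cut)} lines, kept {len(kept)}")
print(new_prompt)
```

Run `old_prompt` and `new_prompt` against the same input and read both outputs yourself. The comparison is the part no script can do for you.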
It’s the difference between managing a clerk and leading an expert. We spent five years learning to manage clerks. The clerks turned into experts. Time to learn to lead.
The one-page version
Specification → orientation • Tight intent, loose method • Cut the don’ts • Keep only the load-bearing fence • Free text now, structure later • Audit the path, not the leash • The spec was the ceiling.
References & Sources
The evidence base behind every claim — primary research, industry analysis, and technical specifications
Research Methodology
This ebook draws on primary research from standards bodies, independent research firms, enterprise technology vendors, and consulting firms. Statistics cited throughout have been cross-referenced against primary sources.
Frameworks and interpretive analysis developed by Scott Farrell / LeverageAI are listed separately below — these represent the practitioner lens through which external research is interpreted, and are not cited inline to avoid self-promotional appearance.
Industry Analysis & Vendor Research
Andrew Ng / DeepLearning.AI, The Batch — How Agents Can Improve LLM Performance [1]
On HumanEval: GPT-3.5 zero-shot 48.1%, GPT-4 zero-shot 67.0%, GPT-3.5 in an agent loop up to 95.1%
https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/
Chroma Research, 2025 — Context Rot: How Increasing Input Tokens Impacts LLM Performance [4]
18 frontier models grow increasingly unreliable as input length grows, within their context windows
https://www.trychroma.com/research/context-rot
Andrej Karpathy — LLM Wiki [8]
Claims-and-edges knowledge base with a periodic lint pass for contradictions, stale claims, orphans
https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
Transluce — Automatically Jailbreaking Frontier Language Models with Investigator Agents [10]
RL investigator agents reach 78-92% attack success against GPT-5, Claude Sonnet 4, Gemini 2.5 Pro
https://transluce.org/jailbreaking-frontier-models
Andrej Karpathy, YC AI Startup School 2025 — Software Is Changing (Again) [13]
Autonomy slider; keep AI on a leash via fast verification loops; Iron Man suits not robots
https://www.youtube.com/watch?v=LCEmiRjPEtQ
Primary Research & Standards Bodies
Anthropic — Effective Context Engineering for AI Agents [2]
LLMs have an attention budget; context is a finite resource with diminishing marginal returns; n-squared attention
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Liu et al., Transactions of the ACL 2024 — Lost in the Middle: How Language Models Use Long Contexts [3]
Performance highest at start/end of context, degrades significantly in the middle (the U-curve)
https://aclanthology.org/2024.tacl-1.9/
Anthropic — Code Execution with MCP [5]
With thousands of tools, agents process hundreds of thousands of tokens before reading a request
https://www.anthropic.com/engineering/code-execution-with-mcp
Yin et al., arXiv 2508.16260, 2025 — MCPVerse: A Real-World Benchmark for Agentic Tool Use [6]
Agent performance degrades substantially as the number of available tools increases
https://arxiv.org/abs/2508.16260
OpenAI — Best Practices for Prompt Engineering with the OpenAI API [7]
Guidance: say what to do instead of what not to do; reframe prohibitions as positive instructions
https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
OpenAI Cookbook — GPT-4.1 Prompting Guide [9]
Newer models follow instructions more closely and literally; one clear sentence is usually sufficient to steer behaviour
https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide
Cisco / HarmBench, arXiv 2504.11168 — Evaluating Safety of DeepSeek R1 with HarmBench [11]
50 HarmBench jailbreak prompts achieved 100% attack success against DeepSeek R1
https://arxiv.org/abs/2504.11168
OWASP GenAI Security Project — LLM01:2025 Prompt Injection [12]
Prompt injection ranked the #1 LLM application risk; prompt-level controls are bypassable
https://genai.owasp.org/llmrisk/llm01-prompt-injection/
LeverageAI / Scott Farrell — Practitioner Frameworks
The interpretive frameworks, architectural patterns, and practitioner analysis in this ebook were developed through enterprise AI transformation consulting. The articles below are the underlying thinking behind those frameworks. They are listed here for transparency and further exploration — not cited inline, as this is the author's own analytical voice.
Scott Farrell / LeverageAI — Context Engineering: Why Building AI Agents Feels Like Programming on a VIC-20 Again
Context is attention not just capacity; cluttered context degrades output
https://leverageai.com.au/context-engineering-why-building-ai-agents-feels-like-programming-on-a-vic-20-again/
Scott Farrell / LeverageAI — The Index Is the Data: How a Self-Cleaning Wiki-Graph Out-Thinks RAG
North Star Governance; Ingestion + Janitor under one directive; hallucinated consolidation
https://leverageai.com.au/the-index-is-the-data-how-a-self-cleaning-wiki-graph-out-thinks-rag/
Scott Farrell / LeverageAI — The Uncomfortable Truth About AI and Effort
AI as intention compiler; output quality is bounded by precision of intent
https://leverageai.com.au/the-uncomfortable-truth-about-ai-and-effort/
Scott Farrell / LeverageAI — Pre-Thinking Prompting: Why Your AI Outputs Fail and How to Fix Them
The Two-Job Trap; separate understanding from solving; free text now, structure later
https://leverageai.com.au/pre-thinking-prompting-why-your-ai-outputs-fail-and-how-to-fix-them/
Scott Farrell / LeverageAI — Stop Nursing Your AI Outputs: Nuke Them and Regenerate
AI commoditised thinking, not judgment; treat outputs as regenerable, invest in the kernel
https://leverageai.com.au/stop-nursing-your-ai-outputs-nuke-them-and-regenerate/
Scott Farrell / LeverageAI — AI Doesn't Fear Death: You Need Architecture, Not Vibes, for Trust
Prompts are manners, architecture is physics; can't beats shouldn't; scope permissions not behaviour
https://leverageai.com.au/ai-doesnt-fear-death-you-need-architecture-not-vibes-for-trust/
Scott Farrell / LeverageAI — Stop Asking AI Why It Decided: Build Decisions That Carry Their Own Proof
Model proposes via micro-judgements; deterministic code evaluates the graph; governance by structure
https://leverageai.com.au/stop-asking-ai-why-it-decided-build-decisions-that-carry-their-own-proof/
Scott Farrell / LeverageAI — Designing Loops, Not Prompts
The unit shifts from prompt to loop; durable state outside the agent; the path is inspectable
https://leverageai.com.au/designing-loops-not-prompts-a-field-guide-to-agentic-loops-and-who-holds-the-state-machine/
Scott Farrell / LeverageAI — Markdown as an Operating System
The north star as a plain-text file (NORTH_STAR.md / AGENTS.md) the agent reads and steers by
https://leverageai.com.au/markdown-as-an-operating-system/
About This Reference List
Compiled June 2026. All URLs verified at time of compilation. Regulatory documents and standards specifications are subject to revision — check primary sources for the most current versions.
Some links to academic papers and vendor research may require free registration. Government and standards body publications are freely accessible.