The North Star Prompt: Stop Writing Specs for Models That Can Think

Scott Farrell • June 30, 2026 • scott@leverageai.com.au

Tight intent, loose method – why over-prompting a smart model is a denial-of-service attack on its intelligence, and why a north star still isn’t “no rules.”

By Scott Farrell, LeverageAI – leverageai.com.au

TL;DR

  • The world changed under the prompt. The prescriptive style – tasks, formats, a wall of “don’t”s – was right for GPT-3 and early GPT-4. On models that can actually think, it now caps the output instead of raising it.
  • A prompt used to be a specification. Now it’s an orientation. Give the model a clear north star, a short fence of genuinely load-bearing constraints, and the latitude to be good – then get out of the way. The most destructive ingredient to delete first is the don’t-list.
  • Loose is not vague, and rigour doesn’t vanish – it moves. A too-loose north star produces confident nonsense; the craft is tight intent, loose method. And governance shifts from the input leash (“did it obey my rules?”) to the audit (“did it reason well, and can I see the path?”).

Picture your brightest staff member. Genuinely sharp, creative, sees around corners. Now sit them down, hand them a form, and then hand them a ten-page document on how to fill out the form. Put the date here. Format is DD/MM/YYYY. Don’t put it in the wrong box. Here’s what each field means. Here are twenty things not to write.

And then you’re surprised when you get back a perfectly-filled-out form with no insight in it. No brilliance. No “hey, I noticed something weird in the Q3 numbers.” Of course there’s no insight – you didn’t ask for insight, you asked for a form. You took your smartest person, turned them into a data-entry clerk, and then wondered where the genius went.

The genius went into getting the date format right. Because that’s what you signalled mattered.

I’ve changed my mind about how you’re supposed to prompt these models, and this is the write-up. The short version: I used to write prompts like a list of tasks. Do this, then that, here are the parameters, here’s the format, here are all the things you must never do. And it worked – back on GPT-3 and early GPT-4, you kind of had to. The model wasn’t that bright, so you thought for it. You handed it a very narrow lane and basically drove the car yourself, with the model as the steering wheel.

But the models changed. GPT-5.5, Claude 4.7, 4.8 – these things are genuinely brilliant now. And my old prompt style is actively hurting the result. I’m hemming the model in. I’m taking something that could give me a surprising, insightful answer and forcing it to colour inside the lines of my crappy little form.


The world changed under the prompt

Here’s the shift, named plainly. The old model of prompting is specification: you, the human, hold the intelligence, and the prompt is the spec you hand to a fast-but-dim executor. Every rule, format and don’t is you doing the thinking up front because the model can’t.

The new model is orientation: the model holds the intelligence, and the prompt’s job is not to think for it but to aim it. You give it a north star – the goal, the why, who it’s for, what “good” and “wrong” mean – hand it the tools, and get out of the way.

A prompt used to be a specification. Now it’s an orientation. You used to write the instruction manual. Now you write the mission.

And there’s hard evidence that the posture around a model can matter more than the model itself. In Andrew Ng’s now-famous numbers on a coding benchmark, GPT-3.5 scored 48.1% zero-shot; GPT-4 zero-shot did better at 67.0%. But wrapped in an iterative agent workflow – given room to plan, draft, critique and revise toward a target – GPT-3.5 reached 95.1%, blowing past raw GPT-4.[11] (It’s worth saying out loud, because it’s mis-quoted everywhere: that 48%→95% jump is GPT-3.5, not GPT-4.[11]) The lesson isn’t “agents are magic.” It’s that how you orient a capable model toward an outcome can outweigh a generation of raw capability. Aim it, don’t cage it.


Over-specification now backfires (the mechanism)

This isn’t aesthetics. There’s a mechanism, and it’s precise.

A modern model has a roughly fixed cognitive budget per response. Every token of constraint, definition and prohibition is budget spent on you – on parsing and complying – instead of on the problem. Over-specification doesn’t make a smart model safer; it makes it dumber, because you’ve reallocated its attention from the work to the rules.

You don’t have to take my word for the budget metaphor. Anthropic’s own engineering guidance says it almost verbatim: LLMs have an “attention budget” they draw on when parsing context, and that context “must be treated as a finite resource with diminishing marginal returns.”[1] The reason is architectural – a transformer lets every token attend to every other token, which is n² pairwise relationships for n tokens, so “as its context length increases, a model’s ability to capture these pairwise relationships gets stretched thin.”[1] This is exactly what we mean in our own Context Engineering work when we say context is attention, not just capacity.[16] Every “do not” and every redundant definition is drawn from the same finite pool the model needs for thinking.
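To make the n² point concrete, here is a minimal sketch of the arithmetic. The function names are illustrative only, and it counts pairs exactly as the quoted guidance does (n² for n tokens); it says nothing about how any particular model spends its attention.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Token-to-token pairs for a context of n tokens, counted as n squared."""
    return n_tokens * n_tokens


def growth_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times more pairs the longer context has than the shorter one."""
    return pairwise_relationships(new_tokens) / pairwise_relationships(old_tokens)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7,} tokens -> {pairwise_relationships(n):>15,} pairs")
    # Ten times the prompt length means a hundred times the pairs.
    print(growth_factor(1_000, 10_000))
```

The takeaway is the shape of the curve: prompt length grows linearly, the number of relationships the model has to cover grows quadratically.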

And it’s measurable. The peer-reviewed “Lost in the Middle” study found models systematically fail to use information buried in the middle of a long context – performance is highest at the start and end and sags in between.[2] Chroma’s 2025 “Context Rot” tests put eighteen frontier models – GPT-4.1, Claude 4, Gemini 2.5, Qwen3 – through the same wringer and found all of them grow “increasingly unreliable as input length grows,” even on simple tasks, well inside their advertised windows.[3] A million-token window is not a licence to fill it. Pile in seventeen don’ts and the one rule that actually matters is now buried in the middle, competing for attention with sixteen rules that don’t.

The agent world shows the same shape. Anthropic notes that when agents are wired to thousands of tools, “they’ll need to process hundreds of thousands of tokens before reading a request.”[4] Benchmarks like MCPVerse confirm the direction: performance degrades as the action space grows, because the model has to find the right tool in a bigger haystack.[5] More descriptions of capability, less budget for the actual job. More rules, same story.

Over-prompting a smart model is a denial-of-service attack on its intelligence.

So when you tell a smart AI to stay in a narrow lane, it will stay in the lane – it’s obedient, it’s a good employee. But it turns off most of its brain to do it. Or worse: most of its thinking gets consumed by the minute details of your prompt. It’s spending its cognition parsing your ten rules, satisfying your format, remembering your seventeen don’ts – instead of actually thinking about the problem. You’ve spent its intelligence on compliance instead of on the work. You’re prompting GPT-4.8 like it’s GPT-3.5, and you get GPT-3.5 behaviour back, because you told it to think small.


The most destructive ingredient: the don’t-list

If I had to name the single worst thing in a modern prompt, it’s the wall of “stop / don’t / never” – the negative constraints. When you load a prompt with “do not do X, do not mention Y, never assume Z,” three bad things happen at once:

  • You spotlight the forbidden thing. You’ve just put X, Y and Z front and centre in the model’s attention – you’ve made the thing you don’t want the loudest thing in the room.
  • You make it police itself. The model now spends budget checking its own output against your list instead of doing good work.
  • You’re solving an old model’s problem. Most of those don’ts are things a smart model wouldn’t have done anyway. You’re prompting GPT-4.8 like it’s GPT-3.5 – and getting GPT-3.5 behaviour back.

This isn’t a fringe opinion. OpenAI’s own prompt-engineering guidance tells builders to do the opposite of a don’t-list: “say what to do instead” of what not to do – their worked example replaces a string of prohibitions with a single positive instruction.[6] The fix for a prohibition is a purpose.

I caught myself doing this just the other day. I needed an agent to understand that a particular page’s job is not to answer the user’s question. The old instinct is to write “do NOT answer the question.” But I stopped myself: answering the question is definitely the no-no, but I don’t like negative prompting, so I kept it non-prescriptive and reached for the north star. Instead of the prohibition I stated the positive purpose – “this page exists to make the source relevant to the query so the agent can gauge whether it’s appropriate” – and the don’t took care of itself. Reframe the prohibition as a purpose.

The don’ts are a fossil. They’re left over from when models were dumb and you had to fence them in. On a modern model, the fence is the problem.


The replacement: the north star

The clearest proof I have isn’t theory – it’s an image generator. We make quote-card images for social posts, and the hardest case is a structured “trace” diagram: a side-by-side flow of how plain RAG versus a wiki-graph handles the same support ticket. Arrows, indentation, links, code-style comments. Here’s the evolution, because the evolution is the argument.

  • Attempt A – prescriptive, flatten the text. Send the model the quote text, tell it to typeset the words. Result: accurate, and an unusable wall of text. It flattened the two-column flow into one linear paragraph. Technically faithful, visually dead.
  • Attempt B – prescriptive, reproduce the HTML. Send the source HTML: “render this faithfully, preserve the structure.” Better – it kept the diagram, the colours, the arrows. But it was still transcribing, reproducing the source’s look including its flaws (wide lines clipped off the edge).
  • Attempt C – the north star. Same HTML, reframed completely: “NORTH STAR: make it visually interesting – something that earns a scroll-stop. The HTML is provided only to show the original structure and meaning. Understand it, then design a new, more interesting card that honours it. Hard constraints: keep the words exact; everything must fit – wrap, never clip; house look. Beyond those, be creative.”

Attempt C came back a genuinely designed infographic: two columns with icons, a “VS.” badge, the RAG path dead-ending in a red ✕, and the wiki path flowing along a winding road to a flag – it invented a visual metaphor for “this one leads somewhere better.” Words exact. Nothing we asked for line-by-line; everything we actually wanted.

The unlock wasn’t a better description of the card. It was telling the model what the card was for – and showing it the structure, not the styling.

A prescriptive prompt asks the model to be a good typist. A north-star prompt asks it to be a good designer. Same model – you choose which one shows up.

The same thing works on prose. Take the over-specified service report and replace it with: “Our company prides itself on customer satisfaction. The service department’s customer-sat is declining. Go look. Find the causes. I want insights and recommendations. What other data do you need? Here are the tools – ticketing, CRM, last year’s surveys. Write it up however makes the point best; free text is fine.” That’s the whole prompt. And you’ll get something a form never could have produced – maybe “the problem isn’t the service department, it’s that sales is over-promising delivery dates and service eats the complaints.” A form would never surface that. You have to leave room for the answer you didn’t know to ask for.

The reusable shape

Strip it down and the pattern is portable:

You are <role / goal in one line>.

NORTH STAR: <one vivid sentence of what GREAT looks like –
the outcome, not the method>.

<Context / source material – provided to inform, explicitly
"for understanding, not to copy literally">

Hard constraints (do not break):
- <the 2–5 things that would make the output WRONG if violated>

Beyond those constraints, be creative – <restate the latitude>.

Three clauses did most of the work for us:

  • “For structure reference only, not to be shown literally.” When you hand over source material, tell it whether to transcribe it or learn from it. That one clause turned “reproduce the HTML” into “understand the HTML.”
  • Name the goal’s purpose, not its properties. “Earns a scroll-stop” is optimisable; “colourful” is just a checklist item.
  • Put the freedom in writing. “Be creative – deliver something more interesting than the plain original” actually changes behaviour. Models are conservative by default and under-design unless you license them not to.
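If you assemble prompts in code, the reusable shape can be captured in a small helper. This is a minimal sketch, not a library: `NorthStarPrompt` and its field names are hypothetical, and the 2-to-5 check simply enforces the short fence described in the template.

```python
from dataclasses import dataclass, field


@dataclass
class NorthStarPrompt:
    """Hypothetical container for the reusable shape: goal, context, short fence, latitude."""
    role: str
    north_star: str
    context: str
    hard_constraints: list[str] = field(default_factory=list)
    latitude: str = "deliver something more interesting than the plain original"

    def render(self) -> str:
        # Keep the fence short and load-bearing: 2 to 5 constraints, no more.
        if not 2 <= len(self.hard_constraints) <= 5:
            raise ValueError("keep the fence short: 2 to 5 hard constraints")
        constraints = "\n".join(f"- {c}" for c in self.hard_constraints)
        return (
            f"You are {self.role}.\n\n"
            f"NORTH STAR: {self.north_star}\n\n"
            "Context (for understanding, not to copy literally):\n"
            f"{self.context}\n\n"
            "Hard constraints (do not break):\n"
            f"{constraints}\n\n"
            f"Beyond those constraints, be creative: {self.latitude}."
        )


if __name__ == "__main__":
    card = NorthStarPrompt(
        role="a designer making a quote card for a social post",
        north_star="make it visually interesting, something that earns a scroll-stop",
        context="<source HTML goes here>",
        hard_constraints=[
            "keep the words exact",
            "everything must fit: wrap, never clip",
            "house look",
        ],
    )
    print(card.render())
```

The length check is a design choice rather than a rule of the technique: it makes a growing constraint list fail loudly instead of quietly eating the latitude.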

But a north star is not “no rules”

Here’s where the naïve version of this idea falls apart, so let me kill it now. “Stop giving instructions, let the AI cook” is not the lesson. A loose north star isn’t freedom – it’s negligence.

I learned this building a “Janitor” agent for a wiki-graph – the thing that compacts and merges claims over time so the knowledge base stays coherent. Give the Janitor too loose a directive (“tidy up the wiki”) and it hits a failure mode I call hallucinated consolidation: it merges two genuinely different ideas into one false claim, quietly erasing a real distinction. That’s the whole risk of vagueness in one phrase. (This is the spine of our North Star Governance work – a single directive that governs an Ingestion agent and a Janitor agent without a rulebook.[14])

So the north star isn’t the absence of direction. It’s a different kind. It’s a constraint/latitude split:

  • Hard constraints: The few things that, if broken, make the output wrong – not just different. For the cards: the words must be exact; it must fit the frame; brand colours. Non-negotiable, stated flatly.
  • Creative latitude: Everything else. Layout, metaphor, hierarchy, emphasis. Explicitly handed back: “beyond those constraints, be creative.”

The art is keeping the constraint list short and load-bearing. Every line you add is latitude you take away. So ask of each rule:

“If the model ignored this, would the result be wrong, or just not how I pictured it?” If it’s the latter – cut it. That’s not your call to make; it’s the model’s room to be good.

You can watch me doing exactly this cutting, line by line, in real edits to my own prompts – “‘translate every proposed change; drop nothing, add nothing’ → too prescriptive”; “don’t have rules like ‘chapters don’t support → drop’ – review appropriateness against the north star instead”; and the funniest one, catching myself mid-prompt: “so much for non-prescriptive: ‘You may sharpen the wording of the headline. You may not soften the claim or swap the thesis.’ → remove.” Every bullet is a small fence. The art now is taking fences away.

North star = one goal + a short fence + freedom inside the fence. Not a spec, not a void.

“But you told me to specify precisely”

If you’ve read my earlier work this looks like a contradiction. I argue elsewhere that AI is an intention compiler – it can only compile what you specify, so vague intent compiles to generic output.[18] Now I’m telling you to stop specifying. Which is it?

Both, because they’re about different things. The thing to be precise about is intent; the thing to stop over-specifying is procedure. A north star is high-information about purpose and low-information about method. The old prompt is the reverse – drowning in procedure, silent on purpose – which is why it gets you procedure-following with no purpose-serving. Tight intent, loose method.

And don’t mistake “looser” for “fuzzier.” OpenAI’s own GPT-4.1 guidance notes that newer models “follow instructions more closely and more literally” than their predecessors, which “more liberally inferred intent.”[7] The flip side, in their words: “a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model.”[7] That is the north star, validated from the other direction – one sharp sentence of intent beats a paragraph of hedged procedure. Loose method, tighter intent.

Most bad prompts get this exactly backwards: vague on purpose (“write something about our product”), suffocating on procedure (“exactly three bullets, don’t use the word ‘leverage’, start with a question”). You want the opposite.

When to stay prescriptive

This isn’t universal. North-star prompting is for tasks with a wide solution space and a taste-based target – design, writing, synthesis, “make this good.” Stay prescriptive when:

  • the output feeds a machine and must match an exact schema or format,
  • there’s a single correct answer (a calculation, a lookup),
  • a constraint is genuinely safety- or correctness-critical – then it’s a hard constraint, stated flatly, not latitude.

The skill is sorting which parts of a task are “must be exactly X” (prescribe) from “must be good” (north star) – and not accidentally prescribing the second kind. Most over-prompting is precisely that accident: dictating the creative parts because dictating feels safer, when it’s what’s holding the quality down.

And the output doesn’t have to be tidy

People get stuck here: “But if I let it write free text, I can’t parse it; I can’t put it in my pipeline.” So what? If the free text needs structure, let a second model parse it – that’s cheap now. Don’t crush the thinking step into a rigid schema just because a downstream step wants structure. Separate them. Let the smart pass be smart and messy; let a dumb, cheap, deterministic pass tidy it up afterwards. Intelligence first as free text; structure second as a cheap transform. That’s the whole point of not making one prompt do two jobs at once[17] – and it dissolves the “but I need valid JSON” objection. You do. In the second pass.
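The two-pass split can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `two_pass`, `ModelCall` and the stub functions are hypothetical names, and the model calls are passed in as plain functions so no vendor API is implied. The only deterministic part shown is the final key check.

```python
import json
from typing import Callable

# Hypothetical signature: any function that takes a prompt and returns text.
ModelCall = Callable[[str], str]


def two_pass(task_prompt: str, think: ModelCall, tidy: ModelCall,
             required_keys: set[str]) -> dict:
    """Pass 1 reasons in free text; pass 2 only reshapes that text into JSON."""
    free_text = think(task_prompt)  # smart pass: no schema imposed
    tidy_prompt = (
        "Convert the report below into one JSON object with the keys "
        f"{sorted(required_keys)}. Return JSON only.\n\n{free_text}"
    )
    structured = json.loads(tidy(tidy_prompt))  # cheap pass: structure only
    # Deterministic check on the cheap pass, so the pipeline gets valid input.
    if not isinstance(structured, dict):
        raise ValueError("second pass did not return a JSON object")
    missing = required_keys - set(structured)
    if missing:
        raise ValueError(f"second pass omitted keys: {sorted(missing)}")
    return structured


if __name__ == "__main__":
    # Stand-ins for real model calls, so the sketch runs offline.
    def fake_think(prompt: str) -> str:
        return "Sales is over-promising delivery dates; service eats the complaints."

    def fake_tidy(prompt: str) -> str:
        return json.dumps({"cause": "sales over-promising delivery dates",
                           "recommendation": "placeholder recommendation"})

    print(two_pass("Customer-sat is declining. Go look.", fake_think, fake_tidy,
                   {"cause", "recommendation"}))
```

Note that the schema never appears in the first prompt. If the second pass fails validation, you can rerun only that pass and keep the thinking.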


This is empirical, not a vibe

I don’t want this read as theory, so here are real moments from actual coding sessions.

The Haiku reversal – the deepest one. I was getting bad output from a cheap model on a dental-market report, and my instinct was to add more instructions. Wrong fix. What I actually wrote to course-correct: “I’d like the prompt to be less prescriptive, allow the model more freedom to use its own ideas. Present the data and the benchmark, and ask it to comment. I think the over-prescription was tripping up Haiku – make the data more clear, as opposed to telling it what to do. Give it all the data, and let it write about it.” The over-prescription was causing the small model to choke. The fix wasn’t more rules – it was clearer data and fewer instructions. That’s the counter-intuitive heart of it: when a weaker model struggles, the reflex is to constrain harder, and that’s often exactly backwards.

The result, next session, in one line: “the less prescriptive run, on medium, is pure gold compared to what we had before.” Less prompt, smaller model, better output.

Let it make its own criteria. Scoring some models blind, I told the agent to generate its own scoring rubric, with only light, non-prescriptive prompting: let it do the work, not me. Don’t hand it your rubric. A frontier model often builds a better one than you would.

Don’t pre-decide the findings. Drafting a research-report prompt, I cut my own conclusions: “I’d rather remove prescriptive findings – that’s the agent’s job once it considers the data. We can have loose suggestions, ideas to consider, but the conclusions are the work. Don’t steal them from the model.”

No fixed word counts. Generating one-line summaries for a knowledge graph: “give the model a north star, don’t ask for a specific number of words.” The model knows how long a good one-liner is. A word count just forces padding or truncation.

The through-line in all of these is the same as our Stop Nursing Your AI Outputs argument: we kept moving work off brittle deterministic heuristics and onto the model’s judgment – because “AI commoditised thinking; it didn’t commoditise judgment,”[15] and “interesting” is judgment.


The same instinct, on the data side

There’s a sister-insight worth one section, because it’s the same move pointed at context instead of instructions.

Classic RAG spends all its cleverness before the model sees anything – chunking, embeddings, re-rankers, “did I fetch exactly the right eight paragraphs.” Same control-freak instinct as the over-specified prompt: don’t trust the model, pre-digest everything, hand it a tiny pre-approved window. The move that beat it, for relationship-heavy corpora, was almost embarrassingly simple: stop minimising what the model reads. Load the whole territory and let the model’s attention do the selecting.[14]

Classic retrieval asks “what’s the smallest relevant context I can fetch?” My reviewer asks “what’s the broadest useful context I can afford before making a judgment?” Different question. Reading widely, even back to the source, and then deciding isn’t laziness – it’s the correct method for a smart reader. (It’s a sign of where things are going that Andrej Karpathy independently sketched an “LLM Wiki” in April 2026 with a periodic “lint pass” to keep it coherent – near-identical in shape to our Janitor.[13])

Move the intelligence out of your brittle pre-processing and into the model – because the model is now the smartest thing in the loop. It didn’t use to be. That’s what changed.

On the instruction side: stop pre-thinking the procedure; give intent and let it reason. On the context side: stop pre-digesting the data; give it the territory and let attention reason. Same lesson, twice.


Keep it honest: test the path, not just the answer

The obvious objection: if you let it roam, how do you know it did well? You can’t eyeball a rigid output checklist anymore.

Testing those wiki agents, I found you stop testing the answer and start testing the path. What did it read? Which sources did it go to? Did it hit the canonical page or wander off to a tangent? The trace of how it thought is more inspectable – and more stable – than the final text, because the text is partly a function of how sharp the model is on the day. The navigation is the reasoning made visible.

So governance doesn’t disappear when you stop micro-managing the prompt. It moves – from “did it obey my rules” (input-side control) to “did it reason well, and can I see the reasoning” (path-side verification). This is the same thing I argue in the proof-carrying decisions work: you don’t make a big AI decision governable by cramming rules into the prompt; you make it governable by having it show its working and checking the working with something deterministic.[20] The model proposes, freely and smartly; a cheap, dumb, reliable layer checks.
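As a minimal sketch of what a path-side check could look like, assume the agent’s trace is available as an ordered list of the page IDs it read. That trace format, `check_path`, `PathReport` and the page names are all hypothetical; the point is only that the check is plain set membership, with no model in the loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathReport:
    hit_canonical: bool      # did the agent read at least one canonical page?
    strays: tuple[str, ...]  # pages read that were neither canonical nor allowed

    @property
    def ok(self) -> bool:
        return self.hit_canonical and not self.strays


def check_path(pages_read: list[str], canonical: set[str],
               allowed: set[str]) -> PathReport:
    """Deterministic audit of where the agent went, independent of what it wrote."""
    hit = any(page in canonical for page in pages_read)
    strays = tuple(page for page in pages_read
                   if page not in canonical and page not in allowed)
    return PathReport(hit_canonical=hit, strays=strays)


if __name__ == "__main__":
    report = check_path(
        pages_read=["index", "refund-policy", "office-party-2019"],
        canonical={"refund-policy"},
        allowed={"index"},
    )
    print(report.ok, report.strays)
```

A check like this stays the same when the model or the wording of the answer changes, which is what makes the path a steadier test target than the final text.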

And here’s why the input-rule version of governance was never going to hold anyway: prompt-level rules are bypassable by design. Reinforcement-learning “investigator” agents jailbreak GPT-5, Claude Sonnet 4 and Gemini 2.5 Pro at success rates of 78–92%.[8] Cisco ran fifty standard jailbreak prompts at DeepSeek R1 and got a 100% bypass rate – every safety rule in the prompt, gone.[9] Prompt injection sits at the very top of the OWASP risk list for LLM apps precisely because instruction-level controls can be talked around.[10] As we put it in our Architecture, Not Vibes work: prompts are manners; architecture is physics.[19] A don’t-list is manners. Real control is the deterministic check around the model.

Don’t put the governance in the leash. Put it in the audit.

Even Karpathy’s “keep the AI on a leash” – from his 2025 “Software Is Changing (Again)” talk, with its autonomy slider and its “build Iron Man suits, not Iron Man robots” – is, read closely, a leash on the path, not the instructions: fast human-verification loops, not a longer rulebook.[12] The north star can even become a literal file – on one project the directive lives in a NORTH_STAR.md[22] that an hourly autonomous loop reads first, then does one north-star-aligned “bite” of work and stops.[21] State the destination clearly enough and the agent can navigate without a turn-by-turn route.


Stop managing the clerk. Start leading the expert.

Step back and the whole thing is a mindset change. Your prompt is a confession. A wall of don’ts confesses that you don’t trust the model – or that you’re still prompting the model you used three years ago. If your prompt would insult your smartest employee, it’s insulting your model too, and you’ll get insulted-employee work back.

The old skill was writing a complete specification. The new skill is writing a clear north star and tolerating a surprising answer.

The spec was never the quality. It was the ceiling.

The message is the same whether you write prompts for a living or you’re a leader who thinks “AI governance” means a longer rulebook: the model got smart while your prompt stayed dumb. Catch up. Be tight about why, loose about how. And if the answer comes back as messy free text with a brilliant insight buried in it – that’s not a bug to constrain away, it’s the entire point. Have a cheap second pass clean it up.

It’s the difference between managing a clerk and leading an expert. We spent five years learning to manage clerks. The clerks turned into experts. Time to learn to lead.

Try it on your next prompt

Take one prompt you’re proud of – the long, careful, detailed one. Delete every “don’t.” Collapse the procedure into a single north-star sentence of what great looks like. Keep only the two-to-five constraints that would make the output wrong if broken. Then run the old version and the new version side by side on a frontier model, and read both.
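For the first step of that exercise, a small helper can surface the don’t-lines in an existing prompt. This is a rough sketch: `split_prompt` is a hypothetical name and the keyword list is a crude heuristic, not a linguistic test. It only flags lines for review, because a genuine hard constraint may be phrased negatively (“wrap, never clip”) and should be kept or reframed by hand.

```python
import re

# Crude heuristic for prohibitions; matches straight and curly apostrophes.
NEGATION = re.compile(r"\b(?:do not|don't|don’t|never|must not)\b", re.IGNORECASE)


def split_prompt(prompt: str) -> tuple[list[str], list[str]]:
    """Return (kept_lines, flagged_lines); flagged lines contain a prohibition."""
    kept: list[str] = []
    flagged: list[str] = []
    for line in prompt.splitlines():
        (flagged if NEGATION.search(line) else kept).append(line)
    return kept, flagged


if __name__ == "__main__":
    old_prompt = (
        "Summarise the report.\n"
        "Do NOT mention pricing.\n"
        "Never assume the reader is technical.\n"
        "Keep the words exact."
    )
    kept, flagged = split_prompt(old_prompt)
    print("Review these:", flagged)
    print("Keep these:  ", kept)
```

Each flagged line then gets the same question as any other rule: would breaking it make the output wrong, or can it be restated as a purpose?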

If the looser one is better – and it usually is – you’ve just found the ceiling you’d been writing into your own prompts. Tell me what you cut: leverageai.com.au.

References

External evidence (statistics and quotes) is cited from primary or near-primary sources. LeverageAI references are cited for ideas and frameworks only, never for statistics.

  [1] Anthropic Engineering. “Effective Context Engineering for AI Agents.” – “LLMs have an ‘attention budget’… Context, therefore, must be treated as a finite resource with diminishing marginal returns… n² pairwise relationships for n tokens.” anthropic.com/engineering/effective-context-engineering-for-ai-agents
  [2] Liu, N. F., et al. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the ACL, 2024. – Performance is highest when relevant information is at the start or end of the context and “significantly degrades when models must access relevant information in the middle.” aclanthology.org/2024.tacl-1.9/
  [3] Chroma Research. “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” 2025. – Across 18 models (GPT-4.1, Claude 4, Gemini 2.5, Qwen3), “performance grows increasingly unreliable as input length grows,” even on simple tasks. trychroma.com/research/context-rot
  [4] Anthropic Engineering. “Code Execution with MCP.” – With thousands of tools, agents “need to process hundreds of thousands of tokens before reading a request.” anthropic.com/engineering/code-execution-with-mcp
  [5] Yin, M., et al. “MCPVerse: A Real-World Benchmark for Agentic Tool Use.” arXiv:2508.16260, 2025. – Agent performance degrades substantially as the number of available tools increases. arxiv.org/abs/2508.16260
  [6] OpenAI. “Best Practices for Prompt Engineering with the OpenAI API.” – Core guidance: “Instead of just saying what not to do, say what to do instead” (prohibitions reframed as positive instructions). help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
  [7] OpenAI. “GPT-4.1 Prompting Guide” (OpenAI Cookbook). – “GPT-4.1 is trained to follow instructions more closely and more literally”; “a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model.” developers.openai.com/cookbook/examples/gpt4-1_prompting_guide
  [8] Transluce. “Automatically Jailbreaking Frontier Language Models with Investigator Agents.” 2025. – RL investigator agents reach 78–92% attack success against Claude Sonnet 4, GPT-5 and Gemini 2.5 Pro on high-risk behaviours. transluce.org/jailbreaking-frontier-models
  [9] Cisco / HarmBench evaluation of DeepSeek R1. arXiv:2504.11168, 2025. – 50 HarmBench jailbreak prompts produced a 100% attack success rate against DeepSeek R1. arxiv.org/abs/2504.11168
  [10] OWASP GenAI Security Project. “LLM01:2025 Prompt Injection.” – Prompt injection is the #1 LLM application risk; prompt-level controls are bypassable (including “ignore previous instructions” exploits). genai.owasp.org/llmrisk/llm01-prompt-injection/
  [11] Ng, A. “How Agents Can Improve LLM Performance.” DeepLearning.AI, The Batch. – On HumanEval: GPT-3.5 zero-shot 48.1%, GPT-4 zero-shot 67.0%; GPT-3.5 in an iterative agent workflow up to 95.1%. (The widely-quoted “48%→95%” lift is GPT-3.5, not GPT-4.) deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/
  [12] Karpathy, A. “Software Is Changing (Again).” Y Combinator AI Startup School, June 2025. – The “autonomy slider,” keeping AI “on a leash” via fast human-verification loops, and “Iron Man suits, not Iron Man robots.” youtube.com/watch?v=LCEmiRjPEtQ
  [13] Karpathy, A. “LLM Wiki.” GitHub Gist, April 2026. – A claims-and-edges knowledge base where the LLM does the bookkeeping, plus a periodic “lint pass” for contradictions, stale claims and orphan pages. gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
  [14] LeverageAI (Scott Farrell). “The Index Is the Data: How a Self-Cleaning Wiki-Graph Out-Thinks RAG” (North Star Governance; Ingestion + Janitor; hallucinated consolidation). leverageai.com.au/the-index-is-the-data-how-a-self-cleaning-wiki-graph-out-thinks-rag/
  [15] LeverageAI (Scott Farrell). “Stop Nursing Your AI Outputs – Nuke Them and Regenerate” (“AI commoditised thinking; it didn’t commoditise judgment”). leverageai.com.au/stop-nursing-your-ai-outputs-nuke-them-and-regenerate/
  [16] LeverageAI (Scott Farrell). “Context Engineering: Why Building AI Agents Feels Like Programming on a VIC-20 Again” (context is attention, not just capacity). leverageai.com.au/context-engineering-why-building-ai-agents-feels-like-programming-on-a-vic-20-again/
  [17] LeverageAI (Scott Farrell). “Pre-Thinking Prompting: Why Your AI Outputs Fail and How to Fix Them” (the Two-Job Trap; free text now, structure later). leverageai.com.au/pre-thinking-prompting-why-your-ai-outputs-fail-and-how-to-fix-them/
  [18] LeverageAI (Scott Farrell). “The Uncomfortable Truth About AI and Effort” (AI as intention compiler; precision of intent). leverageai.com.au/the-uncomfortable-truth-about-ai-and-effort/
  [19] LeverageAI (Scott Farrell). “AI Doesn’t Fear Death – You Need Architecture, Not Vibes, for Trust” (“prompts are manners; architecture is physics”). leverageai.com.au/ai-doesnt-fear-death-you-need-architecture-not-vibes-for-trust/
  [20] LeverageAI (Scott Farrell). “Stop Asking AI Why It Decided – Build Decisions That Carry Their Own Proof” (proof-carrying decisions; model proposes, deterministic layer checks). leverageai.com.au/stop-asking-ai-why-it-decided-build-decisions-that-carry-their-own-proof/
  [21] LeverageAI (Scott Farrell). “Designing Loops, Not Prompts” (the north star as a steering artifact; durable state outside the agent). leverageai.com.au/designing-loops-not-prompts-a-field-guide-to-agentic-loops-and-who-holds-the-state-machine/
  [22] LeverageAI (Scott Farrell). “Markdown as an Operating System” (the north star as a file, e.g. NORTH_STAR.md). leverageai.com.au/markdown-as-an-operating-system/

© 2026 Leverage AI, Scott Farrell. All rights reserved. This content is made available on a limited, revocable, read-only basis only. No licence or right is granted to copy, reproduce, republish, scrape, store, adapt, summarise, index, embed, or use this content to create derivative works, work product, deliverables, methodologies, training materials, prompts, templates, software, services, research, or commercial outputs, whether by humans or machines, without prior written permission. This restriction includes internal business use, client work, consulting, advisory, implementation, and any use in or for artificial intelligence, machine learning, data extraction, retrieval, evaluation, fine-tuning, or knowledge-base construction.