The Three Ingredients Behind ‘Unreasonably Good’ AI Results

Scott Farrell • January 7, 2026 • scott@leverageai.com.au


Why some people get compound returns from AI while most plateau at linear improvements


I’ve been using AI daily for two years. Recently, several things came together that made me realise something had shifted. The only term I can find for it is “unreasonably good.”

Not just “helpful.” Not just “faster.” But results that felt disproportionate to the effort I put in. Compound returns where small additions to my workflow produced outsized improvements.

A web search combined with RAG and follow-up queries that found connections I wouldn’t have made myself. A proposal system that noticed when my frameworks applied and restructured its output accordingly. A Linux router running Claude Code on a cron job that found logic bugs and suggested improvements in a recommendations.md file every hour.

These aren’t magic. They share a pattern. And once I understood the pattern, I understood why most people plateau with AI while some get results that seem “unreasonably good.”

The Compound Interest Analogy

Think about compound interest. In year one, the returns look modest. By year ten, the gap between compound and linear growth is enormous. The same principle applies to AI workflows.

Most people use AI in a linear way: input goes in, output comes out. They might improve their prompts, upgrade to a bigger model, or add more tools. Each improvement adds incrementally. This is the AI equivalent of simple interest.

But some configurations create compound returns. The output of one loop becomes the input for the next. Each cycle doesn’t just produce results—it improves the system’s ability to produce better results. Small additions don’t add; they multiply.

48% → 95%
Performance jump when adding agentic workflows to the same model1

Andrew Ng’s results demonstrate this dramatically. GPT-3.5 alone scores around 48% on the HumanEval coding benchmark, and GPT-4 alone roughly 67%. But GPT-3.5—the weaker model—wrapped in an agentic workflow reaches about 95%.1 The smaller model with better architecture outperforms the larger model without it.

This isn’t about prompts. It’s about architecture.

The Three Ingredients

After examining what made my workflows “unreasonably good,” I found three ingredients that appeared consistently. Remove any one, and you’re back to linear returns. Combine all three, and results compound.

The Unreasonably Good Triangle

1. Agency (Self-Reflection): AI examines its own work, notices gaps, and decides what to do next. It doesn’t just generate—it evaluates, critiques, and iterates.

2. Tools (Reality Grounding): AI touches the real world through web search, code execution, RAG retrieval, and API calls. It gathers information, tests hypotheses, and validates assumptions.

3. Orchestration (Persistence Loops): Infrastructure that keeps AI running across sessions—cron jobs, hooks, persistent memory, and coordination between agents.

Andrew Ng’s four agentic design patterns—Reflection, Tool Use, Planning, and Multi-Agent Collaboration—map directly onto this triangle.2 So do the multi-agent orchestration frameworks showing 90.2% performance improvements over single-agent approaches.3

The pattern is consistent. The three ingredients aren’t optional extras—they’re the foundation.

Ingredient 1: Agency

A normal AI workflow does: retrieve → paste → answer.

An agentic workflow does: retrieve → read → notice → form hypotheses → retrieve again → cross-check → revise → restructure output.

That agentic behaviour is the key. The value isn’t the first search; it’s the follow-up searches triggered by interpretation. The AI reads its own results, notices something interesting, and decides to dig deeper. That turns retrieval into exploration.

“The Reflection Pattern involves repeating cycles of output generation, self-reflection, critique, and refinement, ultimately leading to more accurate and polished results.”
— Analytics Vidhya4

Self-reflection can improve problem-solving performance by up to 18.5 percentage points.5 But the bigger insight is that reflection transforms the nature of the work. Instead of single-pass generation, you get iterative exploration. Each pass builds on the last.
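The generate–critique–refine cycle can be sketched in a few lines. This is a toy illustration of the control flow only: `generate` and `critique` are stubs standing in for real model calls, and the stopping rule (“critic found no issues”) is one reasonable choice among several.

```python
# Minimal sketch of a reflection loop. `generate` and `critique` are stubs;
# a real implementation would make LLM calls in both places.

def generate(task: str, feedback: list[str]) -> str:
    # Stub: a real call would prompt the model with the task plus prior feedback.
    draft = f"answer({task})"
    if feedback:
        draft += " [revised: " + "; ".join(feedback) + "]"
    return draft

def critique(draft: str) -> list[str]:
    # Stub: a real critic pass would ask the model to list gaps in its own work.
    return [] if "revised" in draft else ["missing citation"]

def reflect(task: str, max_passes: int = 3) -> str:
    feedback: list[str] = []
    draft = generate(task, feedback)
    for _ in range(max_passes):
        issues = critique(draft)
        if not issues:            # critic is satisfied: stop iterating
            break
        feedback.extend(issues)   # feed the critique back into generation
        draft = generate(task, feedback)
    return draft
```

The essential property is that the critique output re-enters the generation step, so each pass starts from more information than the last.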

Ingredient 2: Tools

Without tools, AI generates plausible-sounding outputs disconnected from reality. With tools, it can test its ideas against the real world.

The ReAct framework captures this: Think → Act → Observe. The agent generates reasoning (“I should check the current API status”), takes action (calls the API), observes the result, and incorporates that observation into its next thought.5 This grounds the agent in real-world data, significantly reducing hallucination.

The tools don’t have to be complex. Web search, code execution, database queries, file reading—basic capabilities that let AI touch reality instead of just imagining it.

Key insight: Agents can build their own tools. When an agent repeatedly encounters a task its toolset can’t handle efficiently, it can construct a custom capability and add it to its toolkit. The next time it encounters a similar task, it uses the custom tool automatically. Tools compound.
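The “tools compound” idea reduces to a registry with a build-on-miss path. This sketch stubs tool construction with a closure; a real agent would write, test, and persist actual code, and the task names are invented for illustration.

```python
# Sketch of a self-extending toolkit: on the first encounter with a task kind
# the agent builds a tool; every later encounter reuses it for free.
from collections.abc import Callable

toolkit: dict[str, Callable[[], str]] = {}
built: list[str] = []                         # records which tools had to be built

def handle(task_kind: str) -> str:
    if task_kind not in toolkit:
        built.append(task_kind)               # first encounter: construct a tool
        toolkit[task_kind] = lambda: f"handled {task_kind} with custom tool"
    return toolkit[task_kind]()               # later encounters reuse it automatically
```

Construction cost is paid once per task class; everything after that is amortised, which is exactly the compounding the article describes.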

Ingredient 3: Orchestration

Agency and tools create powerful individual sessions. Orchestration extends that power across time.

Consider the Ralph Wiggum plugin for Claude Code: it turns a single session into a persistent loop. When Claude thinks it’s done, the plugin intercepts the exit, re-feeds the original prompt, and Claude continues. Each iteration sees the modified files and git history from previous runs.6

Or consider hooks that automatically trigger actions at specific points—running your test suite after code changes, linting before commits, extracting lessons into a learned.md file.7

Orchestration creates the infrastructure for compound returns:

  • Persistence: What was learned in session N is available in session N+1
  • Coordination: Multiple agents work together, each specialising in what it does best
  • Automation: The system improves even when you’re not actively using it
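The persistence bullet can be made concrete with the simplest possible mechanism: a learned.md file that each session appends to and the next session reads back. The file name mirrors the article; the helper names and lesson strings are illustrative, not a prescribed API.

```python
# Sketch of cross-session persistence via a learned.md file.
from pathlib import Path

LEARNED = Path("learned.md")

def start_session() -> list[str]:
    """Session N+1 begins with every lesson session N recorded."""
    if not LEARNED.exists():
        return []
    return [line[2:] for line in LEARNED.read_text().splitlines()
            if line.startswith("- ")]

def end_session(new_lessons: list[str]) -> None:
    # Append-only, so earlier sessions' lessons are never lost.
    with LEARNED.open("a") as f:
        for lesson in new_lessons:
            f.write(f"- {lesson}\n")
```

Nothing about this requires agents or cron jobs; any workflow that writes down what it learned and reads it back next time gets the session-N-to-session-N+1 carryover described above.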

Multi-agent orchestration—where a lead agent coordinates specialised sub-agents—shows 90.2% improvement over single-agent approaches.3 But you don’t need multiple agents to benefit from orchestration. Even simple persistence mechanisms create compound returns.

The Tweet Paradox

Here’s something counterintuitive I discovered: small outputs require more context engineering than large outputs.

A 200-character post that’s actually sharp requires the right frame, the right tension, the right novelty, the right voice, and the ruthless exclusion of everything else. When I asked AI to write short posts with minimal context, they came out generic. When I added the full context—research, frameworks, voice guidelines—quality jumped dramatically.

The naive assumption is that small = easy and big = hard. The reality is the opposite. Distillation is expensive. The smaller the output, the more cognition per character.

This explains why people plateau with AI on “simple” tasks. They assume short tasks need short prompts. But short, high-quality outputs need the deepest context.

What Elite Developers Are Doing

78% of developers now use or plan to use AI tools, and 23% employ AI agents at least weekly.8 But more interesting than the adoption numbers is how they’re using them.

“The human developer’s role shifts from hands-on coder to high-level supervisor, defining goals and guardrails while the agent carries out the work.”
— Qodo9

Senior developers are moving from hand-coding to orchestration. They’re not writing more code with AI assistance—they’re writing less code while achieving more.

The shift isn’t from “human writes code” to “AI writes code.” It’s from linear workflows to compound loops. Define goals. Set constraints. Let agentic systems iterate. Refine the architecture based on what you observe.

The Compound Gap

Here’s the uncomfortable truth: the gap between compound and linear users is widening.

Organisations that established compound AI workflows six months ago now have systems that are 50%+ more cost-efficient and significantly more capable than when they started—without changing a single line of code.10 The improvement happened through accumulated learning, refined frameworks, and self-improving loops.

Meanwhile, linear users are still doing what they did six months ago, just slightly faster.

2x / 7 months
The length of tasks AI can reliably complete has doubled approximately every seven months11

McKinsey reports that AI can now reliably complete tasks lasting roughly two hours—up from minutes just two years ago.11 Gartner predicts 15% of day-to-day work decisions will be made autonomously by 2028, up from 0% in 2024.12

The compound users aren’t just ahead—they’re pulling away.

What This Means For You

If your AI results feel linear—helpful but not transformative—look at which ingredients you’re missing.

Missing Agency? Your AI generates but doesn’t evaluate. Add reflection loops. Let AI critique its own work before presenting it. Ask it to identify gaps and fill them.

Missing Tools? Your AI operates in a vacuum. Add reality grounding. Web search, code execution, RAG over your documents, API calls to validate assumptions. Ground generation in facts.

Missing Orchestration? Your sessions are isolated. Add persistence. Keep a learned.md file that accumulates insights. Run background processes that improve the system while you sleep. Connect sessions so each one builds on the last.

The “unreasonably good” feeling isn’t random. It’s the signal that compound loops have engaged. The three ingredients aren’t a guarantee—but without them, you’re stuck in linear land.


The bottom line: Most people who plateau with AI aren’t using the wrong model or writing bad prompts. They’re using linear architecture when compound architecture is available. The fix isn’t better prompts—it’s better systems.

The three ingredients are Agency, Tools, and Orchestration. Remove any one and returns stay linear. Combine all three and results compound. The gap between those who get this and those who don’t is widening every day.

References

  1. Octet Consulting. “Notes on Andrew Ng Agentic Reasoning 2024.” octetdata.com/blog/notes-andrew-ng-agentic-reasoning-2024 — “If you take an agentic workflow and wrap it around GPT 3.5, it actually does better than even GPT-4.”
  2. Insight Partners. “Andrew Ng: Why Agentic AI Is the Smart Bet for Most Enterprises.” insightpartners.com/ideas/andrew-ng-why-agentic-ai-is-the-smart-bet-for-most-enterprises — “Andrew Ng’s framework: Reflection, Tool Use, Planning, Multi-Agent Collaboration”
  3. LeverageAI. “The Team of One.” leverageai.com.au — “We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4.5 subagents outperformed single-agent Claude Opus 4 by 90.2%”
  4. Analytics Vidhya. “Agentic AI Reflection Pattern.” analyticsvidhya.com/blog/2024/10/agentic-ai-reflection-pattern — “The Reflection Pattern involves repeating cycles of output generation, self-reflection, critique, and refinement.”
  5. Galileo. “Self-Evaluation in AI Agents.” galileo.ai/blog/self-evaluation-ai-agents-performance-reasoning-reflection — “Self-reflection can improve problem-solving performance by up to 18.5 percentage points.”
  6. Paddo.dev. “Ralph Wiggum Autonomous Loops.” paddo.dev/blog/ralph-wiggum-autonomous-loops — “When it thinks it’s done, the plugin’s Stop hook intercepts the exit, re-feeds the original prompt, and Claude continues.”
  7. Anthropic. “Claude Code Best Practices.” anthropic.com/engineering/claude-code-best-practices — “Hooks automatically trigger actions at specific points, such as running your test suite after code changes.”
  8. RedMonk. “10 Things Developers Want from Agentic IDEs.” redmonk.com/kholterhoff/2025/12/22/10-things-developers-want-from-their-agentic-ides-in-2025 — “78% of developers now use or plan to use AI tools, and 23% employ AI agents at least weekly.”
  9. Qodo. “Top 5 Agentic AI Tools.” qodo.ai/blog/agentic-ai-tools — “The human developer’s role shifts from hands-on coder to high-level supervisor.”
  10. LeverageAI. “Software 3.0.” leverageai.com.au — “Organizations that established token-burning operations six months ago now have systems that are 50%+ more cost-efficient.”
  11. McKinsey. “The Agentic Organization.” mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-agentic-organization — “The length of tasks that AI can reliably complete doubled approximately every seven months.”
  12. Sprinklr. “Agentic AI Workflow.” sprinklr.com/blog/agentic-ai-workflow — “Gartner predicts 15% of day-to-day work decisions will be made autonomously by 2028.”
