The Death of Shelf Software
and the Rise of Composable AI

Why AI-generated bespoke is replacing vendor procurement—and what enterprise leaders must do now

A comprehensive guide to navigating the AI procurement transformation


Part I: The Crisis

Chapter 1: The AI Value Gap

Why 95% of enterprises are failing to extract meaningful value from their AI investments

If your organisation has bought AI software only to watch staff spend more time correcting it than using it, you're not alone—you're in miserable company. Welcome to the AI value gap, where the chasm between promise and performance has become an industry-wide crisis.

The Widening Chasm

BCG's September 2025 study of more than 1,250 firms worldwide delivers a sobering verdict. They categorise organisations into three tiers: the "future-built" (5%), the "scalers" (35%), and the "laggards" (60%). The gap between these groups isn't just wide—it's accelerating.

The Performance Divide

  • 1.7× revenue growth (future-built vs laggards)
  • 3.6× total shareholder return (three-year TSR advantage)
  • 1.6× EBIT margin (operating profit improvement)

Source: BCG, "The Widening AI Value Gap: Build for the Future 2025"

"The gap between AI hype and AI reality has become a chasm. Most enterprises are dabbling, stalling, or scaling noise."
— BCG Researchers, September 2025

The Abandonment Crisis

S&P Global's research paints an even grimmer picture. The share of companies abandoning most of their AI initiatives jumped to 42% in 2025—up from just 17% the previous year. This isn't a marginal miss. This is a 25-percentage-point surge in failure rates year-over-year.

These aren't pilots that failed in the lab. These are initiatives that consumed budget, management attention, and political capital, only to be quietly written off. The average organisation scrapped 46% of AI proof-of-concepts before they reached production.

2024: AI Project Abandonment 17%
2025: AI Project Abandonment 42%
Figure 1.1: The dramatic year-over-year surge in AI project abandonment rates (S&P Global Market Intelligence, 2025)

The Regulatory Alarm Bell

When Gartner warned over a year ago that the majority of generative AI pilots would die after proof-of-concept, many executives dismissed it as typical analyst pessimism. That prediction has aged like fine wine.

But the most telling signal isn't from consultancies—it's from regulators. The FTC and SEC have moved from investigation to prosecution, filing enforcement actions against companies engaging in "AI-washing"—making deceptive claims about AI capabilities that don't exist.

FTC Operation AI Comply: By the Numbers

  • September 2024: 5 initial enforcement actions filed
  • Through August 2025: 12 total AI-washing cases brought
  • DoNotPay settlement: $193,000 for "AI lawyer" false claims
  • FBA Machine fraud: $15M defrauded from consumers

When regulators shift from warnings to prosecutions, the market incentives have become deeply misaligned.

The Human Cost

The productivity paradox is the cruelest twist. The promise was automation and efficiency. The reality in many organisations is staff spending more time checking AI-generated work than it would have taken to do the work themselves.

Every output needs review. Every recommendation needs verification. Every automation needs a human safety net. The AI was supposed to multiply productivity. Instead it's become a productivity tax.

"We hired the AI to save time. Now we hire people to babysit the AI."
— CTO, Fortune 500 Financial Services Firm (anonymous interview, 2025)

EY's 2025 survey found that most companies have suffered some form of risk-related financial loss from deploying AI. Not "might suffer" or "could potentially face." Have suffered. Past tense. Real money. Real compliance violations. Real customer-facing errors that required expensive remediation.

The Root Cause Question

So what's driving this spectacular failure rate? Is AI fundamentally oversold? Are the models not ready? Are enterprises just "doing it wrong"?

The answer is more structural—and more fixable—than either AI evangelists or sceptics want to admit. The crisis isn't technical. It's procurement, architecture, and organisational design colliding with a technology that doesn't behave like the enterprise software that came before it.

Two forces are at work simultaneously:

  1. Enterprises are running a 2005 procurement playbook against 2025 technology. Requirements-driven RFPs, feature checklists, multi-year contracts with vendor lock-in—all optimised for stable, deterministic software. AI is neither stable nor deterministic.
  2. Many vendors are selling yesterday's software with today's buzzwords. Legacy products with "AI" badges slapped on. Models baked in 18 months ago, now two generations behind. Proprietary wrappers around commodity APIs. The FTC is prosecuting this behaviour, but the market incentives remain misaligned.

The organisations in the 5% aren't smarter. They're not luckier. They didn't buy better AI. They bought differently, and they built differently. That difference—between 95% failure and 5% success—is what the next fifteen chapters unpack.

Chapter Takeaways

  • Only 5% of companies are extracting consistent AI value; the gap between leaders and laggards is widening rapidly
  • 42% of enterprises abandoned most AI initiatives in 2025, up from 17% the previous year—a crisis-level surge in failure rates
  • Regulators have moved from warnings to prosecutions, with 12 FTC AI-washing enforcement actions filed since 2024
  • The root cause isn't technical—it's structural: old procurement models colliding with perishable, probabilistic technology
  • The 5% who win didn't buy better AI—they bought differently and built differently

References & Further Reading

  • BCG (2025). The Widening AI Value Gap: Build for the Future 2025. Boston Consulting Group, September 2025.
  • S&P Global Market Intelligence (2025). AI Experiences Rapid Adoption, but with Mixed Outcomes.
  • FTC (2024). FTC Announces Crackdown on Deceptive AI Claims and Schemes, September 2024.
  • EY (2025). Most Companies Suffer Some Risk-Related Financial Loss Deploying AI, Survey Report.
Part I: The Crisis

Chapter 2: The Procurement Mismatch

Why requirements-driven RFPs fail catastrophically for AI technology

Traditional software procurement worked because the underlying assumptions held true: requirements were knowable upfront, vendor products were mature and stable, and implementation timelines measured in quarters were acceptable. The playbook was simple, effective, and catastrophically mismatched to AI.

The Old Playbook That Still Dominates

Walk into any enterprise procurement department and you'll see the same ritual playing out:

  1. Gather requirements from stakeholders across the organisation
  2. Write a detailed specification document (50-200 pages)
  3. Issue an RFP to three shortlisted vendors
  4. Evaluate demos against your checklist over 3-6 months
  5. Negotiate a multi-year contract with service-level agreements
  6. Deploy over 6–18 months with a systems integrator

The software would do roughly what the specification promised, or at least close enough that the delta could be managed with training and process adjustments. This worked for decades.

This model made perfect sense for ERP systems, CRM platforms, and business intelligence tools. Requirements for payroll processing or inventory management don't change fundamentally year-to-year. Oracle or SAP could freeze a scope, build to spec, and the product would remain relevant for a 3-5 year contract lifecycle.

"We spent nine months on the RFP. By the time we selected a vendor, the AI landscape had shifted twice. The capabilities we specified no longer existed."
— VP of Enterprise Architecture, Global Logistics Company

Why AI Shatters Every Assumption

AI doesn't just bend the old procurement model—it breaks it completely. Here's why:

Three Broken Assumptions

❌ Assumption 1: Requirements Can Be Frozen

Traditional: "The system shall process 10,000 transactions per hour with 99.9% uptime."

AI Reality: Use cases emerge through experimentation, not requirements gathering. You don't know what an LLM can do for your customer service workflow until you try it with real tickets. You can't spec "AI-powered contract review" the way you'd spec a database field. The value emerges from iterative testing, failure analysis, and tuning—not from a requirements document written six months before deployment.

❌ Assumption 2: Technology Is Stable

Traditional: "Version 8.2 will be supported through 2029."

AI Reality: GPT-4 launched in March 2023. By November 2023, it was superseded by GPT-4 Turbo, with better performance, lower cost, and new capabilities. By late 2025, GPT-5, Claude Sonnet 4.5, and Gemini Pro 3 are all shipping with dramatically improved capabilities. If you signed a three-year contract in 2023 that locked you to a specific model version, you ossified your capability the moment you bought it.

❌ Assumption 3: Software Is Deterministic

Traditional: "Given input X, the system returns output Y every time."

AI Reality: "The AI" isn't a thing you install. It's a capability you compose, instrument, and retrain continuously. AI systems are probabilistic: given input X, you get output Y with confidence Z, and Z varies based on training data, prompt design, model version, temperature settings, and context windows. Managing this isn't an installation project—it's an operating model.

The Time Paradox

Traditional RFP timelines are measured in quarters. AI evolution is measured in weeks.

The Procurement Timeline vs AI Evolution

Traditional RFP timeline: Requirements (3 months) → Vendor Selection (4 months) → Negotiation (2 months) → Deployment (6 months). Total: 15 months.

AI model evolution during the same period:
  • Month 0: GPT-4 is state-of-the-art
  • Month 3: GPT-4 Turbo ships (faster, cheaper)
  • Month 6: Claude Sonnet 4.5 launches
  • Month 9: GPT-4.5 rumors, pricing drops 40%
  • Month 12: Gemini Pro 3 ships
  • Month 15: GPT-5 released, 3× performance improvement

Result: Your "cutting-edge" AI is three generations behind at go-live

Figure 2.1: By the time traditional procurement completes, the AI landscape has evolved beyond recognition

The Outcome: Predictable Failure

When you run the old RFP playbook against AI, you get exactly the outcomes we're seeing:

  • 42% abandonment rates (S&P Global, 2025) — Projects that consumed months of requirements gathering, vendor evaluation, and contract negotiation, only to be quietly shelved because the deployed system doesn't match the evolved needs
  • Productivity regressions — Staff spending more time checking AI outputs than doing the work themselves, because the RFP specified "AI-powered" but didn't specify "actually useful"
  • Expensive write-offs — Three-year contracts signed for capabilities that became obsolete in six months, with no upgrade path that doesn't require renegotiating the entire deal

Not because AI doesn't work. Not because the vendors are incompetent. But because procurement assumed the technology would behave like the ERP systems it replaced.

It doesn't.

"We ran a textbook procurement. By-the-book RFP, thorough evaluation, robust contract. And it delivered exactly what we asked for—which turned out to be completely wrong."
— Chief Procurement Officer, Fortune 100 Retailer

The Fundamental Mismatch

The core problem is categorical. Traditional procurement treats software as a product to install. AI demands you treat it as a capability to cultivate.

Products Have:
  • Spec sheets
  • Version numbers
  • Deterministic behavior
  • Stable lifecycles
  • Install-and-forget deployment
Capabilities Have:
  • Feedback loops
  • Learning curves
  • Probabilistic outputs
  • Continuous evolution
  • Ongoing tuning and measurement

The organisation that tries to RFP its way to AI maturity is bringing a requirements document to an experimentation contest. It's bringing a five-year roadmap to a quarterly refresh cycle. It's bringing install-and-forget expectations to a tune-and-measure reality.

And then wondering why 42% of projects get abandoned.

Chapter Takeaways

  • Traditional RFP procurement already has a 55% failure rate—adding AI complexity makes it catastrophic
  • Requirements can't be frozen when use cases emerge through experimentation, not requirements gathering
  • AI models evolve quarterly while procurement timelines span 15 months—your solution is obsolete before deployment
  • AI is probabilistic, not deterministic—feature checklists tell you nothing about production performance
  • Treat AI as a capability to cultivate, not a product to install—the procurement model must shift from install-and-forget to tune-and-measure

References & Further Reading

  • Loopio (2025). RFP Report: 2025 Trends & Benchmarks. Enterprise RFP win rates and performance insights.
  • S&P Global Market Intelligence (2025). AI Experiences Rapid Adoption, but with Mixed Outcomes.
  • Responsive (2024). RFP Research: Top Procurement Statistics to Know.
Part I: The Crisis

Chapter 3: The Vendor Problem

When "AI-powered" becomes a sales badge on yesterday's software

The procurement mismatch is only half the equation. The other half is what many vendors are actually selling when they slap "AI-powered" on the product brochure. When regulators start filing enforcement actions, you know the problem has become systematic.

FTC Operation AI Comply: The Crackdown

In September 2024, the Federal Trade Commission launched Operation AI Comply, targeting companies that relied on artificial intelligence claims to supercharge deceptive or unfair conduct.

  • 12 AI-washing cases filed since 2024
  • $15M defrauded from consumers in one scheme alone

The Existential Vendor Dilemma

Legacy enterprise software vendors faced an existential threat in 2023: get AI into the product or watch customers churn to startups promising 10× productivity. So they did what large organisations do under pressure—they rushed.

Most started integrating AI back in 2023, when GPT-3.5 was state-of-the-art and patterns for production deployment were immature. By the time their products ship today through the typical 9–12 month enterprise software development cycle, the "AI" baked into the product is already two generations behind what dominates the press and AI-native startup demos.

The Rebranding Strategy

Worse, some vendors didn't even bother integrating new models—they just rebranded existing features:

  • 2015 feature "Rules Engine" → 2024 "AI-Powered Decision Intelligence" ✨
  • 2016 feature "Basic Search" → 2024 "AI-Enhanced Semantic Discovery" ✨
"The autocomplete we've had for five years? Now it's a 'Generative AI Writing Assistant.' Same code. New badge."
— Senior Engineer, Enterprise SaaS Vendor (anonymous, 2025)

FTC Enforcement: Case Studies

DoNotPay
$193K Settlement

Claim: "AI lawyer" subscription service that could handle legal tasks

Reality: Made misleading statements about the capabilities of the technology; could not actually perform many advertised legal functions

FBA Machine
$15M Fraud

Claim: Consumers could make guaranteed money operating online storefronts using AI-powered software

Reality: Failed to deliver on promised earnings claims; defrauded consumers out of over $15 million in a business opportunity scheme

Air AI
Aug 2025 Action

Claim: AI tool could operate autonomously and enable buyers to replace or avoid hiring employees

Reality: Deceptive marketing claims about autonomous operation and employee replacement capabilities

The Incentive Structure Problem

When regulators move from investigation to prosecution, it signals the market incentives have become deeply misaligned. Vendors get rewarded for claiming AI capabilities, not for delivering AI value.

  • Sales cycles close faster when the deck has "AI" on every slide
  • Renewals are easier when you can point to an "AI roadmap"
  • Stock prices respond to AI announcements, not AI performance
  • Competitive pressure forces "me too" announcements

The market is buying the promise, and many vendors are optimising for the promise instead of the performance.

The Black Box Business Model

Even vendors trying to do it right are constrained by their own business model. They can't give customers what would actually create value because it would undermine their competitive position:

Why Vendors Can't Be Transparent

  • Model swap-out flexibility: Admitting customers can swap models means admitting the product isn't the differentiator—the underlying API is
  • Detailed evaluation results: Exposing where the AI performs poorly on edge cases would reveal the system isn't ready for production
  • Granular observability: Letting customers see prompts, costs, and failure patterns makes it easy to build around the vendor or replace them

The product has to be a black box, sold on trust and brand, because the underlying AI capability isn't proprietary.

The Commodity Trap

Everyone has access to the same OpenAI or Anthropic APIs. The only thing separating one vendor's "AI solution" from another's is:

  1. The wrapper — UI and integration glue
  2. The integration tax — How much work to connect to your systems
  3. The sales story — Marketing narrative and brand trust

None of these create durable competitive advantage when models are perishable and customers are learning that "AI-powered" is often just brand paint.

"We pay $200k/year for 'enterprise AI.' Last week I built the same thing in an afternoon using the OpenAI API and $50 of credits."
— Lead Developer, Financial Services Firm

The Predictable Outcome

When you buy "AI software" from an established vendor, you're often getting:

  • ✗ Yesterday's model (2-3 generations behind)
  • ✗ In a proprietary wrapper (no model swapping)
  • ✗ With no upgrade path (locked into obsolescence)
  • ✗ Evaluated against cherry-picked demos (not your data)
  • ✗ Sold by a team optimised for deal closure (not deployment success)

And then everyone acts surprised when 42% of projects get abandoned.

Chapter Takeaways

  • The FTC has filed 12 AI-washing enforcement actions since 2024—when regulators prosecute, market incentives are broken
  • Legacy vendors are shipping AI integrated 18 months ago—already 2-3 generations behind current capabilities
  • Some vendors simply rebranded existing features as "AI-powered" without upgrading the underlying technology
  • Vendors can't offer transparency (model swap-out, eval results, observability) without admitting the product isn't the differentiator
  • The underlying AI isn't proprietary—everyone uses the same OpenAI/Anthropic APIs wrapped in different integration layers

References & Further Reading

  • FTC (2024). FTC Announces Crackdown on Deceptive AI Claims and Schemes, Operation AI Comply, September 2024.
  • Gartner (2025). Hype Cycle for Generative AI. Agent-washing as emerging risk category.
  • DLA Piper (2025). FTC's Latest AI-Washing Case: A Focus on Agentic AI and Productivity Claims, August 2025.
Part II: Why Shelf Software Is Obsolete

Chapter 4: The Software Evolution

From bespoke to shelf to obsolete—how AI invalidated 50 years of software economics

To understand why shelf software is dying, you need to understand why it existed in the first place. The entire history of commercial software is a history of cost amortisation—and AI just made amortisation irrelevant.

The Four Eras of Software Economics

  • 1970s–1990s, the Bespoke Era: $500k–$5M per system • 100% fit • only for large enterprises
  • 1990s–2010s, the Shelf Revolution: $5k–$100k/year • 70% fit • amortised across thousands of customers
  • 2010s–2020s, the Cloud/SaaS Evolution: multi-tenancy • subscription pricing • further amortisation
  • 2025+, the AI Inflection Point: $50k bespoke • 100% fit • marginal cost → zero • amortisation obsolete

The Bespoke Era (1970s–1990s)

In the beginning, all business software was bespoke. If you wanted a system to manage inventory or process payroll, you hired a team of developers to build it from scratch.

The Economics Were Brutal
  • Development: $500k–$5M per system
  • Timeline: 12–36 months to build
  • Affordability: Only large enterprises
  • Cost model: You paid 100% of development
But The Fit Was Perfect
  • 100% workflow fit (built for your exact processes)
  • No compromise on features or design
  • Complete control over roadmap and updates
  • Competitive advantage from custom capabilities

The price tag put bespoke software out of reach for most organisations, but for those who could afford it, the perfect fit was worth the premium.

The Shelf Revolution (1990s–2010s)

Then someone had a clever insight: What if we built the software once and sold it to 1,000 customers?

This was the birth of the software industry as we know it. Oracle, SAP, Microsoft, Salesforce—all built on this model. Develop once, sell repeatedly, amortise the fixed costs across a growing customer base.

The compromise was explicit and acceptable:

  • Customers accepted imperfect fit in exchange for affordability
  • Vendors accepted generic features in exchange for scalability

The Cloud Evolution (2010s–2020s)

Cloud and SaaS took amortisation even further. Multi-tenancy meant one instance of the software served thousands of customers simultaneously, reducing infrastructure costs. Subscription pricing smoothed revenue and lowered the upfront barrier. Implementation times dropped from quarters to weeks.

But the fundamental model remained: one product, many customers, amortised development costs.

"The economics still made sense because software development was expensive. Even with agile and modern tooling, building production-grade enterprise apps cost millions."

The AI Inflection Point (2025+)

AI hasn't just improved software development—it's collapsed the cost structure entirely.

What used to take a team of six developers eight months can now be generated by AI in days or weeks. The marginal cost of software creation is approaching zero.

The Economic Inversion

Before AI (2020): bespoke development at $2M vs shelf software at $100k/year → shelf wins (20× cheaper upfront).

After AI (2025+): AI-generated bespoke at $50k vs shelf software at $100k/year → bespoke wins (2× cheaper, plus 100% fit).

When bespoke software cost $2M, you chose shelf. When AI can generate bespoke for $50k with 100% workflow fit, the shelf option becomes economically irrational.

The Go-to-Market Tax

Here's what customers actually pay for when they buy shelf software:

Where Your $100k/Year Actually Goes
  • Actual software delivery & maintenance: $10k–$20k (10–20%)
  • Sales cost (teams, demos, RFPs): $20k–$30k (20–30%)
  • Marketing (brand, events, content): $15k–$25k (15–25%)
  • Channel partners (resellers, integrators): $20k+ (20%+)
  • Gross margin / profit: $20k+ (20%+)

You pay 5–10× the actual delivery cost to fund the vendor's go-to-market engine

Figure 4.1: Typical SaaS cost breakdown—only 10–20% goes to actual software delivery

AI Eliminates the Sales Process

Traditional procurement takes 3–9 months because:

  • Requirements gathering is labour-intensive
  • Vendor evaluation requires extensive demos
  • Contract negotiation is adversarial and time-consuming

AI collapses this entire cycle.

What AI Can Do in Days

  1. Interview stakeholders: natural language conversations to gather requirements
  2. Synthesise requirements: analyse patterns, identify gaps, suggest solutions
  3. Analyse existing workflows: map current processes and integration points
  4. Generate a working prototype: production-ready code in days, not months

Result: No sales team needed. No marketing spend required. No demo theatre. No RFP kabuki.

"When technology can discover requirements and generate solutions faster than a sales engineer can schedule a demo call, the traditional vendor model collapses."

The Shelf Software Compromise

For decades, customers accepted these compromises because shelf software was the only affordable option:

❌ 70% Workflow Fit

The other 30% requires workarounds, integrations, or process changes

❌ Feature Bloat

Paying for hundreds of features you'll never use because other customers need them

❌ Vendor Timelines

Forced upgrades, deprecation schedules, end-of-life notices on their calendar, not yours

❌ Lock-In Economics

High switching costs and integration dependencies that trap you for years

❌ Implementation Drag

6–18 months to deploy and customise even "out of the box" solutions

❌ GTM Tax

Paying 5-10× delivery cost for sales, marketing, and channel overhead

This compromise made economic sense when the alternative was $2 million in bespoke development costs.

It makes zero sense when AI can deliver 100% fit for a fraction of the shelf subscription price.

The Inflection Point

We've crossed a threshold where three conditions are simultaneously true for the first time in software history:

The Three Convergent Truths

  1. Cost inversion: AI-generated bespoke ($50k) < shelf software subscription ($100k/year)
  2. Fit superiority: bespoke workflow fit (100%) > shelf compromise (70%)
  3. Speed advantage: generation time (days/weeks) < implementation time (months/years)

When the custom solution is cheaper, better, AND faster than the off-the-shelf alternative, the economic rationale for shelf software evaporates.

The Only Reason to Buy Shelf Software Was Cost Amortisation.

That reason no longer exists.

Chapter Takeaways

  • The entire history of software is a history of cost amortisation—from bespoke to shelf to cloud
  • AI has collapsed software creation costs to near-zero, invalidating the economic rationale for shelf software
  • Customers pay 5-10× actual delivery costs to fund vendor GTM (sales, marketing, channels)—that tax is obsolete
  • AI can generate bespoke software faster than vendors can complete a sales cycle—the procurement model is obsolete
  • For the first time, bespoke is cheaper, better fit, AND faster than shelf—all three conditions true simultaneously

References & Further Reading

  • Christensen, C. (1997). The Innovator's Dilemma. Harvard Business Press. (Amortisation economics in software)
  • SaaS Capital (2024). SaaS Benchmarking Report. CAC, LTV, and gross margin metrics for enterprise SaaS.
  • McKinsey (2023). The Economic Potential of Generative AI. Cost reduction in software development.
Part II: Why Shelf Software Is Obsolete

Chapter 5: The End of Off-the-Shelf

When economic conditions shift, business models collapse

If Chapter 4 explained why shelf software is dying economically, this chapter is about what happens when those economics actually shift in the market. Spoiler: the shift is already underway.

The Value Proposition That No Longer Holds

Off-the-shelf software survived for decades on a simple value proposition:

"It's not perfect, but it's affordable and fast to deploy."

Both parts of that proposition are now false for AI applications.

❌ Not "Affordable" Anymore

You're paying vendor GTM costs (sales, marketing, channel) that are 5–10× the actual delivery cost.

The Math:
• Shelf software: $100k/year
• AI-generated bespoke: $50k one-time
Bespoke is 2× cheaper in year one and roughly 6× cheaper over a three-year term, before maintenance
❌ Not "Fast" Anymore

"Fast" in the shelf world means 6–18 months of implementation, integration, data migration, customisation, and training.

The Reality:
• Shelf implementation: 6-18 months
• AI generation: Days to weeks
AI delivers 100% workflow fit in 10-50× less time

Distribution Was the Moat—And AI Just Drained It

The shelf software industry built its competitive advantage on distribution, not technology.

AI Destroys Distribution as a Moat

  • Sales force → overhead: When software can be generated on-demand from requirements, the sales force becomes overhead, not a competitive advantage
  • Ecosystem lock-in → evaporates: When applications are composable and ephemeral, ecosystem lock-in evaporates—apps are code you control
  • Implementation complexity → barrier removed: When integration is handled by AI agents rather than professional services teams, implementation complexity stops being a barrier to entry

The "Safe" Choice Is No Longer Safe

What shelf vendors really sold was risk reduction: "Nobody ever got fired for buying Oracle."

That risk reduction came from:

  • Brand trust — decades of market presence
  • Market share — "everyone else uses it"
  • Conventional wisdom — the implicit safety of standard decisions
"Playing it safe by buying Oracle is how you end up with a 42% abandonment rate. The safe choice became the risky one."
— CIO, Fortune 500 Manufacturing Company

Composition vs Installation

The organisations getting value from AI aren't the ones buying shrink-wrapped "AI solutions" from legacy vendors.

They're the ones treating AI as a capability to compose, not a product to install.

❌ Installation Mindset (Failing)
  • Buy shrink-wrapped "AI solution"
  • Deploy over 6-18 months
  • Accept 70% workflow fit
  • Lock into vendor's roadmap
  • Hope it works, then abandon when it doesn't
✓ Composition Mindset (Succeeding)
  • Compose capabilities on stable platform
  • Generate apps in days/weeks
  • Achieve 100% workflow fit
  • Swap models quarterly as they improve
  • Measure, tune, scale what works

That compositional approach is fundamentally incompatible with the shelf software model. You can't compose what's locked in a vendor's proprietary wrapper.

Why Shelf Software Is Dying

Shelf software isn't dying because it's badly built. Oracle's database technology is excellent. Salesforce's CRM is feature-rich. SAP's ERP handles complex processes.

Shelf software is dying because the economic conditions that made it rational no longer exist.

The Three Conditions That No Longer Hold
  1. Amortisation advantage: When bespoke cost $2M, spreading the cost across 1,000 customers made sense. When bespoke costs $50k, amortisation is irrelevant.
  2. Distribution as moat: When AI can generate software faster than sales cycles run, the sales force becomes overhead, not an asset.
  3. Risk reduction through convention: When 42% of shelf-based AI projects fail, the "safe" choice is now the risky one.

Chapter Takeaways

  • The shelf software value proposition—"not perfect, but affordable and fast"—is now false on both counts
  • AI destroys distribution as a competitive moat—sales forces become overhead when software generates faster than demo cycles
  • The "safe" choice (buying Oracle/SAP/Salesforce) is now risky—42% abandonment rates prove conventional wisdom failed
  • Winners treat AI as a capability to compose, not a product to install—composition is incompatible with shelf software
  • Shelf software is dying because economic conditions (amortisation, distribution, risk reduction) no longer exist
Part II: The Diagnosis

Chapter 6: The Perishability Crisis

Why AI models expire faster than enterprise contracts—and what that means for procurement

AI models have a shelf life measured in months, not years. This creates an existential problem for traditional software contracts that lock organisations into technology that will be obsolete before the ink dries.

The Model Generation Gap

ChatGPT, running on GPT-3.5, launched in November 2022 to widespread amazement. It could hold conversations, write code, and summarise documents with unprecedented fluency. Organisations that moved quickly signed contracts in early 2023, locking in what felt like cutting-edge capability.

By March 2023—barely four months later—GPT-4 made GPT-3.5 look like a toy. Better reasoning across complex tasks. Context windows expanded from 4,096 tokens to 32,768 tokens, allowing it to process entire codebases or lengthy legal documents. Dramatically more reliable outputs with fewer hallucinations. GPT-3.5 went from "state of the art" to "legacy technology" in a matter of months.

But the evolution didn't stop there. By November 2023, GPT-4 Turbo arrived with updated knowledge cutoffs and reduced pricing. Throughout 2024, Claude 3.5 Sonnet, Gemini 1.5 Pro, and rapidly improving open models like Llama 3 pushed the frontier even further. By late 2025, GPT-5, Claude Sonnet 4.5, and Gemini Pro 3 are all shipping, while open models iterate at speeds that make proprietary vendors nervous.

The capability you bought in 2023 is now perishable goods, but your three-year contract treats it like durable infrastructure.

"We're contractually obligated to use a model that's three generations old. Our competitors upgraded six months ago. We can't, unless we renegotiate the entire deal."
— CIO, Global Financial Services Firm

Traditional Software Aged Gracefully

This perishability problem is unique to AI. Traditional enterprise software aged slowly and predictably. SAP ERP from 2015 is still fundamentally useful in 2025—the core business logic for managing inventory, processing payroll, or tracking orders hasn't changed. Customer requirements don't fundamentally shift. Accounting principles remain stable. You could sign a five-year contract and expect reasonable value throughout the lifecycle.

Even SaaS platforms like Salesforce or Workday operate on annual or quarterly update cycles, with changes that are largely additive. The CRM you deployed in 2020 still performs its core function in 2025. New features arrive, but the foundation remains solid.

Software Lifecycles: Then vs Now

Traditional Enterprise Software
  • Version lifecycle: 3-5 years
  • Core capabilities: Stable
  • Update frequency: Annual/quarterly
  • Breaking changes: Rare
  • Contract fit: Perfect
AI Models
  • Version lifecycle: 6-12 months
  • Core capabilities: Rapidly improving
  • Update frequency: Weekly/monthly
  • Breaking changes: Common
  • Contract fit: Catastrophic mismatch

Perishability Isn't Just Performance—It's Economics

The cruelest twist: newer AI models aren't just better—they're often cheaper. GPT-5 is both more capable and more cost-effective than GPT-4 Turbo on a per-token basis. Claude Sonnet 4.5 offers superior reasoning at competitive pricing. Open models like Llama 3.3 deliver strong performance with zero API costs if you self-host.

When newer models offer superior quality at lower unit cost, sticking with an older model isn't just a capability gap—it's an economic penalty. You're paying more for worse results, not because you chose poorly, but because you chose once and couldn't adapt.

This creates a compounding disadvantage. Not only are you delivering inferior service (lower accuracy, slower processing, more errors), you're also paying more to deliver it. Your competitors who can swap models quarterly are simultaneously improving quality and reducing costs. You're stuck in a contract that optimises for neither.

Vendors Build Lock-In Mechanisms By Design

The vendors know this dynamic, which is why many are building proprietary wrappers and lock-in mechanisms. They can't compete on the underlying model—everyone has API access to the same LLMs from OpenAI, Anthropic, or Google. So they compete on making it difficult to leave.

Lock-in takes several forms:

  • Proprietary data formats: Your fine-tuning data, evaluation sets, and prompt libraries are stored in vendor-specific formats that don't export cleanly
  • Custom integrations: The vendor builds connectors to your CRM, ERP, and internal systems using their proprietary APIs, creating switching costs measured in engineering-months
  • Black-box orchestration: How prompts are constructed, how outputs are post-processed, and how confidence scores are calculated remain opaque, making it impossible to replicate the workflow elsewhere
  • Long implementation cycles: Nine-month deployments with extensive customisation mean switching vendors requires re-implementing everything from scratch

These aren't bugs—they're business model features. When the underlying technology commoditises rapidly (everyone can access GPT-5 or Claude Sonnet 4.5), the only defensible position is making migration painful.

"The vendor kept saying 'we use best-in-class AI models.' What they meant was 'we lock you to whichever model we chose, and upgrading requires renegotiating your contract.'"
— VP of Engineering, E-Commerce Platform

Perishability Cuts Both Ways

But perishability creates an opportunity, not just a trap. When models improve rapidly, the value of vendor lock-in decreases. Why pay for a three-year contract with complex migration penalties when the technology you're locked into will be obsolete in six months anyway?

Smart organisations are inverting the dynamic. Instead of signing long-term contracts that assume stability, they're structuring procurement and architecture for continuous renewal:

Architecting for Perishability

  • Swappable model adapters: Abstract the LLM interface so changing from GPT-5 to Claude Sonnet 4.5 to Llama 3.3 is a configuration change, not a code rewrite
  • Quarterly refresh cycles: Evaluate new models every 90 days against your evaluation harness; swap when better options emerge
  • Decoupled contracts: Separate platform infrastructure (durable, multi-year) from model access (ephemeral, quarterly renewal)

This isn't optional sophistication—it's table stakes for avoiding the 42% abandonment rate. When your model strategy assumes continuous evolution, you're positioned to capture improvements rather than being trapped by obsolescence.
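
To make the first pattern concrete, here is a minimal sketch of pushing model choice into configuration; the file name, task names, and model identifiers are illustrative assumptions, not a prescribed setup.

```python
"""Model choices live in a config file that is reviewed each quarter; swapping
GPT-5 for Claude Sonnet 4.5 or a self-hosted Llama is an edit here, not a code
change. File, task, and model names are illustrative."""

import json

# models.json (edited at each quarterly review):
# {
#   "contract_review": {"provider": "anthropic", "model": "claude-sonnet-4-5"},
#   "ticket_triage":   {"provider": "openai",    "model": "gpt-5-mini"}
# }

def model_for(task: str, config_path: str = "models.json") -> dict:
    """Return the provider and model currently assigned to a task."""
    with open(config_path) as f:
        return json.load(f)[task]

choice = model_for("ticket_triage")
print(f"Routing ticket triage to {choice['provider']}:{choice['model']}")
```

Chapter 8 walks through the adapter layer that sits behind a file like this.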

The Five-Year Illusion

Traditional software vendors thrived on five-year enterprise agreements because switching costs were genuinely high. Migrating from Oracle to SAP meant re-implementing business processes, retraining staff, and risking operational disruption. The cost of staying put was lower than the cost of moving.

AI inverts this calculus. The cost of staying put—being locked to obsolete models with inferior performance and higher costs—now exceeds the cost of moving, if you've architected for swappability. The organisations structuring their AI procurement for continuous renewal aren't being agile for agility's sake. They're acknowledging that perishability is the new normal.

The winning organisations are those that treat AI models like fresh produce, not like infrastructure. You don't sign a three-year contract for strawberries in January and expect them to be fresh in December. Why would you do that with AI models?

Chapter Takeaways

  • AI models have a shelf life of 6-12 months—three-year contracts lock you into planned obsolescence
  • Newer models are both better and cheaper—staying locked to old models is an economic penalty, not just a capability gap
  • Vendors build proprietary lock-in because they can't compete on the underlying models—everyone has access to the same LLMs
  • Architecture for swappability—model adapters, quarterly refresh cycles, and decoupled contracts—is table stakes for avoiding obsolescence
  • Treat AI models like fresh produce, not infrastructure—continuous renewal is the new procurement normal

References & Further Reading

  • OpenAI (2022-2025). GPT Model Releases and Pricing. Model evolution timeline and capability benchmarks.
  • Anthropic (2023-2025). Claude Model Family. Performance comparisons and pricing structures.
  • Various (2025). Open Source LLM Benchmarks. Llama, Mistral, and community model performance tracking.
Part II: The Diagnosis

Chapter 7: Feature Lists Are Dead

Why probabilistic capabilities can't be evaluated with deterministic checklists

Traditional software procurement relied on feature checklists. You'd list 50 required capabilities, 30 nice-to-haves, and 20 deal-breakers. Vendors would demo their way down the list, checking boxes until someone won the RFP. For AI, this approach isn't just ineffective—it's actively misleading.

How Feature Checklists Used to Work

The feature checklist model made perfect sense when features were deterministic and testable. Either the software can export to CSV or it can't. Either it supports single sign-on with Active Directory or it doesn't. Either the CRM integrates with Salesforce or it requires manual data entry. Binary, verifiable, contract-able.

Procurement teams would create exhaustive lists:

  • ✓ Multi-factor authentication
  • ✓ Role-based access control
  • ✓ RESTful API with OAuth 2.0
  • ✓ SOC 2 Type II compliance
  • ✓ 99.9% uptime SLA

Vendors would respond with "yes," "no," or "roadmap Q3 2024." You could score responses objectively, compare vendors side-by-side, and make procurement decisions based on coverage percentage. The vendor covering 90% of requirements beats the one covering 75%. Simple math.

AI Capabilities Don't Work That Way

Now consider AI features. "Can it summarise documents?" On the surface, this seems like a reasonable yes/no question. But the real answer is: yes, but summarise well or poorly? For what kinds of documents? With what failure modes?

Every AI capability is probabilistic and context-dependent. Saying "we have AI-powered document summarisation" is like saying "we have transportation"—technically true but meaninglessly vague. A bicycle and a helicopter both provide transportation, but you wouldn't choose between them based on a feature checklist that asks "Does it move people from point A to point B?"

"The vendor demo showed perfect document summarisation. In production, it hallucinated citations, missed critical clauses, and required more review time than just reading the document."
— General Counsel, Healthcare Technology Company

The Questions Behind the Features

What matters isn't whether the feature exists, but whether the capability performs reliably on your tasks with your data in your workflows. And the only way to know that is through empirical evaluation, not vendor demonstration.

From Features to Performance

  • Old question: "Do you support document summarisation?" → New question: "Can you show offline eval results for summarisation on documents like ours, with disclosed failure modes?"
  • Old question: "Is there AI-powered search?" → New question: "What's the precision/recall on our domain after indexing our actual data?"
  • Old question: "Can it generate code?" → New question: "What percentage of generated code passes our test suite without human modification?"
  • Old question: "Does it handle customer queries?" → New question: "What's the false positive rate on escalations, and how does accuracy vary by query category?"

The shift is from features (does it exist?) to performance (does it work for us?)

How Feature Lists Enable Deception

Feature checklists let vendors show demos with cherry-picked examples. They can demonstrate AI document summarisation on a clean, well-structured 10-page PDF with straightforward content. What they won't show you:

  • Performance on 200-page scanned contracts with tables, appendices, and marginalia
  • Behaviour when critical information appears in footnotes or image-embedded text
  • Accuracy on domain-specific jargon your industry uses but wasn't in the training data
  • Failure modes when documents contain conflicting information or ambiguous phrasing
  • Degradation on edge cases like multi-language documents or non-standard formatting

The demo shows capability existence. It reveals nothing about production performance, which is the only thing that matters.

What Smart Buyers Demand Instead

Smart buyers are shifting from feature checklists to performance audits. They're treating vendor claims as hypotheses to test, not facts to accept. Here's what the new procurement looks like:

Performance-Based Procurement Requirements

1. Model Cards and Evaluation Reports

Demand documentation showing how the AI was trained, on what data, with what biases detected, and performance benchmarks on standard tasks

2. Red-Team Results

Request evidence of adversarial testing—where does the system systematically fail? What prompts cause harmful or incorrect outputs?

3. Test on Your Data Before Signing

Provide 200-500 representative examples from your actual workflows. Demand offline evaluation results with precision, recall, and disclosed failure modes

4. Access to Evaluation Harness

Require the ability to re-run tests on your data as models evolve, so you can verify that performance improvements are real

5. Confidence Scores and Failure Analysis

Insist on confidence scoring for every output, plus documentation of systematic failure patterns (e.g., "performs poorly on documents >50 pages")

This isn't about being difficult—it's about demanding evidence that the capability actually works for your use case. Vendors who can't provide this documentation aren't selling AI capabilities; they're selling promises wrapped in feature lists.
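
To make requirement 3 tangible, here is a minimal sketch of the kind of offline scoring you might run before signing; it assumes your gold labels and the vendor's predictions have been exported as CSV files, and the file and column names are illustrative.

```python
"""Score a vendor's predictions against your own gold labels before signing.
Assumes two CSVs sharing an `id` column, one with a `label` column (gold) and
one with a `prediction` column (vendor); names are illustrative."""

import csv
from collections import defaultdict

def load_column(path: str, column: str) -> dict[str, str]:
    with open(path, newline="") as f:
        return {row["id"]: row[column] for row in csv.DictReader(f)}

def per_class_metrics(gold: dict[str, str], pred: dict[str, str]) -> dict[str, dict[str, float]]:
    """Precision, recall, and F1 for every class that appears in the gold data."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for example_id, true_label in gold.items():
        predicted = pred.get(example_id)
        if predicted == true_label:
            tp[true_label] += 1
        else:
            fn[true_label] += 1
            if predicted is not None:
                fp[predicted] += 1
    metrics = {}
    for label in set(gold.values()):
        p = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        r = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        metrics[label] = {"precision": p, "recall": r, "f1": f1}
    return metrics

if __name__ == "__main__":
    gold = load_column("gold_examples.csv", "label")
    pred = load_column("vendor_predictions.csv", "prediction")
    accuracy = sum(pred.get(i) == label for i, label in gold.items()) / len(gold)
    print(f"Overall accuracy: {accuracy:.1%}")
    for label, m in sorted(per_class_metrics(gold, pred).items()):
        print(f"{label:<25} P={m['precision']:.2f} R={m['recall']:.2f} F1={m['f1']:.2f}")
```

A vendor who cannot, or will not, work through an exercise like this on your examples is telling you something important before any contract is signed.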

The Performance Audit Mindset

The shift from features to performance requires a fundamental change in how procurement operates. Instead of asking "Does your product have feature X?" you're asking "Can you prove feature X works on tasks like ours?"

This means:

  • Empirical testing replaces vendor demos: You run the AI on your data, not watch the vendor run it on theirs
  • Metrics replace checkboxes: Precision, recall, F1 scores, cost per request, time savings measured in hours—not "yes/no" on feature lists
  • Failure modes get disclosed upfront: "Works 95% of the time on X, performs poorly on Y" beats "Works on everything!" with hidden gotchas
  • Continuous evaluation replaces one-time verification: You're auditing ongoing performance, not just checking initial compliance
"We stopped asking 'Do you have this feature?' and started asking 'Can you prove it works on our data?' Half the vendors disappeared. The ones who stayed were the ones worth evaluating."
— VP of Procurement, Global Manufacturing

Why This Matters More Than Ever

As AI capabilities become table stakes, feature checklists become even less meaningful. When every vendor claims "AI-powered search," "AI-driven insights," and "machine learning optimization," the checklist becomes a list of identical claims with zero differentiation.

Performance evaluation is the only trustworthy signal. It forces vendors to prove their claims on your tasks, with your data, under your constraints. It reveals the gap between marketing promises and production reality before you sign a multi-year contract.

When features are probabilistic and context-dependent, the only trustworthy evaluation is empirical performance on representative tasks. Anything else is AI theater dressed up as procurement.

Chapter Takeaways

  • AI capabilities are probabilistic and context-dependent—feature checklists reveal existence, not performance
  • Vendor demos showcase cherry-picked examples—they tell you nothing about production performance on your data
  • Smart buyers demand model cards, evaluation reports, red-team results, and tests on their actual data before signing contracts
  • The shift is from "does it exist?" to "does it work for us?"—from features to performance, from promises to proof
  • Vendors who can't provide evaluation data and failure analysis aren't selling capabilities—they're selling feature-list promises

References & Further Reading

  • Gartner (2025). Top Strategic Technology Trends: AI-Washing and Agent-Washing Risks.
  • Mitchell, M. et al. (2019). Model Cards for Model Reporting. Framework for documenting ML model performance.
  • Various (2025). Red Teaming Large Language Models. Best practices for adversarial testing.
Part III: The Solution

Chapter 8: Composable Architecture Explained

Building for perishability: the three-layer model that lets you adapt faster than the market

If shelf software is dying and pure bespoke drowns you in maintenance costs, what's the alternative? The answer is composable bespoke on a stable platform—a three-layer architecture that lets you buy commodity infrastructure, abstract perishable models, and generate task-specific applications that live only as long as they provide value.

The Three-Layer Model

Think of AI architecture as three distinct layers, each with different economics, lifecycles, and procurement strategies. Getting this separation right is the difference between flexibility and lock-in, between continuous improvement and planned obsolescence.

The Three Layers of AI Architecture

  1. Foundation (Durable Infrastructure): Buy commodity services that benefit from economies of scale—identity, data, observability, security. Lifecycle: 3–5 years.
  2. Abstraction (Model Adapters): Abstract LLM interfaces so swapping models is a config change, not a rewrite. Lifecycle: continuous refresh (quarterly).
  3. Applications (Ephemeral Micro-Apps): Generate task-specific apps that do one thing well, run on swappable models, and retire when obsolete. Lifecycle: weeks to months.

Layer 1: Buy the Foundation (Durable Infrastructure)

Some capabilities benefit massively from economies of scale and don't differentiate your business. You're not competing on who has better identity management or more sophisticated logging. Buy these as managed services and move on.

What belongs in Layer 1:

  • Identity and access management: Auth0, Okta, Azure AD, AWS Cognito—you're not competing on login flows. Buy mature, compliant IAM and forget about it.
  • Data infrastructure: Managed databases (Postgres, MySQL), data warehouses (Snowflake, BigQuery), vector stores (Pinecone, Weaviate, Qdrant). Commodity infrastructure that benefits from vendor scale.
  • Observability and monitoring: Datadog, Grafana, New Relic, Splunk—logging, monitoring, tracing, alerting. Buy mature stacks with integrations to your existing tools.
  • Security and policy enforcement: Encryption, key management, secrets rotation, compliance controls—AWS KMS, HashiCorp Vault, cloud-native security services.
  • CI/CD and deployment: GitHub Actions, GitLab CI, CircleCI, your existing DevOps tooling. Don't reinvent deployment pipelines.

This is the platform. It's boring, it's mature, and everyone uses roughly the same stack. The competitive differentiation comes from what you build on top of it, not from reinventing auth or monitoring.

Layer 2: Assemble with Swappable Adapters (Model Abstraction)

Don't hardcode model dependencies into your application logic. Use adapters that let you swap GPT-5 for Claude Sonnet 4.5 for Llama 3.3 without rewriting code. This is the layer that acknowledges perishability and gives you an upgrade path.

What belongs in Layer 2:

  • Model routing layer: Abstract the LLM API so changing providers is a config change. Example: instead of calling openai.ChatCompletion.create() directly, call llm.complete() where llm is configured externally.
  • Prompt management system: Store prompts separately from code, version them in git, A/B test different formulations. When you find a better prompt, deploy it without code changes.
  • Evaluation harnesses: Test any model against your gold evaluation data before deploying it. Compare GPT-5 vs Claude Sonnet 4.5 vs Llama on your tasks, pick the winner based on metrics.
  • Fallback chains and routing logic: Try GPT-5 first. On timeout or low confidence, fall back to Claude Sonnet. On ambiguous cases, escalate to human review. Route intelligently based on task complexity and model strengths.
  • Cost and performance tracking: Log every model call with cost, latency, confidence score, and outcome. Know which models work best for which tasks, and what they cost per request.

This is your model abstraction layer. It acknowledges that models are perishable and gives you an upgrade path that doesn't require vendor renegotiation or application rewrites. When GPT-6 launches or a fine-tuned Llama model outperforms commercial APIs, you can swap it in and measure the improvement empirically.
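
A minimal sketch of such an adapter follows; it assumes the official openai and anthropic Python SDKs, and the model identifiers and fallback choice are illustrative. The point is that application code calls complete() while the provider behind it is configuration, with a fallback if the primary call fails.

```python
"""Thin model adapter: callers use complete(); the provider and model come from
configuration, with a fallback chain. Model identifiers are illustrative."""

from openai import OpenAI            # pip install openai
from anthropic import Anthropic      # pip install anthropic

# Swapping providers or models is an edit to this configuration, not to callers.
PRIMARY = {"provider": "openai", "model": "gpt-5"}
FALLBACK = {"provider": "anthropic", "model": "claude-sonnet-4-5"}

_openai = OpenAI()        # reads OPENAI_API_KEY from the environment
_anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def _call(provider: str, model: str, prompt: str) -> str:
    if provider == "openai":
        resp = _openai.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = _anthropic.messages.create(
            model=model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"Unknown provider: {provider}")

def complete(prompt: str) -> str:
    """Try the primary model; fall back to the secondary if the call fails."""
    try:
        return _call(PRIMARY["provider"], PRIMARY["model"], prompt)
    except Exception:
        return _call(FALLBACK["provider"], FALLBACK["model"], prompt)
```

The cost and performance tracking described above fits naturally as a log line around _call(); Layer 3 applications depend only on complete() and never import a provider SDK directly.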

"We spent two weeks building a model abstraction layer. It's saved us six months of rewrites across three major model upgrades."
— VP of Engineering, FinTech Startup

Layer 3: Compose Task-Specific Micro-Apps (Ephemeral Applications)

This is where the magic happens. Generate small, focused applications that do one thing well, integrate with your platform (Layer 1), run on swappable models (Layer 2), and live only as long as they provide value.

Examples of Layer 3 applications:

Task-Specific Micro-Apps in Action

Contract Review Agent

Reads NDAs and vendor agreements, flags risky terms based on your legal playbook, routes to legal team when uncertain. Confidence thresholds tuned for your risk tolerance.

Lifecycle: Generated in one week, runs indefinitely or gets replaced when legal requirements change

Customer Query Triage

Categorises support tickets, suggests responses for common queries, escalates complexity cases. Trained on your historical ticket data with your resolution patterns.

Lifecycle: Spun up for a product launch, runs for 6 months, retired when query volume stabilises

Invoice Data Extraction

Pulls structured data from PDF invoices, validates against schema, populates your accounting system. Handles multiple invoice formats from different vendors.

Lifecycle: Built for an integration project, runs indefinitely or gets replaced when invoice formats change

Meeting Summarisation Bot

Joins video calls, transcribes discussion, generates action items and decisions, posts summary to Slack. Integrates with your calendar and project management tools.

Lifecycle: Generated for executive team, rolled out to all teams if successful, retired if adoption is low

These apps are ephemeral—disposable by design. Cheap to build (AI-generated), run on stable platform (Layer 1), swap models freely (Layer 2).

These applications are ephemeral by design. They're cheap to build because AI assists in code generation. They run on a stable platform (Layer 1) so infrastructure is reliable. They can swap models freely (Layer 2) so they stay current as capabilities improve. When requirements change or better models arrive, you regenerate or retire them without disrupting the foundation.
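
To show how thin a Layer 3 app can be, here is a hypothetical sketch of the ticket-triage example built on the complete() adapter sketched earlier; the module name, categories, and confidence threshold are illustrative assumptions.

```python
"""Ephemeral micro-app: triage a support ticket, escalate when the model is unsure.
Categories, the threshold, and the adapter module name are illustrative."""

import json
from llm_adapter import complete  # the Layer 2 adapter sketched earlier (hypothetical module)

CATEGORIES = ["billing", "shipping", "returns", "technical"]
CONFIDENCE_THRESHOLD = 0.8  # tuned against the evaluation harness, not guessed

def triage_ticket(ticket_text: str) -> dict:
    prompt = (
        f"Classify this support ticket into one of {CATEGORIES} and rate your "
        'confidence from 0 to 1. Reply as JSON: {"category": ..., "confidence": ...}\n\n'
        f"Ticket:\n{ticket_text}"
    )
    try:
        result = json.loads(complete(prompt))
    except json.JSONDecodeError:
        return {"category": None, "route": "human_review"}  # unparseable output goes to a person
    if result.get("category") not in CATEGORIES or result.get("confidence", 0) < CONFIDENCE_THRESHOLD:
        return {"category": result.get("category"), "route": "human_review"}
    return {"category": result["category"], "route": "auto"}
```

When the micro-app is retired, this file is deleted; the adapter and the platform underneath it carry on serving whatever replaces it.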

Why This Model Wins

Composable bespoke architecture delivers five critical advantages that neither shelf software nor pure bespoke can match: 100% workflow fit, cost efficiency (no GTM tax, no unused feature bloat), speed (apps generated in days or weeks), adaptability (models and apps swapped as capabilities improve), and zero lock-in (the infrastructure is commodity and the applications are code you control).

This Isn't "Build Everything Yourself"

Composable architecture is often misunderstood as "build everything from scratch." It's not. You're buying heavily for Layer 1 (infrastructure), buying or building thin adapters for Layer 2 (model abstraction), and generating Layer 3 (applications) with AI assistance.

The distinction matters:

  • Traditional bespoke: Build your own database, auth system, monitoring, everything. Maintenance nightmare, high ongoing costs.
  • Composable bespoke: Buy infrastructure that benefits from scale (Layer 1), build thin abstraction layers (Layer 2), generate task-specific apps (Layer 3). Maintenance focused on business logic, not plumbing.

You're not reinventing identity management or database engines. You're buying the boring infrastructure and focusing your engineering effort on the applications that differentiate your business.

"We buy the infrastructure that benefits from scale. We generate the applications that need to fit perfectly. That's the strategy."
— CTO, Healthcare Analytics Platform

The Path Forward

Composable architecture isn't a distant future vision—it's available today with existing tools. Cloud infrastructure is commodity. Model APIs are accessible. Code generation with AI assistance is production-ready. The organisations in the 5% aren't waiting for better technology—they're using what exists to build differently.

The shift is from "buy a product that almost fits" to "buy infrastructure, generate apps that fit exactly, swap components as technology improves." That's how you avoid the perishability trap, the lock-in spiral, and the 42% abandonment rate.

Chapter Takeaways

  • Composable architecture has three layers: durable infrastructure (buy), model abstraction (adapt), ephemeral apps (generate)
  • Layer 1 (infrastructure) should be boring commodity services—identity, data, observability, security
  • Layer 2 (model abstraction) enables swapping GPT-5 for Claude or Llama with a config change, not a code rewrite
  • Layer 3 (apps) are task-specific, AI-generated, ephemeral—disposable when obsolete, cheap to regenerate
  • This model delivers 100% fit, cost efficiency, speed, adaptability, and zero lock-in—advantages neither shelf software nor pure bespoke can match

References & Further Reading

  • Richardson, C. (2018). Microservices Patterns. Manning Publications. Foundation for composable architecture thinking.
  • Various (2025). LangChain, LlamaIndex Documentation. Tools for building model-agnostic AI applications.
  • Fowler, M. (2014). Microservices. martinfowler.com. Architectural patterns for composability.
Part III: The Solution

Chapter 9: Hypothesis-Driven Procurement

Treating AI use cases as falsifiable experiments, not requirements documents

Traditional procurement starts with requirements: "We need software that does X, Y, and Z." Then you find a vendor who claims to do X, Y, and Z, negotiate a contract, and deploy. The assumption is that if the software meets the spec, it will deliver value. For AI, this is backwards. You don't know what will deliver value until you test it empirically.

The Requirements-First Model Breaks Down

The traditional model assumes you can specify outcomes before you test capability. Write down what you need, vendors bid on delivering it, you verify delivery against the spec. This works when capabilities are knowable and stable.

But AI capabilities aren't knowable until you test them on your data, in your workflows, with your edge cases. You can't specify "AI shall categorise support tickets with 90% accuracy" without first knowing whether 90% is achievable, what it costs to get there, and whether the remaining 10% creates unacceptable failure modes.

Hypothesis-Driven Procurement: The Framework

Hypothesis-driven procurement flips the model. Instead of gathering requirements, you formulate testable hypotheses. Instead of RFPs and demos, you run experiments. Instead of long contracts, you fund in stages based on empirical results.

Here's how it works:

The Five-Step Hypothesis Framework

1. Define the Hypothesis: Formulate a testable claim with baseline, target delta, and measurable outcomes. Not "we need AI support" but "an LLM can reduce response time from 4 hours to 1 hour with 85% categorisation accuracy."

2. Build an Evaluation Harness: Create gold data (200-500 examples), run offline testing, measure accuracy/precision/recall. If you can't beat baseline offline, kill the hypothesis immediately.

3. Shadow Deployment: Run AI in parallel with current processes. Humans work as usual, AI suggestions logged but not acted on. Reveals real-world failure modes that offline testing missed.

4. Bounded Production Rollout: Deploy to limited scope—10% of users, one product line, low-severity cases only. Measure relentlessly: did you hit target improvements? What are unexpected costs?

5. Kill or Scale Based on Unit Economics: If bounded rollout hits targets and unit economics work, scale it. If not, kill it fast and document why. No sunk cost fallacy, no "let's give it more time."

Step 1: Define the Hypothesis

Instead of "We need AI-powered customer support," formulate a testable claim with three components:

  1. Baseline: Current performance you're trying to improve (e.g., "4-hour average response time, manual categorisation taking 15 minutes per ticket")
  2. Target delta: Specific improvement you're hypothesising (e.g., "85% categorisation accuracy, 60% response acceptance rate, 1-hour response time")
  3. Measurable outcomes: Metrics you can verify empirically (accuracy, acceptance rate, time-to-response, cost per transaction)

Example hypothesis:

"An LLM can categorise incoming support tickets with 85% accuracy and suggest responses that agents accept without modification 60% of the time, reducing average response time from 4 hours to 1 hour."

This hypothesis is falsifiable. You can test it, measure the results, and either confirm or reject it based on data. It's not a requirement—it's a claim about what might be possible.
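
One way to keep hypotheses honest is to store them as structured records rather than slideware. A minimal sketch, with illustrative field names, using the ticket-triage hypothesis above:

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A falsifiable claim: the baseline, the target delta, and the metrics that decide it."""

    claim: str
    baseline: dict[str, float]   # current performance you're trying to improve
    target: dict[str, float]     # the specific delta you are betting on
    metrics: list[str] = field(default_factory=list)


ticket_triage = Hypothesis(
    claim=("An LLM can categorise support tickets with 85% accuracy and suggest responses "
           "agents accept 60% of the time, cutting response time from 4 hours to 1 hour."),
    baseline={"response_hours": 4.0, "categorisation_minutes": 15.0},
    target={"accuracy": 0.85, "acceptance_rate": 0.60, "response_hours": 1.0},
    metrics=["accuracy", "acceptance_rate", "time_to_response", "cost_per_ticket"],
)
print(ticket_triage.claim)
```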

Step 2: Build an Evaluation Harness

Before anything touches production, create a controlled test environment. This is where most organisations skip steps and pay for it later.

What you need:

  • Gold data: 200–500 historical examples with ground-truth labels. For support tickets, this means tickets with confirmed correct categories and ideal responses.
  • Offline scoring: Run the AI on gold data without human involvement. Measure precision (how many AI predictions are correct), recall (how many correct cases does AI catch), F1 score (harmonic mean of precision and recall).
  • Failure analysis: Identify systematic failure modes. Does it always miscategorise billing issues? Does it fail on tickets longer than 500 words? Does it struggle with non-native English?
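
A minimal offline-scoring sketch for the harness described above, assuming the gold labels and model predictions have already been collected as parallel lists (how the predictions are generated is out of scope here):

```python
from collections import Counter


def category_scores(gold: list[str], predicted: list[str], label: str) -> dict[str, float]:
    """Precision, recall, and F1 for one ticket category."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy gold data; in practice this is your 200-500 labelled historical examples.
gold = ["billing", "order_status", "billing", "refund", "order_status"]
pred = ["billing", "order_status", "refund", "refund", "order_status"]

for label in Counter(gold):
    print(label, category_scores(gold, pred, label))
```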

Step 3: Shadow Deployment

If offline testing succeeds, run the AI in parallel with current processes without changing workflows. This reveals real-world complexity that gold data missed.

How shadow mode works:

  • AI categorises tickets and suggests responses, but outputs are logged, not acted on
  • Humans perform the work as usual—no disruption to operations
  • Log where AI matches human decisions, where it diverges, and where humans would override it
  • Run for 2-4 weeks to capture edge cases and drift patterns

Shadow mode reveals failure modes that offline testing missed. Maybe the AI performs well on historical data but poorly on recent tickets about a new product. Maybe accuracy is great overall but catastrophically bad on a specific ticket category that's high-stakes. You discover these issues in shadow mode, not production.
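
The engineering work in shadow mode is mostly logging: record the AI's suggestion next to what the human actually did, and act on neither. A sketch, where suggest_category is a hypothetical wrapper around whatever model you are evaluating:

```python
import json
import time


def suggest_category(ticket_text: str) -> dict:
    """Hypothetical wrapper around the model under test; returns a label and confidence."""
    return {"category": "order_status", "confidence": 0.91}


def log_shadow(ticket_id: str, ticket_text: str, human_category: str) -> None:
    """Record AI vs human side by side; nothing downstream acts on the AI output."""
    suggestion = suggest_category(ticket_text)
    record = {
        "ts": time.time(),
        "ticket_id": ticket_id,
        "human": human_category,
        "ai": suggestion["category"],
        "confidence": suggestion["confidence"],
        "match": suggestion["category"] == human_category,
    }
    with open("shadow_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")


log_shadow("T-1042", "Where is my order?", human_category="order_status")
```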

"Shadow deployment saved us. The AI looked perfect in offline testing. In shadow mode, we discovered it failed spectacularly on tickets mentioning our new product—which wasn't in the training data."
— Director of Customer Success, SaaS Platform

Step 4: Bounded Production Rollout

If shadow testing confirms the hypothesis, deploy to a limited scope where failure is contained and learning is fast:

  • Route only low-severity tickets through AI (avoid high-stakes cases until confidence is proven)
  • Enable AI suggestions for one product category (test with constrained complexity before expanding)
  • Let 10% of agents opt into AI-assisted workflows (early adopters who'll provide detailed feedback)
  • Measure relentlessly: Did response time actually drop to 1 hour? Is acceptance rate holding at 60%? Are there unexpected costs (more escalations, lower customer satisfaction, higher review time)?

Bounded rollout lets you capture most of the learning while limiting blast radius. If something goes wrong, it affects 10% of tickets, not 100%. If unit economics don't work at this scale, you kill before full deployment.
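
One common way to hold the rollout to a fixed slice of traffic is deterministic hashing on a stable ID, so the same agents (or tickets) stay in the treatment group for the whole measurement window. A minimal sketch:

```python
import hashlib


def in_rollout(stable_id: str, percent: int = 10) -> bool:
    """Deterministically assign roughly percent% of IDs to the AI-assisted path."""
    digest = hashlib.sha256(stable_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


agents = [f"agent-{i}" for i in range(1000)]
treated = [a for a in agents if in_rollout(a)]
print(f"{len(treated)} of {len(agents)} agents routed to the AI-assisted workflow")
```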

Metrics That Matter in Bounded Rollout

Performance Metrics
  • Accuracy/precision/recall vs baseline
  • Response time reduction (target: 4hrs → 1hr)
  • Agent acceptance rate (target: 60%)
  • False positive / false negative rates

Economic Metrics
  • Cost per request (API + infrastructure + review)
  • Time saved per agent (hours/week)
  • Value generated ($ savings or revenue uplift)
  • Unit economics: value - cost per 1000 requests

Quality Metrics
  • Customer satisfaction (CSAT) change
  • Escalation rate to senior support
  • Error rate requiring correction
  • Agent Net Promoter Score

Failure Metrics
  • Systematic failure patterns by category
  • Confidence calibration (is low confidence reliable?)
  • Drift detection (performance degrading over time?)
  • Edge case frequency and impact

Step 5: Kill or Scale Based on Unit Economics

After 30-90 days of bounded rollout, you have data. Now make the binary decision: kill or scale.

Scale if:

  • ✓ Performance hits target improvements (85% accuracy, 60% acceptance, 1-hour response time)
  • ✓ Unit economics are positive (value > cost by at least 2×)
  • ✓ Quality metrics remain stable or improve (CSAT, error rate, escalations)
  • ✓ Agent adoption is strong (high usage, positive feedback)

Kill if:

  • ✗ Performance misses targets by >15% (accuracy 70% when you need 85%)
  • ✗ Unit economics don't work (cost > value, or ROI < 2×)
  • ✗ Quality issues emerge (rising error rate, falling CSAT, high escalations)
  • ✗ Agent adoption is low despite training (workarounds, complaints, non-usage)

The kill decision is critical. Don't fall into sunk cost fallacy. Don't "give it more time" if data says it's not working. Document why you're killing it so you don't repeat the mistake, then move on to the next hypothesis.

Why This Works: Rigour Replaces Hope

Hypothesis-driven procurement forces rigour at every stage. It prevents "deploy and hope." It makes failure cheap—kill after offline testing (costs: $5k and 2 weeks) rather than after 6 months in production (costs: $500k and organisational trauma).

This is how the 5% get consistent AI value. Not by buying better products—by testing hypotheses faster and killing failures earlier. Not by perfect requirements gathering—by empirical experimentation with clear success criteria.

The framework aligns procurement with the scientific method. You're testing claims with progressively higher fidelity, killing bad hypotheses early, and scaling only what's been proven empirically. That's the difference between the 5% who win and the 42% who abandon.

Chapter Takeaways

  • Replace requirements-first procurement with hypothesis-driven experimentation—you can't know what works until you test it empirically
  • The five-step framework: define hypothesis, offline eval, shadow deployment, bounded rollout, kill or scale based on data
  • If AI can't beat baseline in offline testing, kill immediately—don't deploy and hope it improves in production
  • Shadow deployment reveals real-world failure modes that offline testing misses—run it before production rollout
  • Kill fast based on data, not sunk cost—offline eval kill costs $5k, production failure costs $500k+

References & Further Reading

  • Ries, E. (2011). The Lean Startup. Crown Business. Foundation for hypothesis-driven experimentation.
  • Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge University Press.
  • Various (2025). ML Testing & Evaluation Best Practices. Patterns for rigorous AI evaluation.

Part III: The Solution

Chapter 10: Capability Audits

Auditing what systems can actually do, not what vendors claim they can do

Feature checklists are dead (Chapter 7), but you still need to evaluate vendors and technologies. The replacement is capability audits—rigorous examination of whether the system can actually perform on your tasks, with your data, under your constraints.

From Feature Lists to Performance Verification

Traditional procurement asked "Do you have feature X?" Capability audits ask "Can you prove feature X works for us?" This isn't semantic nitpicking—it's the difference between vendor promises and production reality.

Here's what to audit instead of chasing feature lists:

The Six Dimensions of Capability Audits

1. Data Access & Quality: Can the AI work with your actual data, or only sanitised demos? Does it require data to leave your environment?

2. Evaluation Methodology: Can they show offline eval results on tasks like yours, with failure modes disclosed?

3. Safety & Controls: Can you route low-confidence outputs to human review automatically, or is it all-or-nothing?

4. Model Swap-Out Policy: When GPT-6 or Claude Sonnet 5 launches, can you upgrade without renegotiating your contract?

5. Observability: Can you inspect prompts, inputs, outputs, costs, and drift in real time, or is the system a black box?

6. Ownership & Exit: Who owns the data, models, and platform if you leave? What's your exit path if performance degrades?

Dimension 1: Data Access and Quality

Many AI solutions work beautifully on vendor-curated test data and fail spectacularly on messy enterprise reality. The first audit question: Can the vendor's AI work with your actual data?

Questions to ask:

  • Data format compatibility: Will it handle your legacy PDFs, scanned images, semi-structured CSVs, or does it require perfectly formatted JSON?
  • Data residency: Does the AI require data to leave your environment? If you're in healthcare or finance, can it run on-premises or in your VPC?
  • Data preparation tax: How much cleaning and transformation is required before the AI can process your data? If it's "weeks of ETL work," factor that into cost and timeline.
  • Volume handling: Demos show 100 examples. Can it handle your 10 million records without performance degradation?

Dimension 2: Evaluation Methodology

Can the vendor show you offline evaluation results on tasks similar to yours, with failure modes disclosed? If they can't or won't, that's a red flag—either they don't have the data or the results are bad.

What to demand:

  • Precision and recall on representative tasks: Not "90% accuracy" (meaningless without context) but "precision 0.85, recall 0.80 on documents similar to yours"
  • Disclosed failure modes: Where does the system systematically fail? ("Performs poorly on documents >50 pages" or "hallucinates citations 5% of the time")
  • Evaluation data selection methodology: Was the data cherry-picked for demos or representative sampling of real-world cases?
  • Model cards and red-team reports: Documentation showing training data sources, known biases, adversarial testing results
"We asked for evaluation data on our use case. Vendor A provided detailed metrics and failure analysis. Vendor B said 'trust us, it works great.' We went with Vendor A."
— Director of AI, Insurance Company

Dimension 3: Safety and Controls

Without safety controls, you're choosing between two bad options: check everything manually (productivity killer) or trust blindly (compliance disaster). Audit whether the system provides intelligent routing based on confidence and stakes.

Essential safety features:

  • Confidence scores: Does the system provide a confidence score for each output? Can you trust that score (is it calibrated)?
  • Configurable routing policies: Can you set rules like "escalate to human when confidence <0.7 OR stakes >$10k"?
  • Guardrails: Are there content filters to prevent harmful outputs, policy violations, or regulated disclosures?
  • Audit trails: Can you trace every decision the AI made, with full context, for compliance review?

Dimension 4: Model Swap-Out Policy

Models are perishable (Chapter 6). When GPT-6 or Claude Sonnet 5 or Llama 5 launches in 8 months, can you upgrade without renegotiating your contract? This isn't a hypothetical—it's a certainty.

What to verify:

  • Architecture: Is the system hardcoded to a specific model (lock-in trap) or does it use swappable adapters (freedom to upgrade)?
  • Decision control: Who controls the upgrade decision—you or the vendor? If the vendor, do they upgrade automatically (breaking your workflows) or never (leaving you stuck on old models)?
  • Migration path: What happens if the vendor's preferred model becomes obsolete or expensive? Can you bring your own model?
  • Cost implications: If newer models are cheaper, do you automatically benefit or is pricing locked?

A vendor response of "We use GPT-5" without an adapter layer is a red flag. Models are perishable; your architecture shouldn't be.

Dimension 5: Observability

Can you inspect prompts, inputs, outputs, costs, and drift in real time? Or is the system a black box where you deploy and pray?

Observability requirements:

What Full Observability Looks Like

Logging & Auditing
  • Log every prompt and response
  • Capture timestamps, user IDs, session context
  • Export logs for compliance audits
  • Retention policies you control

Cost Visibility
  • Per-request cost breakdown
  • Cost by user, team, workflow
  • Budget alerts and caps
  • Cost vs value dashboards

Performance Monitoring
  • Real-time accuracy tracking
  • Latency and error rates
  • Confidence distribution analysis
  • User override patterns

Drift Detection
  • Track performance over time
  • Alert when accuracy degrades
  • Input distribution shifts
  • Automatic re-evaluation triggers

Data ownership: You own all observability data, not the vendor. If they can't export it, you don't really have it.

If you can't see what's happening inside the system, you can't diagnose failures, optimise costs, or prove compliance. Black-box AI is a procurement non-starter.

Dimension 6: Ownership and Exit Rights

What happens if you need to leave? Who owns the data, the fine-tuning, the prompts, the evaluation harnesses? Can you take your work with you or are you starting from scratch?

Ownership audit:

  • Input/output data: Do you own all data that passes through the system? Can you export it in standard formats?
  • Fine-tuning and model weights: If you've fine-tuned models on your data, do you own those weights?
  • Prompts and orchestration logic: Are prompts stored in your git repo or locked in the vendor's platform?
  • Evaluation data and harnesses: Can you take your test sets and evaluation code with you?
  • Exit timeline: How long does migration take if you leave? 30 days or 12 months?

If the vendor won't commit to data portability and reasonable exit timelines, they're counting on lock-in as their retention strategy.

The Capability Audit Checklist

Before signing a contract or committing engineering time, demand clear answers to these six questions:

Pre-Contract Capability Audit

  1. Can you show offline evals on tasks like ours, with failure modes disclosed?
  2. Can we test it on our actual data (200-500 examples) before committing?
  3. Is there a confidence signal and routing policy for human-in-the-loop triage?
  4. What's the model swap-out plan when better/cheaper options appear in 6-12 months?
  5. Do we get full observability: prompts, outputs, costs, drift metrics, all exportable?
  6. Who owns the data, models, and platform if we leave? What's the exit timeline?

Vendors who can't answer these questions clearly aren't selling AI capabilities—they're selling promises wrapped in proprietary lock-in. The ones who can answer are the ones worth evaluating.

"We turned the capability audit into our standard RFP. Half the vendors withdrew. The half that stayed could actually answer the questions. Made the decision easy."
— VP of Technology, Global Logistics

Chapter Takeaways

  • Replace feature checklists with capability audits—verify the system works on your tasks, not vendor-curated demos
  • Audit six dimensions: data access, evaluation methodology, safety controls, model swap-out, observability, ownership
  • Demand testing on your actual data before signing—vendor demos on curated datasets reveal nothing about production performance
  • Without confidence scores and routing policies, you're stuck between "check everything manually" or "trust blindly"—both fail
  • Full observability (prompts, costs, drift) and data ownership are non-negotiable—black-box AI is a compliance and debugging nightmare

References & Further Reading

  • Mitchell, M. et al. (2019). Model Cards for Model Reporting. Framework for documenting ML performance.
  • Raji, I.D. et al. (2020). Closing the AI Accountability Gap. ACM FAccT. Audit frameworks for AI systems.
  • Various (2025). AI Red Teaming and Adversarial Testing Best Practices.

Part III: The Solution

Chapter 11: Risk-Based Triage

How to capture AI productivity gains without eating compliance disasters

"Check everything the AI does" kills productivity. "Trust everything the AI does" invites disaster. The answer is risk-based triage: route outputs intelligently based on confidence, stakes, and consequences. This is how you capture the upside without eating the downside.

The Impossible Choice

Without triage, organisations face two equally bad options:

  • Option 1: Check everything manually. Review every AI output before it goes to production. This eliminates errors but also eliminates productivity gains. You've added an AI step to your workflow without removing human work—pure overhead.
  • Option 2: Trust blindly. Auto-approve all AI outputs and hope for the best. This captures productivity gains until something goes catastrophically wrong—a compliance violation, a customer-facing error, a regulatory fine.

Both options lead to the outcomes we're seeing: 42% abandonment (organisations giving up after realising neither option works) and early financial losses from AI-related errors (EY's 2025 survey finding).

The Triage Framework: Three Dimensions

Not all AI outputs carry equal risk. Intelligent triage accounts for three factors:

Three Dimensions of Risk-Based Triage

1. Confidence: How certain is the AI about this output? Provided as a score (0.0-1.0) or probability. Low confidence = higher risk.

2. Stakes: What's the cost of getting it wrong? Financial impact, legal exposure, reputational damage, safety risk. Higher stakes = more scrutiny required.

3. Reversibility: Can we catch and fix mistakes downstream, or is this a one-way decision? Reversible errors can be caught in QA; irreversible ones need prevention.

Based on these three dimensions, you define routing policies that balance productivity against risk. High confidence + low stakes + reversible = auto-approve. Low confidence OR high stakes OR irreversible = require human review.

Defining Routing Policies

Routing policies map combinations of confidence, stakes, and reversibility to actions. Here's a practical framework:

Example Routing Policy Table

  • High confidence (>0.9) + Low stakes (<$100) + Reversible → Auto-approve, log for spot-checking
  • Medium confidence (0.7–0.9) + Medium stakes ($100-$10k) → Queue for human review within 24 hours
  • Low confidence (<0.7) OR High stakes (>$10k) OR Irreversible → Block until human approves
  • Anomaly detected (input outside training distribution) → Escalate to specialist review
  • Compliance-sensitive category (legal, financial, healthcare) → Mandatory review regardless of confidence

Thresholds should be tuned based on your risk tolerance, industry regulations, and empirical false positive/negative rates
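
The routing policy above translates almost directly into code. A sketch with the thresholds expressed as plain conditions (the numbers mirror the example policy and should be tuned to your own risk tolerance):

```python
from dataclasses import dataclass


@dataclass
class Output:
    confidence: float      # 0.0-1.0 score from the model
    stakes_usd: float      # estimated cost of getting it wrong
    reversible: bool       # can a mistake be caught and fixed downstream?
    compliance_sensitive: bool = False
    anomaly: bool = False  # input outside the training distribution


def route(o: Output) -> str:
    """Map confidence, stakes, and reversibility to a routing action."""
    if o.compliance_sensitive:
        return "mandatory_human_review"
    if o.anomaly:
        return "specialist_review"
    if o.confidence < 0.7 or o.stakes_usd > 10_000 or not o.reversible:
        return "block_until_human_approves"
    if o.confidence > 0.9 and o.stakes_usd < 100:
        return "auto_approve_and_log"
    return "queue_for_review_within_24h"


print(route(Output(confidence=0.95, stakes_usd=20, reversible=True)))   # auto_approve_and_log
print(route(Output(confidence=0.80, stakes_usd=150, reversible=True)))  # queue_for_review_within_24h
```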

Case Study: Customer Support Ticket Routing

Let's see triage in action for a concrete use case: AI-powered support ticket categorisation and response suggestions.

Without triage (the bad options):

  • ❌ Check every AI suggestion → Agents spend as much time reviewing as they would drafting responses → Zero productivity gain
  • ❌ Trust all AI suggestions → Wrong categories, incorrect responses, customer complaints → Compliance nightmare

With triage (the smart approach):

Intelligent Routing Example: Support Tickets

AUTO: High Confidence, Low Stakes

Example: "Where's my order?" ticket categorised as "Order Status" with confidence 0.95

Action: Auto-route to fulfillment team, send templated response, log for audit. Zero human review needed.

Captures full productivity gain: 5-minute task → 30 seconds

REVIEW: Medium Confidence, Medium Stakes

Example: "I want a refund" ticket with confidence 0.80, potential $150 refund

Action: Queue for support agent review within 24 hours, show AI suggestion as starting point, agent modifies/approves.

Partial productivity gain: Agent reviews in 2 minutes vs drafts from scratch in 5 minutes

ESCALATE: Low Confidence or High Stakes

Example: "Your product injured me" with confidence 0.65 OR any mention of legal/safety concerns

Action: Immediate escalation to legal/risk team, no AI auto-response, flag for senior review. Human handles entirely.

No productivity gain but prevents catastrophic errors—worth the manual work

With this triage setup, you auto-approve 60% of tickets (high confidence, low stakes), queue 30% for quick review (medium confidence/stakes), and escalate 10% for full human handling (low confidence or high stakes). Using the gains above (roughly 90% of the time saved on auto-approved tickets, 60% on reviewed tickets, none on escalations), the blend is 0.6 × 90% + 0.3 × 60% + 0.1 × 0% ≈ 70% overall productivity improvement, without the risk of blind trust.

"Before triage, we were checking everything. Zero productivity gain. After implementing confidence-based routing, we're auto-approving 55% of tickets with zero errors. That's where the ROI came from."
— VP of Customer Support, E-Commerce Platform

Implementing Triage in Practice

Here's how to operationalise risk-based triage in your organisation:

Step 1: Demand Confidence Scores from Vendors

If the AI system can't tell you how confident it is about each output, you can't triage. Confidence scoring should be non-negotiable in capability audits (Chapter 10). Verify that scores are calibrated—when the AI says 90% confidence, it should be right 90% of the time.
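
Calibration is checkable in a few lines once confidence scores and outcomes are logged: bucket predictions by stated confidence and compare against observed accuracy in each bucket. A minimal sketch on toy data:

```python
from collections import defaultdict


def calibration_report(records: list[tuple[float, bool]]) -> None:
    """records: (confidence, was_correct). Prints stated vs observed accuracy per 0.1-wide bucket."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for confidence, correct in records:
        buckets[int(confidence * 10) / 10].append(correct)
    for lower in sorted(buckets):
        outcomes = buckets[lower]
        observed = sum(outcomes) / len(outcomes)
        print(f"confidence {lower:.1f}-{lower + 0.1:.1f}: "
              f"observed accuracy {observed:.2f} over {len(outcomes)} outputs")


# Toy data: if the 0.9+ bucket shows 0.67 observed accuracy, the scores are overconfident.
calibration_report([(0.95, True), (0.92, True), (0.91, False), (0.72, True), (0.74, False)])
```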

Step 2: Define Routing Policies Upfront

Don't deploy first and figure out triage later. Before production, stakeholders must agree on:

  • Confidence thresholds for auto-approval (e.g., >0.9 for low-stakes, >0.95 for medium-stakes)
  • Stake thresholds for mandatory review (based on dollar value, regulatory risk, customer impact)
  • Escalation paths for edge cases (who reviews? What's the SLA?)
  • Compliance overrides (categories that always require human review regardless of confidence)

Step 3: Log Everything for Auditing

Even auto-approved decisions should be logged with full context: input, AI output, confidence score, timestamp, user ID. If you later discover a systematic failure (e.g., AI was miscategorising a specific ticket type for three months), you need to identify which outputs were affected for remediation or customer notification.

Step 4: Tune Thresholds Based on Outcomes

Triage isn't "set and forget." As you gather production data, continuously tune:

  • If false positives are high (AI says high confidence but is wrong), raise the auto-approve threshold
  • If false negatives are high (AI says low confidence but is actually right), lower the escalation threshold
  • If specific categories have worse performance, add category-specific routing rules
  • As model accuracy improves (new model deployed), re-tune thresholds accordingly

Why Triage Is Non-Negotiable

Risk-based triage isn't paranoia or excessive caution. It's how you capture AI's productivity upside without eating its error downside. The math is straightforward:

  • No triage, check everything: Productivity gain = 0%, error rate = 0%, organisation abandons AI as "not worth it"
  • No triage, trust everything: Productivity gain = 100%, error rate = 5-15%, organisation suffers compliance violations and reputation damage, abandons AI as "too risky"
  • With triage, intelligent routing: Productivity gain = 50-70% (auto-approve high confidence, quick review medium confidence), error rate < 1% (escalate low confidence and high stakes), organisation scales AI successfully

The 5% of organisations extracting consistent AI value aren't the ones with better models. They're the ones with triage systems that let them trust AI where it's reliable and escalate where it's not.

Chapter Takeaways

  • Without triage, you're stuck between "check everything" (no productivity gain) or "trust everything" (compliance disaster)
  • Risk-based triage uses three dimensions: confidence (AI certainty), stakes (cost of error), reversibility (can mistakes be caught?)
  • Auto-approve high confidence + low stakes, queue medium confidence/stakes for review, escalate low confidence or high stakes
  • Define routing policies before deployment, log all decisions for auditing, tune thresholds based on production data
  • With triage, capture 50-70% productivity gains while keeping error rates <1%—that's how the 5% win

References & Further Reading

  • EY (2025). Most Companies Suffer Some Risk-Related Financial Loss Deploying AI. Survey findings on early AI losses.
  • Various (2025). Human-in-the-Loop Machine Learning. Patterns for intelligent human-AI collaboration.
  • Bansal, G. et al. (2021). Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. CHI.

Part IV: The Operating Model

Chapter 12: The New Playbook

From waterfall procurement to continuous experimentation

If the old playbook was "gather requirements → issue RFP → negotiate contract → implement → deploy," what's the new one? Here's how procurement transforms when you acknowledge that AI is perishable, probabilistic, and requires continuous experimentation.

The Old Playbook: Waterfall by Design

Traditional software procurement operated on long cycles because switching costs were genuinely high and implementations took quarters or years:

  1. Requirements gathering: 3 months of stakeholder interviews, documentation, specification writing
  2. RFP and vendor selection: 4 months of RFP issuance, vendor demos, evaluation, negotiation
  3. Contract negotiation: 2 months of legal review, pricing discussions, SLA definition
  4. Implementation: 6-18 months of deployment, customization, integration, testing
  5. Go-live: Single cutover event, training, stabilization

Total timeline: 15-27 months from kickoff to production. This made sense when software was stable, requirements were knowable, and the technology you specified in month 1 would still be relevant in month 24.

The New Playbook: Procurement Becomes Continuous

AI procurement needs to operate on quarterly cycles, not multi-year contracts. Here's what changes:

From Waterfall to Continuous: What Changes

Model Refresh Cycles

Old way: Lock into specific model version for 3-5 years

New way: Evaluate new models every quarter, swap when better/cheaper options emerge. Model abstraction layer makes this a config change, not a renegotiation.

Performance Review

Old way: Annual business review focused on uptime SLAs

New way: Quarterly reassessment of whether deployed AI still beats baseline. Kill underperformers immediately, don't wait for contract expiration.

Cost Optimization

Old way: Fixed pricing for contract term, renegotiate at renewal

New way: As model pricing drops (e.g., a GPT-5 Turbo priced below the original GPT-5), capture savings immediately. Switch to better cost/performance models quarterly.

Contract Structure

Old way: 3-year commitment with early termination penalties

New way: Quarterly renewal with 30-day exit clause, model-agnostic architecture. Pay for platform (annual) and models (consumption-based).

Gates Replace Milestones

Traditional projects have milestones: requirements complete, design approved, development finished, UAT passed, go-live. AI projects need gates—decision points where you kill or continue based on empirical results.

The difference is critical. Milestones measure progress toward deployment. Gates measure evidence of value. You don't advance past a gate just because you completed the work—you advance only if the data says it's working.

The Four-Gate Model

  • Gate 1 (Offline Eval): Does AI beat baseline on gold data? Kill if accuracy < baseline + 10%
  • Gate 2 (Shadow Deploy): Does it work on live data without interfering? Kill if failure rate > 15% or systematic bias detected
  • Gate 3 (Bounded Rollout): Does it deliver the target delta in production? Kill if ROI < 2× after 90 days
  • Gate 4 (Scale Decision): Do unit economics work at full scale? Kill if cost per transaction > value delivered

Each gate is a kill decision, not a progress report. If you don't hit the criteria, stop funding and document why.

This gated funding model prevents the "sunk cost trap" where organisations keep investing in failing AI because "we've already spent so much." Each gate forces a binary decision: does the data support continuing, yes or no?
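
A sketch of how those criteria might be encoded so each funding decision is a function of measurements rather than opinions (thresholds taken from the gate list above; adjust them to your own criteria):

```python
def gate_decision(gate: int, metrics: dict) -> str:
    """Return 'continue' or 'kill' for each of the four gates, using the criteria above."""
    if gate == 1:    # offline eval: must beat baseline by at least 10 points
        passed = metrics["accuracy"] >= metrics["baseline_accuracy"] + 0.10
    elif gate == 2:  # shadow deployment: acceptable failure rate, no systematic bias
        passed = metrics["failure_rate"] <= 0.15 and not metrics["systematic_bias"]
    elif gate == 3:  # bounded rollout: ROI multiple after 90 days
        passed = metrics["roi_multiple"] >= 2.0
    elif gate == 4:  # scale decision: unit economics at production volume
        passed = metrics["value_per_txn"] > metrics["cost_per_txn"]
    else:
        raise ValueError("unknown gate")
    return "continue" if passed else "kill"


print(gate_decision(1, {"accuracy": 0.86, "baseline_accuracy": 0.72}))  # continue
print(gate_decision(3, {"roi_multiple": 1.4}))                          # kill
```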

"We used to measure 'percent complete.' Now we measure 'evidence of value.' If you can't show ROI data, you don't get past the gate."
— CFO, Manufacturing Conglomerate

Contracts Support Experimentation

Traditional contracts specify deliverables and timelines. AI contracts need to specify evaluation criteria and exit ramps.

What new AI contracts look like:

  • Proof-of-value period: 30–90 day trial with clear success metrics written into the contract. Either party can exit with no penalty if metrics aren't hit.
  • Model swap rights: Customer can switch underlying models without vendor approval or additional fees. Contract binds platform access, not specific model versions.
  • Data ownership: Customer owns all inputs, outputs, fine-tuning data, and evaluation harnesses. No vendor lock-in through proprietary data formats.
  • Observability guarantees: Full access to prompts, logs, costs, performance metrics in exportable formats. Vendor can't hide black-box behavior behind proprietary interfaces.
  • Performance SLAs: Not uptime (easy to game), but accuracy/precision/recall on customer's evaluation data. Vendor commits to maintaining performance thresholds or customer can exit.

Funding Follows Value, Not Projects

Traditional IT budgeting funds projects: "We're allocating $2M for the CRM upgrade." AI budgeting should fund capabilities and release funds in tranches as value is proven.

Staged funding model:

  • Seed funding ($50k): Offline eval and shadow deployment. Low investment, fast learning. If this fails, you've killed it for $50k instead of $2M.
  • Tranche 2 ($200k): Released only if Gate 2 passes (shadow deploy succeeds, real-world failure modes understood, path to ROI clear).
  • Tranche 3 ($500k): Released only if Gate 3 passes (bounded rollout hits target metrics, unit economics positive at scale).
  • Full funding (scale investment): Released only when unit economics work at production volume. No faith-based funding.

This aligns spending with evidence. You don't fund the full implementation until you've proven the hypothesis with progressively higher fidelity testing.

Roles Change: From IT Project to Product Initiative

Traditional procurement is led by IT/ops teams executing stakeholder requirements. AI procurement needs a product manager with AI literacy, supported by cross-functional teams.

The New AI Procurement Team

AI Product Owner

Defines hypotheses, sets success criteria, decides which use cases to fund and which to kill. Not a traditional PM, not an engineer—a hybrid role with business and technical fluency.

Engineering Team

Builds evaluation harnesses, implements triage logic, manages model adapters. Treats AI deployment as continuous integration, not one-time install.

Business Owners

Define baselines, set success criteria, validate that improvements matter to operations. Ensure AI capabilities align with actual business needs, not hypothetical value.

Procurement & Legal

Negotiate contracts that support experimentation, exits, and continuous model refresh. Shift from "lock in vendor for 3 years" to "maintain optionality and adaptability."

It's not an IT project with a go-live date. It's a product initiative with continuous experimentation, measurement, and performance management.

The Playbook in Practice: Timeline Comparison

Let's compare timelines for deploying AI-powered customer support using old vs new playbooks:

Old Playbook (Requirements-First):
  • Month 0-3: Requirements gathering, stakeholder alignment
  • Month 3-7: RFP process, vendor demos, evaluation
  • Month 7-9: Contract negotiation, legal review
  • Month 9-15: Implementation, customization, integration
  • Month 15: Go-live, then discover if it works

Total: 15 months to production, $2M+ spent before knowing if it works

New Playbook (Hypothesis-Driven):
  • Week 1-2: Define hypothesis, create evaluation harness ($10k)
  • Week 3-4: Offline eval with 3 vendor models, pick winner ($15k)
  • Week 5-8: Shadow deployment, measure real-world performance ($25k)
  • Month 3-4: Bounded rollout (10% of tickets), measure ROI ($100k)
  • Month 5+: Scale if metrics hit, kill if they don't

Total: Value proven or hypothesis killed in 4 months, $150k invested before scale decision

The new playbook delivers roughly 75% faster time-to-value and more than 90% lower sunk costs if you need to kill the initiative.

Chapter Takeaways

  • Procurement shifts from waterfall (3-year contracts) to continuous (quarterly refresh cycles with exit clauses)
  • Gates replace milestones—you advance based on evidence of value, not completion of tasks
  • Contracts support experimentation with proof-of-value periods, model swap rights, data ownership, and exit ramps
  • Funding follows value in tranches—$50k for offline eval, $200k for shadow deploy, full funding only when ROI is proven
  • AI procurement is a product initiative, not an IT project—requires AI product owner, not traditional project manager

References & Further Reading

  • Ries, E. (2011). The Lean Startup. Validated learning and hypothesis-driven development.
  • Cagan, M. (2017). Inspired: How to Create Tech Products Customers Love. Product management principles.
  • Various (2025). Agile Procurement Frameworks for AI. Emerging best practices.

Part IV: The Operating Model

Chapter 13: Economics That Work

Making the math pencil out: from project budgets to unit economics

The 42% abandonment rate isn't just a deployment problem—it's an economics problem. Organisations are scaling proofs-of-concept, not value. Here's how to structure AI economics so they actually deliver return on investment.

Unit Economics Over Project Budgets

Traditional IT projects measure success by "on time, on budget, on spec." The project either completes or it doesn't. AI projects need to measure unit economics: dollars saved or revenue generated per 1,000 AI requests, after all costs.

This shift matters because traditional projects succeed when they deploy. AI projects succeed when they deliver ongoing value at sustainable cost. Deployment is just the beginning.

The Unit Economics Formula

Unit Value = (Time Saved × Hourly Cost) + (Error Reduction × Cost per Error) + Revenue Uplift

Unit Cost = Model API Cost + Infrastructure Cost + Human Review Cost + Maintenance & Ops

Net Value = Unit Value - Unit Cost

Success thresholds:

  • Net value below zero: You're burning money. Kill immediately.
  • Net value positive but less than 2× unit cost: Barely breaking even after risk and opportunity cost. Questionable.
  • Net value of 3× unit cost or more: Sustainable ROI that justifies deployment complexity. Deploy and scale.

Example: Customer Support Ticket Triage That Works

Let's work through a real scenario with actual numbers:

Current baseline:

  • 1,000 tickets per day coming into support queue
  • Average 15 minutes to read, categorise, and route each ticket
  • Support agent cost: $40/hour fully loaded → $10 per ticket
  • Total daily cost: $10,000 in labor

AI hypothesis:

  • AI categorises tickets in seconds, suggests responses based on category
  • Agents review and approve AI suggestions in 3 minutes instead of drafting from scratch in 15 minutes
  • 80% of tickets handled with AI assist, 20% too complex and escalated to full manual handling

AI economics breakdown:

Cost Structure Analysis
  • Model API cost: $0.10 per ticket (GPT-5 inference, categorization + response suggestion)
  • Infrastructure: $500/day (hosting, vector DB, observability, logging)
  • AI-assisted review cost: 800 tickets × 3 min × $40/hr = $1,600/day
  • Manual handling (20% escalated): 200 tickets × 15 min × $40/hr = $2,000/day
  • Total AI cost: $100 + $500 + $1,600 + $2,000 = $4,200/day

Net value calculation:

  • Daily savings: $10,000 (baseline) - $4,200 (AI) = $5,800/day
  • ROI: $5,800 / $4,200 = 138% daily return
  • Annual savings: $5,800 × 365 = $2.1M/year
  • Payback on $50k implementation: 9 days
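
The same arithmetic, written once as a reusable function; the inputs below are the ticket-triage figures from this example (a sketch, not an accounting standard):

```python
def daily_unit_economics(baseline_cost: float, api_cost: float, infra_cost: float,
                         review_cost: float, manual_cost: float) -> dict:
    """Daily cost, savings, and ROI for an AI-assisted workflow (all figures in dollars per day)."""
    ai_cost = api_cost + infra_cost + review_cost + manual_cost
    savings = baseline_cost - ai_cost
    return {"ai_cost": ai_cost, "daily_savings": savings, "roi": savings / ai_cost}


print(daily_unit_economics(
    baseline_cost=10_000,  # 1,000 tickets x 15 min x $40/hr
    api_cost=100,          # 1,000 tickets x $0.10
    infra_cost=500,        # hosting, vector DB, observability, logging
    review_cost=1_600,     # 800 AI-assisted tickets x 3 min x $40/hr
    manual_cost=2_000,     # 200 escalated tickets x 15 min x $40/hr
))  # {'ai_cost': 4200, 'daily_savings': 5800, 'roi': 1.38...}
```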

Counter-Example: Contract Review That Fails Economics

Not every use case works financially. Here's why some AI deployments get killed despite technical success:

Current baseline:

  • 50 contracts per month reviewed by legal team
  • 2 hours per contract × $200/hour (attorney cost) = $400 per contract
  • Total monthly cost: $20,000 in legal review time

AI hypothesis:

  • AI reads contracts, flags risky clauses, suggests redlines based on playbook
  • Legal reviews AI output and makes final judgment calls
  • Expectation: Save 30-40% of attorney time per contract

AI economics (the disappointing reality):

Cost Structure Analysis
  • Model API cost: $2.00 per contract (long documents, complex prompts, multiple passes)
  • Infrastructure: $500/month (hosting, fine-tuned model, observability)
  • Attorney review cost: Legal still needs 90 minutes to verify AI suggestions and make judgment calls = $300 per contract
  • Total monthly AI cost: ($2 × 50) + $500 + ($300 × 50) = $100 + $500 + $15,000 = $15,600/month

Net value (underwhelming):

  • Monthly savings: $20,000 - $15,600 = $4,400/month
  • ROI: $4,400 / $15,600 = 28% return
  • But: Only saves 30 minutes per contract, and attorneys report lower confidence in AI-flagged issues
  • Reality: Time "saved" is spent verifying AI didn't miss critical clauses

Avoiding the "Scaled Proof" Trap

Gartner and S&P's failure statistics are full of organisations that scaled proofs-of-concept instead of scaling proven value. Here's how it happens:

  1. POC works on 100 test cases → Impressive demo, stakeholders excited
  2. Exec sponsors greenlight full deployment → Budget approved, team mobilized
  3. Production rollout reveals edge cases → Accuracy degrades on real-world diversity
  4. Hidden costs emerge → More human review needed than predicted, infrastructure costs higher than expected
  5. Unit economics flip negative at scale → What worked for 100 cases loses money at 10,000
  6. Project gets quietly shelved → Another contribution to the 42% abandonment rate
"The POC was perfect. But perfect on 100 cherry-picked examples. At 10,000 production cases, the accuracy dropped 20 points and costs tripled. We scaled the proof, not the value."
— Director of AI, Telecommunications Company

To avoid this trap: measure unit economics at every gate (Chapter 12). Don't scale until you've proven positive ROI in bounded production at realistic volume, not just POC demos.

The Time-to-First-Value Metric

Traditional IT measures "time to deployment." AI should measure time to first dollar of value—how quickly you generate measurable savings or revenue that exceeds costs.

What to track:

  • Time to hypothesis validation: How many days from "we think AI can help with X" to "offline eval confirms it works"? Target: < 14 days.
  • Time to bounded production: How many days from validated hypothesis to live production with 10% of traffic? Target: < 60 days.
  • Time to positive cumulative value: How many days until cumulative savings exceed all costs including implementation? Target: < 90 days.

Time-to-Value Benchmarks

Warning Signs
  • Time to validation > 30 days → Hypothesis too complex, break it down
  • Time to bounded production > 90 days → Too much infrastructure build, reduce scope
  • Time to positive value > 180 days → Wrong use case, unit economics won't work

Winning Patterns
  • Hypothesis validated in 1-2 weeks → Clear, testable, right-sized
  • Bounded production within 4-8 weeks → Lean implementation, fast learning
  • Positive value within 60-90 days → Strong unit economics, real ROI

If time-to-first-value exceeds 90 days, you're probably building too much infrastructure or chasing the wrong use case. The winners are shipping measurable value in weeks, not quarters.

Making the Economics Work: Checklist

Before deploying any AI initiative, verify unit economics pass these tests:

Unit Economics Pre-Deployment Checklist
  • Net value > 3× cost after accounting for all expenses (API, infrastructure, human review, maintenance)
  • Measured at realistic volume, not cherry-picked test cases—prove it works at 10,000 requests, not 100
  • Accounts for human review costs—don't assume "AI handles it" means zero human time
  • Includes failure remediation—when AI makes mistakes, what does it cost to fix them?
  • Time to first value < 90 days—longer timelines mean wrong use case or over-engineering
  • Business owner validates value—not just IT saying "it works" but operations confirming "this helps"

If you can't check all six boxes, don't scale. Either redesign the hypothesis, reduce scope, or kill it and move to a use case with better economics.

Chapter Takeaways

  • Measure unit economics (value vs cost per 1,000 requests), not project completion—deployment is the start, not the end
  • Net value must exceed 3× cost to justify deployment—barely positive ROI doesn't cover risk and opportunity cost
  • High-volume, low-stakes use cases work best—support tickets (1000/day) beat contract review (50/month)
  • Don't scale proofs of concept—scale proven value measured at realistic production volume with all costs included
  • Time to first dollar of value should be < 90 days—longer timelines indicate wrong use case or over-engineering

References & Further Reading

  • Maurya, A. (2012). Running Lean. O'Reilly Media. Unit economics and validation frameworks.
  • Blank, S. (2013). The Four Steps to the Epiphany. Customer validation and value measurement.
  • Various (2025). AI ROI Measurement Best Practices. Emerging frameworks for AI economics.

Part IV: The Operating Model

Chapter 14: Change as Product Work

The AI isn't the hard part—changing how people work is

Here's the uncomfortable truth that most AI vendors won't tell you: the AI isn't the hard part. The hard part is changing how people work. BCG's study found that the 5% of companies extracting consistent AI value invest heavily in change management, workflow redesign, and new roles. The organisations failing are the ones that deploy AI and expect people to just... use it.

The Deployment Fallacy

Most AI deployments fail this basic test: they automate the current process instead of redesigning the workflow around AI capabilities. This is automation theater—adding steps without removing work.

Create New Roles, Don't Just Augment Old Ones

AI doesn't just augment existing roles—it creates entirely new ones that didn't exist before. The organisations winning are the ones creating these roles explicitly, with clear responsibilities and career paths.

The New AI-Era Roles

AI Product Owner

Responsibilities: Defines hypotheses, sets success criteria, decides which AI use cases to fund and which to kill. Owns the P&L for AI capabilities.

Not a traditional PM, not an engineer: a hybrid role requiring business acumen and technical AI literacy.

Prompt Evaluator

Responsibilities: Tests prompts against evaluation data, tunes for performance, manages prompt versions, runs A/B tests on formulations.

This is QA for probabilistic systems—requires understanding of AI behavior and systematic testing methodology.

Triage Analyst

Responsibilities: Monitors confidence scores, adjusts routing thresholds, investigates systematic failures, tunes triage policies based on production data.

This is observability for AI workflows—like SRE but for AI reliability and performance.

Model Operations (MLOps)

Responsibilities: Swaps models quarterly, monitors drift, manages evaluation harnesses, runs comparative benchmarks, optimizes cost/performance tradeoffs.

Like DevOps but for AI infrastructure—ensures models stay current and performant.

If you don't create these roles explicitly, AI becomes "somebody's side project" and dies from neglect or mismanagement.

"We tried making AI everyone's job. It became no one's job. Once we created dedicated roles, adoption went from 15% to 80% in three months."
— VP of Operations, SaaS Company

Train AI Literacy, Not Tool Features

Traditional software training is "here's how to use the CRM: click here, enter data here, hit save." AI training needs to be fundamentally different: "here's how AI works, here's where it fails, here's when to trust it and when to override it."

Key concepts every AI user needs to understand:

  • Probabilistic outputs: AI gives you a probably-correct answer, not a definitely-correct one. Confidence scores matter. A 0.95 confidence prediction is different from 0.70—learn to act on that signal.
  • Systematic failure modes: Where does this specific AI systematically fail? (e.g., "performs poorly on non-English names" or "hallucinates citations when confidence <0.75"). Don't discover this through customer-facing errors.
  • Escalation criteria: When should you ignore the AI suggestion and do it manually? When should you escalate to a specialist? This needs to be taught explicitly, not learned through trial and error.
  • Feedback loops: How do you report failures so the system improves? If users see errors but don't report them, the system can't learn. Make feedback mechanisms explicit and easy.

Measure Adoption, Not Deployment

Deployment is "we turned it on." Adoption is "people are actually using it and getting value." The gap between the two explains most AI failures.

Adoption metrics that matter:

  • Active usage rate: What percentage of eligible users are actively using the AI? Not "logged in once" but "use it daily for >50% of eligible tasks"
  • AI suggestion acceptance rate: What percentage of AI outputs are accepted vs manually overridden? Low acceptance = trust problem or accuracy problem
  • Workaround detection: Are users routing around the AI? (e.g., copy-pasting AI outputs into manual tools, using shadow spreadsheets, reverting to old workflows)
  • Net Promoter Score from users: Would frontline workers recommend this AI to colleagues? NPS < 0 = adoption will collapse
  • Time-to-proficiency: How long does it take new users to become productive? If it takes weeks, you have a training or UX problem

Deployment vs Adoption: The Reality Check

Deployment Success (Meaningless)
  • AI system is live in production
  • All users have access and credentials
  • Training sessions completed
  • IT checklist says "Done"

Adoption Success (What Matters)
  • 70%+ of users actively using AI daily
  • AI suggestions accepted 60%+ of time
  • Zero workarounds or shadow systems
  • User NPS > +20

If deployment is 100% but adoption is 20%, you've built something people don't want or don't trust. That's a product problem, not a training problem.

Iterate Based on User Feedback

AI isn't "ship and forget." It's "ship, observe, tune, repeat." BCG's "future-built" firms treat AI as a product with a continuous improvement cycle.

The feedback loop in practice:

  1. Deploy to 10% of users (early adopters who'll provide detailed feedback)
  2. Collect feedback systematically on where AI fails, where it surprises positively, where workflows need redesign
  3. Tune based on data: adjust prompts, refine triage thresholds, retrain on newly discovered edge cases
  4. Expand to 30% of users, incorporating lessons from the first 10%
  5. Repeat the cycle until you reach 80-90% adoption with sustained positive NPS

The organisations failing are the ones that deploy to 100% on day one, then wonder why adoption is low and errors are high. Gradual rollout with continuous tuning is how you get to durable ROI.

"We shipped to 100% of agents on launch day. Adoption was 12%. We rolled back, redeployed to 10%, fixed issues, expanded gradually. Six months later we're at 85% adoption."
— Director of Customer Experience, Financial Services

The Difference Between Demo Magic and Durable ROI

Demo magic is easy: controlled environment, cherry-picked examples, no edge cases, no organisational friction. Everyone leaves the demo room excited. Then reality hits.

Durable ROI requires:

  • Workflows redesigned around AI strengths, not AI bolted onto broken processes
  • New roles to manage and tune the systems—AI Product Owner, Prompt Evaluator, Triage Analyst, MLOps
  • Training on AI literacy so people understand when to trust, when to override, how to provide feedback
  • Gradual rollout with feedback loops to improve performance based on real-world usage
  • Adoption measurement that tracks actual usage, not deployment completion

This is change management as product work. It's messy, iterative, and requires continuous investment. But it's also the difference between the 5% who extract consistent AI value and the 42% who abandon their initiatives.

Chapter Takeaways

  • The AI isn't the hard part—changing how people work is. Redesign workflows around AI strengths, don't automate broken processes
  • Create new roles explicitly: AI Product Owner, Prompt Evaluator, Triage Analyst, MLOps—don't make AI "everyone's job"
  • Train AI literacy (how AI works, where it fails, when to trust vs override), not tool features (click here, enter this)
  • Measure adoption (active usage, acceptance rate, NPS), not deployment (we turned it on)—the gap explains most failures
  • Iterate continuously: deploy to 10%, collect feedback, tune, expand to 30%, repeat—don't go 0 to 100% on day one

References & Further Reading

  • Kotter, J. (2012). Leading Change. Harvard Business Review Press. Change management fundamentals.
  • Heath, C. & Heath, D. (2010). Switch: How to Change Things When Change Is Hard. Broadway Books.
  • BCG (2025). The Widening AI Value Gap. Findings on organisational capabilities driving AI success.

Part V: The Way Forward

Chapter 15: What Winners Do Differently

The patterns that separate the 5% who succeed from the 42% who abandon

BCG's research on the 5% of companies extracting consistent AI value reveals a pattern. These organisations aren't smarter, luckier, or better funded. They're doing fundamentally different things—things that look obvious in hindsight but require discipline to execute.

They Embed AI Into Core Workflows, Not Staple It On

Losers treat AI as a feature addition: "We added an AI chatbot to the website." "We enabled AI-suggested replies in the CRM." The AI sits adjacent to work, not integrated into it.

Winners redesign the workflow from first principles: "We rebuilt our customer service workflow so AI handles tier-1 queries end-to-end, agents focus on tier-2+ complexity, and escalations route automatically based on confidence scores."

The difference is integration depth. Stapled-on AI is optional—people can ignore it, route around it, or begrudgingly use it. Embedded AI is part of the standard operating procedure. It's not "would you like AI help?" but "this is how we work now."

They Start With High-Volume, Low-Stakes Use Cases

Losers chase the sexy use cases first: "AI-powered strategic planning." "Automated financial forecasting." "AI-driven M&A analysis." High stakes, complex workflows, low error tolerance—exactly where AI is weakest.

Winners start boring: "AI categorises support tickets." "AI extracts invoice data." "AI summarises meeting notes." High volume, low stakes, clear success criteria, easy to measure improvement.

Why boring wins:

  • Build evaluation harnesses on abundant data: You have 10,000 historical support tickets, not 50 strategic plans. More data = better testing.
  • Iterate quickly without regulatory risk: If ticket categorisation fails, you fix it. If financial forecasting fails, regulators investigate.
  • Prove ROI in weeks, not quarters: High-volume use cases deliver measurable value fast. Strategic use cases take months to validate.
  • Build organisational AI literacy on forgiving problems: Learn on low-stakes tasks where mistakes are recoverable, not high-stakes decisions where errors are catastrophic.

Once you've proven the model works on boring problems and the organisation understands how to deploy AI successfully, then you tackle strategic ones. Boring first. Sexy later.

"We wanted to start with AI-driven pricing strategy. Our consultant said 'start with invoice processing.' We rolled our eyes. That boring use case paid for itself in 6 weeks and taught us how AI actually works."
— CFO, Manufacturing Company

They Treat AI as a Capability to Continuously Improve, Not a Project to Complete

Losers think in projects: "We're implementing the AI solution. Go-live is Q3. Then we're done." Project managers track completion percentage. Success is measured by hitting the go-live date.

Winners think in capabilities: "We're building an AI-augmented support capability. Initial deployment is Q3. Then we tune prompts, retrain on failures, adjust triage thresholds, swap models quarterly, and measure ROI continuously."

Project Mindset vs Capability Mindset

Project Mindset (Fails)
  • Success = hit go-live date on budget
  • Measure completion percentage
  • Deploy and move to next project
  • Maintenance is someone else's problem
  • No ongoing performance tracking

Capability Mindset (Wins)
  • Success = sustained ROI and adoption
  • Measure unit economics and NPS
  • Deploy, then tune continuously
  • Team owns capability long-term
  • Weekly performance reviews

The project mindset leads to "deploy and forget." The capability mindset leads to continuous improvement, which is where durable ROI comes from. AI isn't software you install—it's a capability you cultivate.

They Instrument Relentlessly

Losers deploy AI and hope it works. Winners deploy AI and measure whether it works with obsessive detail.

What winners instrument:

  • Log every input, output, confidence score, user override: Full audit trail for debugging and compliance
  • Track unit economics per workflow: Cost per request, time saved per transaction, error rate, value delivered
  • Monitor drift continuously: Is performance degrading as input patterns change? Are new edge cases appearing?
  • Run A/B tests on models: GPT-5 vs Claude Sonnet 4.5 vs Llama 3.3 on the same tasks, pick the winner based on data
  • Publish internal dashboards: Everyone sees what's working and what's not—transparency drives accountability

Instrumentation isn't bureaucracy. It's how you know when to scale (ROI positive, adoption high) and when to kill (economics don't work, accuracy degrading). Without measurement, you're flying blind.
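
In practice, "log everything" means one structured record per AI request, written to storage you control. A minimal sketch with illustrative field names:

```python
import json
import time
import uuid


def log_ai_request(workflow: str, model: str, prompt: str, output: str,
                   confidence: float, cost_usd: float, user_override: bool) -> dict:
    """One record per request: enough to audit decisions, debug failures, and compute unit economics later."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "workflow": workflow,
        "model": model,
        "prompt": prompt,
        "output": output,
        "confidence": confidence,
        "cost_usd": cost_usd,
        "user_override": user_override,
    }
    with open("ai_requests.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


log_ai_request("ticket_triage", "model-a", "Where is my order?", "order_status",
               confidence=0.93, cost_usd=0.0004, user_override=False)
```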

They Build for Model Swapping from Day One

Losers hardcode dependencies: "Our system uses GPT-5." When GPT-6 launches or Claude Sonnet 5 proves superior, they're stuck renegotiating contracts or rewriting code.

Winners abstract models from the start: "Our system uses a swappable LLM adapter. We're currently routing to GPT-5, but we can switch to Claude Sonnet 4.5 or Llama 3.3 with a config change and retest against our evaluation harness."

This isn't over-engineering or premature optimization. It's acknowledging that models are perishable (Chapter 6) and building for the inevitable. Model improvements happen quarterly. Your architecture should support quarterly upgrades.
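
A minimal sketch of the adapter idea, assuming Python; the provider classes below are stubs rather than real SDK calls, and the point is that application code depends only on the interface while routing lives in config.

```python
from abc import ABC, abstractmethod

class LLMAdapter(ABC):
    """Application code depends only on this interface, never on a vendor SDK."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIAdapter(LLMAdapter):
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the OpenAI SDK here")  # stubbed for illustration

class AnthropicAdapter(LLMAdapter):
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the Anthropic SDK here")  # stubbed for illustration

# Routing lives in config, not code: swapping models is a config change
# plus a rerun of the evaluation harness, not an application rewrite.
CONFIG = {"provider": "openai", "model": "gpt-5"}
REGISTRY = {"openai": OpenAIAdapter, "anthropic": AnthropicAdapter}

def get_adapter(config: dict) -> LLMAdapter:
    return REGISTRY[config["provider"]](config["model"])

adapter = get_adapter(CONFIG)
# adapter.complete("Categorise this support ticket: ...")
```

Swapping to a new model then means changing the config, rerunning the evaluation harness, and shipping; no renegotiation, no rewrite.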

They Invest in AI Literacy Across the Organisation

Losers treat AI as an IT initiative led by engineering. Winners treat it as an organisational capability requiring cross-functional fluency.

AI Literacy by Role

Executives

Understand AI strengths and limitations well enough to participate in go/no-go decisions intelligently. Don't need to code, but need to ask the right questions.

Frontline Workers

Know when to trust AI outputs and when to override. Understand confidence scores, recognize systematic failure modes, provide useful feedback.

Product Managers

Know how to frame hypotheses, set success criteria, interpret evaluation results. Own the use cases and kill decisions.

Engineering

Know how to build evaluation harnesses, implement model adapters, instrument for observability. Treat AI as continuous deployment, not one-time install.

BCG's "future-built" firms invest heavily in training, internal communities of practice, and knowledge sharing. AI literacy becomes a competitive advantage.

They Kill Failures Fast

Losers keep funding underperforming AI projects because "we've already invested so much." Sunk cost fallacy at organisational scale. Projects limp along for quarters, consuming budget and management attention, delivering no value.

Winners gate funding and kill ruthlessly based on data:

  • Hypothesis fails offline eval? Kill it immediately, document why, move to next use case. Cost: $5k, 2 weeks elapsed.
  • Shadow deployment reveals systematic failures? Kill it, try different approach or different model. Cost: $25k, 6 weeks elapsed.
  • Bounded rollout hits targets but unit economics don't work at scale? Kill it before full deployment. Don't scale loss-makers. Cost: $75k, 12 weeks elapsed.

The 42% abandonment rate includes both smart kills (failed experiments stopped early based on data) and expensive failures (projects that scaled before proving value). Winners optimize for fast, cheap kills based on empirical evidence, not executive opinion.

"We've killed seven AI initiatives in the past 18 months. The three we kept generated $4M in value. Killing fast is how we found the winners."
— Chief Digital Officer, Insurance Company

The Pattern: Different Actions, Not Different Resources

Here's what's striking about BCG's 5%: they didn't buy better AI. They didn't have bigger budgets. They didn't hire armies of PhDs. They bought differently—with experimentation, measurement, and continuous improvement baked in. And they built differently—for perishability, composability, and organisational learning.

The gap between 5% success and 42% abandonment isn't about AI technology. It's about procurement models, architectural choices, and operating discipline. The playbook is available. The tools exist. The difference is execution.

Chapter Takeaways

  • Winners embed AI into core workflows (it's how we work), not staple it on (optional feature people ignore)
  • Start with boring, high-volume, low-stakes use cases to prove ROI fast and build AI literacy—save sexy use cases for later
  • Treat AI as a capability to improve continuously, not a project to complete—tune prompts, swap models, measure ROI ongoing
  • Instrument relentlessly: log everything, track unit economics, monitor drift, run A/B tests—measurement drives optimization
  • Kill failures fast based on data—offline eval kill ($5k) beats production failure ($500k). Optimize for cheap, fast kills.

References & Further Reading

  • BCG (2025). The Widening AI Value Gap: Build for the Future 2025. Detailed analysis of the 5% who win.
  • Ries, E. (2011). The Lean Startup. Build-measure-learn loops and fast failure.
  • Cagan, M. (2017). Inspired. Product management principles for continuous improvement.

Part V: The Way Forward

Chapter 16: Your Next 24 Hours

Practical actions to avoid becoming part of the 42%

Reading about AI procurement transformation is one thing. Doing something about it is another. Here's what to do in the next 24 hours—specific, actionable steps based on where you are right now.

If You're Currently Evaluating AI Vendors

Stop the traditional RFP process. Seriously. If you're running a requirements-checklist procurement with six-month evaluation timelines, you're optimizing for the wrong things and setting yourself up for the 42% abandonment rate.

Today's Action Plan

1. Pick One Boring, High-Volume Use Case

Support ticket triage, invoice data extraction, email categorization—something with measurable impact, clear baseline, and 1,000+ examples to test on.

2. Define the Hypothesis

Current baseline performance, target improvement, success criteria. "An LLM can reduce response time from X to Y with Z% accuracy."

3. Run One-Week Evaluation

Give vendors (or test yourself via API) 200-500 real examples from your data. Demand offline eval results with disclosed failure modes. Not demos—empirical testing.

4. Negotiate 30-Day Proof-of-Value Period

Deploy in shadow mode, measure against baseline, either party can exit with no penalty. Focus on learning, not commitment.

You'll learn more from one week of real testing than six months of demos and RFP responses.
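
A minimal sketch of that one-week test, assuming Python and a CSV of a few hundred labelled examples from your own data; the CANDIDATES callables, the file path, and the exact-match accuracy metric are placeholders to replace with your vendors' endpoints and whatever success criterion your hypothesis defined.

```python
import csv
from typing import Callable, Dict

# Each candidate is a callable you supply: a vendor endpoint, your own
# prompt plus model via an API, and so on. The lambdas are placeholders.
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "vendor_a": lambda text: "billing",
    "vendor_b": lambda text: "technical",
}

def evaluate(path: str = "gold_examples.csv") -> None:
    """Score every candidate on the same 200-500 labelled examples."""
    with open(path, newline="", encoding="utf-8") as f:
        examples = list(csv.DictReader(f))   # columns: input, expected

    for name, predict in CANDIDATES.items():
        correct, failures = 0, []
        for row in examples:
            got = predict(row["input"])
            if got.strip().lower() == row["expected"].strip().lower():
                correct += 1
            else:
                failures.append((row["input"][:60], row["expected"], got))
        print(f"{name}: accuracy={correct / len(examples):.1%} on {len(examples)} examples")
        # Disclosed failure modes: inspect the misses, not just the headline number.
        for snippet, expected, got in failures[:5]:
            print(f"  MISS: {snippet!r} expected={expected!r} got={got!r}")

if __name__ == "__main__":
    evaluate()
```

The same script works whether the candidates are competing vendors or competing models behind your own prompt; what matters is that everyone is judged on identical, real examples.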

"We paused our 4-month RFP, ran a 10-day test on actual data with three vendors. Had our answer in two weeks. Saved us four months and $200k in evaluation costs."
— VP of Procurement, Logistics Company

If You've Already Deployed AI and It's Underperforming

Instrument before you optimize. You can't fix what you can't measure. Most underperforming AI deployments lack basic observability, making it impossible to diagnose why they're not delivering value.

Today's Diagnostic Checklist

1. Add Logging

Capture every AI input, output, confidence score, user override. Without audit trails, you're flying blind.

2. Calculate Unit Economics

What's cost per AI request (API + infrastructure + review)? What's value delivered (time saved, errors prevented)? Is net value positive?

3. Identify Systematic Failures

Where does AI consistently fail? Certain document types? Edge cases? Low-confidence scenarios? Map the failure modes.

4. Implement Triage

Auto-approve high-confidence/low-stakes, escalate everything else. Stop checking everything manually—that kills productivity gains.

If unit economics don't improve within 30 days, you have a kill decision to make. Don't throw good money after bad.
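
A minimal sketch of the triage rule and the unit-economics check, assuming Python; the confidence threshold, cost figures, and time-saved estimate are placeholders to replace with measurements from your own audit logs.

```python
from dataclasses import dataclass

# Placeholder figures; replace with measured values from your audit logs.
COST_PER_REQUEST_USD = 0.05      # API + infrastructure + amortised review
MINUTES_SAVED_PER_AUTO = 4.0     # versus the human baseline
LOADED_COST_PER_MINUTE = 0.80    # fully loaded staff cost

@dataclass
class Prediction:
    label: str
    confidence: float
    high_stakes: bool

def triage(p: Prediction, threshold: float = 0.90) -> str:
    """Auto-approve high-confidence, low-stakes outputs; escalate the rest."""
    if p.confidence >= threshold and not p.high_stakes:
        return "auto_approve"
    return "escalate_to_human"

def net_value_per_auto_approved() -> float:
    """Positive means each auto-approved request pays for itself."""
    return MINUTES_SAVED_PER_AUTO * LOADED_COST_PER_MINUTE - COST_PER_REQUEST_USD

print(triage(Prediction("billing query", 0.95, high_stakes=False)))    # auto_approve
print(triage(Prediction("refund over $10k", 0.97, high_stakes=True)))  # escalate_to_human
print(f"Net value per auto-approved request: ${net_value_per_auto_approved():.2f}")
```

If that net value stays negative after a month of tuning the threshold and the failure modes, the numbers are telling you something: treat it as a kill signal, not a reason to add more manual review.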

If You're Planning an AI Initiative

Start with architecture, not use cases. If you build for one use case and hardcode dependencies, you'll have to rebuild for the next one. Architecture designed for perishability and swappability pays dividends across all future initiatives.

This Week's Architecture Decisions

Design Your Three-Layer Architecture
  • Layer 1: What commodity infrastructure will you buy? (auth, data, observability)
  • Layer 2: How will you abstract models so you can swap them without rewrites?
  • Layer 3: What's your process for generating, deploying, and retiring task-specific apps?
Build an Evaluation Harness

Even before choosing a use case, set up infrastructure to test hypotheses with gold data. This becomes your reusable testing platform; a minimal sketch of such a harness follows this list.

Negotiate Model-Agnostic Contracts

Don't lock into a vendor's preferred LLM. Insist on swap-out rights, data ownership, observability, and 30-day exit clauses.
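
Here is that harness sketch, written as a plain Python script you could wire into CI; the gold-data file, the 90% gate, and the placeholder candidate function are assumptions to adapt to your own hypotheses, and the candidate would normally be a call through your swappable adapter.

```python
import csv
import sys

ACCURACY_GATE = 0.90   # assumed go/no-go threshold; set yours per hypothesis

def run_harness(predict, gold_path: str = "gold_examples.csv") -> float:
    """Score any candidate (current model, new model, new prompt) against the
    same gold data before it is allowed anywhere near production."""
    with open(gold_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))       # columns: input, expected
    correct = sum(
        1 for r in rows
        if predict(r["input"]).strip().lower() == r["expected"].strip().lower()
    )
    return correct / len(rows)

if __name__ == "__main__":
    candidate = lambda text: "billing"       # placeholder for adapter.complete(...)
    accuracy = run_harness(candidate)
    print(f"accuracy={accuracy:.1%} (gate={ACCURACY_GATE:.0%})")
    sys.exit(0 if accuracy >= ACCURACY_GATE else 1)   # non-zero exit fails the check
```

Because the gate is mechanical, every quarterly model swap and every prompt change faces the same test, which is exactly what makes swappability safe rather than reckless.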

If You're a CTO or Procurement Leader

Change how you fund and gate AI projects. Stop running them like traditional IT implementations with fixed scopes and go-live dates. Shift to hypothesis-driven experimentation with staged funding.

Tomorrow's Leadership Meeting Agenda

1. Adopt Continuous Procurement

Shift from 3-year contracts to quarterly refresh cycles with exit clauses. Structure contracts for adaptability, not lock-in.

2. Implement Gates, Not Milestones

Fund in tranches based on empirical results (offline eval → shadow deploy → bounded rollout → scale). Kill at any gate if data doesn't support continuing.

3. Hire or Designate an AI Product Owner

Someone who understands both technology and business, owns hypothesis definition and kill decisions. Not a traditional PM or IT project manager.

4. Make AI Literacy Mandatory Training

Execs, managers, frontline workers—everyone needs to understand probabilistic systems, confidence scores, when to trust vs override.

If You're Skeptical AI Will Ever Work

You might be right—but test it before deciding. Your skepticism could be valid (wrong use case for your organisation) or it could be based on bad vendor experiences. The only way to know is empirical testing.

One Action This Week
  1. Pick the most boring, high-volume task: Something people complain takes too much time but isn't strategic (data entry, categorization, basic summarization)
  2. Spend $500 on API credits: Use GPT-5, Claude Sonnet 4.5, or Llama via API to process 1,000 examples from your actual data
  3. Measure: Did it beat baseline performance? Where did it fail? What would unit economics look like at scale?

If it fails, you've confirmed skepticism with $500 and one week. If it succeeds, you have a roadmap and empirical proof to show stakeholders.
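
A minimal sketch of that weekend test, assuming Python; call_model is a placeholder for whichever provider SDK you choose, and the per-call cost and monthly volume figures are assumptions to replace with your own numbers.

```python
import csv
import random

EST_COST_PER_CALL_USD = 0.01     # assumption: check your provider's pricing
MONTHLY_VOLUME = 50_000          # assumption: your real task volume

def call_model(text: str) -> str:
    """Placeholder: wrap whichever provider SDK you choose (GPT, Claude, a hosted Llama)."""
    return "uncategorised"

def weekend_test(path: str = "historical_examples.csv", n: int = 1000) -> None:
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))       # columns: input, expected
    sample = random.sample(rows, min(n, len(rows)))

    hits = sum(
        1 for r in sample
        if call_model(r["input"]).strip().lower() == r["expected"].strip().lower()
    )
    spend = len(sample) * EST_COST_PER_CALL_USD
    print(f"accuracy={hits / len(sample):.1%} on {len(sample)} examples, spend=${spend:.0f}")
    print(f"projected monthly API cost at full volume: ${MONTHLY_VOLUME * EST_COST_PER_CALL_USD:,.0f}")

if __name__ == "__main__":
    weekend_test()
```

Either outcome is useful: a clear miss settles the argument cheaply, and a clear win gives you the baseline comparison and cost projection you need to make the case internally.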

The One Thing You Must Do

If you only do one thing from this entire book, do this:

Stop treating AI like traditional software procurement.

Start treating it like R&D with rapid hypothesis testing.

Requirements → RFP → install doesn't work anymore.

Hypotheses → experiments → operating model does.

The organisations winning aren't the ones with better vendors. They're the ones with better procurement and deployment models. They're hypothesis-driven, not requirements-driven. They're measuring unit economics, not checking boxes. They're architecting for perishability, not pretending models are durable. They're building AI literacy across the organisation, not treating it as an IT project.

None of this requires bleeding-edge technology or massive budgets. It requires discipline to do procurement differently when the technology is fundamentally different.

The Shift Starts Today

The 42% abandonment rate isn't inevitable. The 5% who succeed aren't lucky or smarter. They're doing procurement, architecture, and operations differently because the technology demands it.

Every chapter in this book pointed to the same truth: AI is perishable, probabilistic, and requires continuous experimentation. Traditional software procurement—requirements documents, multi-year contracts, feature checklists, one-time deployments—is catastrophically mismatched to that reality.

You can't RFP your way to AI maturity. You can't install-and-forget. You can't treat probabilistic capabilities like deterministic features. You can't lock into three-year contracts when models improve quarterly.

But you can do procurement that acknowledges perishability. You can build architecture that supports swappability. You can measure unit economics and kill failures fast. You can train AI literacy and redesign workflows. You can treat AI as a capability to cultivate, not a product to install.

The playbook exists. The tools are available. The question is: will you use them, or will you become part of the 42%?

The shift starts today. What will you do in the next 24 hours?

Chapter Takeaways

  • If evaluating vendors: stop the RFP, pick one boring use case, run 10-day test on real data, negotiate 30-day proof-of-value
  • If AI underperforming: add logging, calculate unit economics, implement triage—instrument before optimizing
  • If planning initiative: design three-layer architecture first, build evaluation harness, negotiate model-agnostic contracts
  • If leading procurement: adopt continuous cycles, implement gates not milestones, hire AI Product Owner, mandate AI literacy training
  • The one thing: stop treating AI like traditional software (requirements → RFP → install), start treating it like R&D (hypotheses → experiments → operating model)

Final Thoughts

The 42% abandonment rate isn't a technology problem—it's a procurement and operations problem. The 5% who win aren't using different AI. They're buying differently, building differently, and operating differently. This playbook gives you their patterns. The execution is up to you.

What will you do in the next 24 hours?

The Path Forward

Your instinct is right: the old RFP-then-install ritual is mismatched to AI's pace and variability, and a lot of "AI" on offer really is yesterday's product with today's buzzwords.

But the shift isn't from shelf to pure bespoke—it's from amortized compromise to composable bespoke on a stable foundation. Treat AI as a living capability with experiments, guardrails, and swappability baked in. Buy the infrastructure that benefits from scale. Generate the applications that need to fit your workflows perfectly.

The 5% who are winning didn't buy better AI. They bought differently—and more importantly, they built differently.