Voice AI’s Fork: Conversation Companies vs Authority Companies

Scott Farrell · July 5, 2026 · scott@leverageai.com.au

AI Strategy · Voice

“Voice AI” is one label stretched over two different businesses. One sells a conversation that sounds human; the other sells an action a human would otherwise have to authorize. The demo hides which one you’re buying — so learn to read the tells, because the value is migrating from sounding human to being authorized, and the physics of a live call decides who can make the crossing.

TL;DR

  • It’s a fork, not a category. Voice-enabled conversation prices the conversation and sells containment. Agents with authority price the resolution and need identity, authorization, escalation and an audit trail. Same demo, different companies.
  • The tells are on the homepage. Pricing unit (per-minute vs per-resolution), integration posture (“no deep integrations” is a confession), and script fidelity (“never goes off-script” means the script is the product) tell you which business you’re looking at in about thirty seconds.
  • The 50–70% ceiling is a wall, not a setting. That’s the edge of the shallow-action pool — answer, look up, route. Every point past it needs write-authority, and in regulated work that means verification, authorization, escalation and auditability, not a better prompt.
  • Latency makes context arbitrage mandatory. The natural turn-taking window is a few hundred milliseconds. The frontier model can’t answer inside it, so it’s banned from the hot path. Compiled per-client knowledge plus a fast utility model is the only way real intelligence enters a live call at all.
  • “Sounding human” is now a feature you rent. Native speech-to-speech ships from the model providers. If your moat was the voice, your moat is someone else’s default. The durable business is authority.

One name, two businesses

Put two voice-AI products side by side on a call and you often can’t tell them apart. Both pick up in half a second, both sound warm and unflustered, both handle the first ninety seconds beautifully. The demo is designed to make them look like the same thing at different stages of maturity — as if you’re choosing between an earlier and a later version of one product category.

You’re not. You’re standing at a fork. On one side is a company that has built voice-enabled conversation: its product is a fluent exchange, and it’s very good at holding one. On the other side is a company building agents with authority that happen to speak: its product is a resolved outcome — a refund issued, a plan changed, an appointment moved — and voice is just the interface it wears. These are not two rungs of one ladder. They are two businesses with different cost structures, different moats, and different reasons to exist. Telling them apart is the whole game, whether you’re buying or building.

Conversation companies

Voice-enabled conversation

  • Prices the conversation — per minute, per call, per contained session.
  • Sells containment: the call didn’t reach a human.
  • Treats “no deep integrations” as a selling point.
  • Promises it never goes off-script and stays on-brand.
  • Moat is sounding human — which is commoditizing from underneath.

Authority companies

Agents with authority that happen to speak

  • Prices the resolution — per outcome, per successful action.
  • Sells a result: the caller’s problem is actually fixed.
  • Treats deep integration as the price of doing the job.
  • Needs identity, authorization, escalation, auditability.
  • Moat is being authorized to act — which is where the value is going.

Here is the migration in one line, and it’s the thesis of this whole piece: the value in voice AI is moving from sounding human to being authorized. Sounding human used to be the hard part and the differentiator. It is becoming a checkbox. Being trusted to do something — to reach into a system of record and change it, on behalf of a verified person, with a receipt you can defend later — is the part that’s getting harder, more valuable, and more defensible. If you only remember one sentence, remember that the axis of competition has rotated.

Read the tells

You don’t need a technical audit to place a vendor on the fork. You need three questions, and a vendor’s own marketing usually answers all three before you get to the pricing page. I’ve started treating them as a checklist, because they’re that reliable.

| Tell | Conversation company | Authority company |
| Pricing unit | Per minute / per conversation / per contained call. The conversation is the product. | Per resolution / per successful action. You pay when something changed. |
| Integration posture | “No deep integrations needed”, sold as ease. It’s a confession. | Deep integration into systems of record, framed as the whole point. |
| Script fidelity | “Never goes off-script, always on-brand.” The script is the deliverable. | Talks about handling exceptions, edge cases, and when it hands off. The outcome is the deliverable. |

The middle row is the one people misread, so it’s worth slowing down on. When a vendor advertises “no deep integrations required” as a benefit, they are telling you the truth about their product: it doesn’t touch anything that matters. Deep integration is not friction to be engineered away — it is agency. An agent that can’t reach into your billing system cannot issue a refund; it can only talk fluently about how it would. “No integrations” isn’t a convenience feature. It’s the sound of a product that has quietly scoped itself to conversation and is hoping you’ll read that as simplicity. Sometimes conversation is genuinely all you need — deflect FAQs, take a message, route the call. Just don’t pay resolution prices for it, and don’t build a business plan that assumes it will grow into authority on its own. It won’t; the wall is real, and we’re about to hit it.
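If you want the checklist in executable form, here is a minimal sketch. The field names, the pricing-unit labels and the three-way verdict are my own illustrative assumptions, not a standard taxonomy; the only thing taken from the table is which answer points to which side of the fork.

```python
from dataclasses import dataclass

# Hypothetical labels for what a homepage says about pricing.
CONVERSATION_UNITS = {"per_minute", "per_conversation", "per_contained_call"}
AUTHORITY_UNITS = {"per_resolution", "per_successful_action"}


@dataclass
class VendorPage:
    pricing_unit: str                 # one of the labels above
    advertises_no_integrations: bool  # "no deep integrations needed"
    sells_script_fidelity: bool       # "never goes off-script" vs exception handling


def place_on_fork(page: VendorPage) -> str:
    """Return 'conversation', 'authority', or 'mixed' from the three tells."""
    conversation_tells = [
        page.pricing_unit in CONVERSATION_UNITS,
        page.advertises_no_integrations,
        page.sells_script_fidelity,
    ]
    authority_tells = [
        page.pricing_unit in AUTHORITY_UNITS,
        not page.advertises_no_integrations,
        not page.sells_script_fidelity,
    ]
    if all(conversation_tells):
        return "conversation"
    if all(authority_tells):
        return "authority"
    # Tells disagree: e.g. resolution pricing with no write-access behind it.
    return "mixed"
```

The "mixed" verdict is the interesting one: it is the case where the pricing and the capability disagree, which is exactly when a closer look pays off.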

A worked case: reading a homepage cold

Take a representative conversational-voice vendor — the archetype, not any one company — and read its homepage the way the checklist tells you to. The action list is all answer questions, capture details, book a slot, take a message. The pricing is per minute or per handled conversation. The integration section leads with “connect in minutes, no engineering required.” The trust section promises the assistant “stays perfectly on-brand and never goes off-script.”

What the page just told you

Every headline points the same way: this is a conversation company. It prices conversations, it sells containment, and it advertises the absence of the exact capability — deep write-access — that authority requires. A beautiful product, honestly described. Just not the one that will still be defensible when the voice itself is a rented feature.

Notice what the read did and didn’t do. It didn’t call anyone a fraud, and it didn’t need private information — the positioning is right there, stated proudly, because to the conversation company these are the strengths. The tells aren’t hidden; they’re the pitch. That’s what makes the checklist so cheap to run: you’re not catching anyone out, you’re just refusing to let a great demo blur a structural distinction. (This is why we cover this ground carefully and generically rather than naming and shaming a specific vendor — the point is the pattern, and the pattern is everywhere.)

The 50–70% ceiling is the edge of the shallow pool

Every conversation vendor eventually quotes a containment number, and it usually lands somewhere between 50% and 70% on well-scoped, transactional work.[1] That range is not a coincidence and it’s not a tuning problem you can prompt your way past. It’s the size of the shallow-action pool: the set of things you can do on a call without changing anything of consequence. Answer a question. Look up an order status. Take a message. Route to the right queue. That pool is genuinely large, and automating it is genuinely useful, but it has an edge, and the edge is at roughly 70%.

The first thing to get straight is that containment is not resolution. Containment measures whether a human touched the call. Resolution measures whether the caller got what they needed. A call can be contained and unresolved at the same time (the AI kept it away from a person and the customer still hung up without their problem fixed), and per-conversation pricing hides that gap, because the vendor got paid either way.[2] The metric that actually matters to you is resolution; the metric that flatters the conversation company is containment. When those two diverge, you’re paying for the wrong one.
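To make the gap concrete, here is a small sketch that computes both metrics from the same call log. The `Call` record, the sample of ten calls, the three-minute call length and the 10-cents-per-minute price are all made-up assumptions for illustration; none of them come from a real deployment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    reached_human: bool  # False means the call was "contained"
    problem_fixed: bool  # did the caller actually get what they needed?
    ai_minutes: int      # minutes the vendor bills for (assumed billing rule)


def containment_rate(calls: list[Call]) -> float:
    contained = sum(1 for c in calls if not c.reached_human)
    return contained / len(calls)


def automated_resolution_rate(calls: list[Call]) -> float:
    # Contained AND fixed: the number the buyer actually cares about.
    resolved = sum(1 for c in calls if not c.reached_human and c.problem_fixed)
    return resolved / len(calls)


def cents_per_automated_resolution(
    calls: list[Call], cents_per_minute: int
) -> Optional[float]:
    resolved = sum(1 for c in calls if not c.reached_human and c.problem_fixed)
    if resolved == 0:
        return None  # nothing was resolved, so there is no unit cost to quote
    billed = sum(c.ai_minutes for c in calls) * cents_per_minute
    return billed / resolved


# Hypothetical log: 4 contained and fixed, 2 contained but unresolved,
# 4 handed to a human. Every call is billed for 3 AI minutes.
calls = (
    [Call(False, True, 3)] * 4
    + [Call(False, False, 3)] * 2
    + [Call(True, True, 3)] * 4
)

print(containment_rate(calls))                    # what the vendor quotes
print(automated_resolution_rate(calls))           # what you got
print(cents_per_automated_resolution(calls, 10))  # what each fix really cost
```

In this invented sample the vendor reports 60% containment while only 40% of calls were resolved without a human, and the per-minute bill is spread over just those four resolutions.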

What sits past 70% is the deep pool, and it looks like this:

  • Refunds and credits — moving money, which needs authorization and a defensible record of who approved what.
  • Plan and account changes — writing to the system of record, which needs the caller’s identity actually verified, not merely claimed.
  • Payments and orders — irreversible actions, which need escalation paths for the cases that shouldn’t be automated at all.

Every one of those is a write, and every write in a regulated vertical drags four requirements in behind it: identity verification (is this person who they say they are), authorization (are they allowed to do this, is the agent allowed to do it for them), escalation (what happens when the answer is “a human must handle this”), and auditability (can you reconstruct and defend the decision months later). None of those are conversational skills. That’s why the ceiling holds: the conversation company built for fluency, and the things past 70% don’t reward fluency — they reward authority. This is the same terrain I mapped in Is Voice AI Ready for Inbound Calls? Not Yet. — the readiness pillars are, in effect, the list of what you must own before you’re allowed above the ceiling.
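The four requirements can be sketched as a single gate that every write has to pass through. This is an illustration under stated assumptions, not a compliance design: the `WriteRequest` fields, the delegated-action set and the 100.0 auto-approval limit are hypothetical, and real identity verification and authorization would live in your own systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Outcome(Enum):
    EXECUTED = "executed"
    DENIED = "denied"
    ESCALATED = "escalated"


@dataclass
class WriteRequest:
    caller_id: str
    identity_verified: bool  # verified, not merely claimed
    action: str              # e.g. "refund", "plan_change"
    amount: float


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, req: WriteRequest, outcome: Outcome, reason: str) -> None:
        # Auditability: every decision is receipted, including refusals.
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "caller_id": req.caller_id,
            "action": req.action,
            "amount": req.amount,
            "outcome": outcome.value,
            "reason": reason,
        })


DELEGATED_ACTIONS = {"refund", "plan_change"}  # hypothetical agent permissions
AUTO_APPROVE_LIMIT = 100.0                     # hypothetical policy threshold


def gated_write(req: WriteRequest, log: AuditLog) -> Outcome:
    if not req.identity_verified:                       # identity verification
        outcome, reason = Outcome.DENIED, "identity not verified"
    elif req.action not in DELEGATED_ACTIONS:           # authorization
        outcome, reason = Outcome.ESCALATED, "action not delegated to agent"
    elif req.amount > AUTO_APPROVE_LIMIT:               # escalation
        outcome, reason = Outcome.ESCALATED, "amount above auto-approve limit"
    else:
        outcome, reason = Outcome.EXECUTED, "within delegated authority"
    log.record(req, outcome, reason)
    return outcome
```

Notice that nothing in the gate is conversational. A more fluent model does not change a single branch.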

Latency bans the frontier model — and that makes context arbitrage mandatory

Now the physics, because the architecture of the authority company isn’t a design preference: it’s forced. Human turn-taking runs on gaps of roughly two hundred milliseconds; the natural conversational window is somewhere around 200–500 ms.[3] Push past a second and the caller feels they’re talking to a machine; push past two and they start talking over it and abandoning the call. That window is your entire budget for the whole round trip, and most of it is already spent before the model gets a word in. Speech recognition and endpointing eat 150–300 ms; text-to-speech eats another 100–200 ms before the first audio frame.[4] A naive pipeline that hands the turn to a frontier model and waits blows the budget on its own; cascaded stacks routinely landed at two to four seconds.[3]
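The budget arithmetic is simple enough to write down. The sketch below uses only the ranges quoted in the paragraph above; the function name and the choice to treat the budget as a plain subtraction (ignoring network hops and tool calls) are my simplifying assumptions.

```python
def model_budget_ms(window_ms: int, asr_ms: int, tts_ms: int) -> int:
    """Milliseconds left for the model once speech-in and speech-out are paid for."""
    return window_ms - asr_ms - tts_ms


# Ranges quoted in the text, as (low, high) in milliseconds.
WINDOW = (200, 500)  # natural turn-taking window
ASR = (150, 300)     # endpointing + speech recognition
TTS = (100, 200)     # text-to-speech, up to the first audio frame

# Widest window with the fastest speech stages.
best_case = model_budget_ms(WINDOW[1], ASR[0], TTS[0])
# Tightest window with the slowest speech stages.
worst_case = model_budget_ms(WINDOW[0], ASR[1], TTS[1])
# Stretching to the one-second point where the caller starts to notice.
one_second = model_budget_ms(1000, ASR[1], TTS[1])

print(best_case, worst_case, one_second)
```

Even the generous case leaves the model a few hundred milliseconds at most, and the tight case is already negative before the model has produced a token.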

So the frontier model is banned from the hot path. Not deprioritized — banned, by the clock. You cannot make a large reasoning model answer inside a few hundred milliseconds by wanting it to. Which raises the real question: if the smartest model can’t be in the live loop, how does any real intelligence get into the call?

There’s exactly one answer, and it’s the reason this article exists. You move the intelligence off the hot path and compile it ahead of time: a per-client knowledge layer, built offline by the expensive model when there’s no clock running, then read at call time by a cheap, fast utility model that fits inside the latency budget. The thinking happens before the call; the call just reads the result. This is context arbitrage — turning intelligence from a per-request operating expense into a compiled, reusable asset — except here it isn’t a clever cost optimization you choose. It’s the only door intelligence can walk through. Latency makes context arbitrage mandatory. The live call can only afford a lookup, so the intelligence has to already be sitting where the lookup lands.
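Here is a minimal sketch of that split between compile time and call time. Everything in it is a stand-in: `CountingModel` is a fake that only counts invocations and truncates text, the policy strings are invented, and a real compiled layer would be far richer than a dictionary. The point it illustrates is only where the expensive calls happen.

```python
from typing import Callable, Optional


class CountingModel:
    """Stand-in for an expensive frontier model; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, policy_text: str) -> str:
        self.calls += 1
        # Pretend "distillation": keep the first sentence of the policy.
        return policy_text.strip().split(".")[0] + "."


def compile_knowledge(
    policies: dict[str, str], model: Callable[[str], str]
) -> dict[str, str]:
    """Offline step: one expensive call per topic, no clock running."""
    return {topic: model(text) for topic, text in policies.items()}


def answer_on_call(compiled: dict[str, str], topic: str) -> Optional[str]:
    """Hot path: a lookup only. None means hand off rather than improvise."""
    return compiled.get(topic)


# Invented per-client policies.
policies = {
    "refund_window": "Refunds are allowed within 30 days. Exceptions need a manager.",
    "plan_downgrade": "Downgrades take effect next billing cycle. No fee applies.",
}

frontier = CountingModel()
compiled = compile_knowledge(policies, frontier)

# A thousand live turns later, the expensive model has not been touched again.
for _ in range(1000):
    answer_on_call(compiled, "refund_window")

print(frontier.calls)
```

The expensive model runs once per topic at compile time and zero times during the calls, which is the whole trade.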

The architecture: fast lane, slow lane, prefetch bridge

Once the frontier model is off the hot path, the shape of the authority product falls out almost deterministically. It’s two lanes and a bridge, and it’s the concrete form of the Fast-Slow Split applied to a live phone call.

One call, two lanes, one bridge

CALLER ─▶ FAST LANE  (holds the phone)
            · cheap utility model, cached, sub-second
            · answers, looks up, keeps the turn alive
                 │
                 │  caller identified
                 ▼
          PREFETCH BRIDGE
            · loads this caller's world BEFORE the turn that needs it
            · account, entitlements, compiled per-client knowledge
                 │
                 ▼
          SLOW LANE  (holds the authority)
            · gated writes: refunds, plan changes, payments
            · identity + authorization + escalation
            · every action receipted for audit

The fast lane holds the phone. It’s the cheap cached utility model, running sub-second, doing all the shallow-pool work and keeping the conversation feeling human. The slow lane holds the authority: the gated, verified, receipted writes that actually resolve things. It doesn’t have to be fast, because it doesn’t run in the turn-taking window — it runs behind a “let me take care of that for you” while the fast lane keeps the human company. The prefetch bridge is the piece people forget: the moment a caller is identified, you load their world — their account, their entitlements, the compiled knowledge relevant to them — before the turn that needs it arrives, so the slow lane’s answer is already waiting when the fast lane reaches for it. Deploy each lane where physics is on your side; that’s the Lane Doctrine in one sentence.
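The two lanes and the bridge can be sketched with `asyncio`. Treat this as a toy under loud assumptions: the sleeps stand in for real latencies, the keyword check stands in for intent detection, the function names are hypothetical, and the slow lane here skips the identity and authorization gating it would need in practice.

```python
import asyncio


async def load_caller_world(caller_id: str) -> dict:
    """Prefetch bridge: slow reads from systems of record (simulated)."""
    await asyncio.sleep(0.05)
    return {"caller_id": caller_id, "plan": "basic", "refund_eligible": True}


async def fast_lane_reply(utterance: str) -> str:
    """Fast lane: cheap, cached, keeps the turn alive (simulated)."""
    await asyncio.sleep(0.001)
    return f"Sure, I can help with that: {utterance}"


async def slow_lane_refund(world: dict) -> str:
    """Slow lane: the gated write. Real gating and audit are omitted here."""
    await asyncio.sleep(0.02)
    if world["refund_eligible"]:
        return "Your refund has been issued."
    return "I need to pass this to a colleague."


async def handle_call(caller_id: str, turns: list[str]) -> list[str]:
    # Start loading the caller's world the moment they are identified,
    # without blocking the conversation.
    prefetch = asyncio.create_task(load_caller_world(caller_id))
    transcript = []
    for utterance in turns:
        if "refund" in utterance.lower():
            world = await prefetch  # usually already finished by this turn
            transcript.append(await slow_lane_refund(world))
        else:
            transcript.append(await fast_lane_reply(utterance))
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(handle_call("c-42", ["Hi there", "I want a refund"])):
        print(line)
```

The design choice worth noticing is that `create_task` is called before the first turn, so the account load overlaps with the small talk instead of stalling the turn that needs it.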

This is also why “no integrations” is a dead end and not a starting point. The slow lane is integration — it’s the write-access, the identity checks, the escalation wiring, the audit log. A company that advertised the absence of all that hasn’t built a smaller version of the authority product. It has built a different product that stops at the bridge.

“Pre-trained personas” is prompt-stuffing in a nicer suit

One more tell, because it’s the most sophisticated-sounding one and it trips up good buyers. Vendors will offer “pre-trained personas configured for your brand,” and it sounds like they’ve built something bespoke and durable for you. Usually they’ve built a bigger system prompt. Per-client persona configuration, at vendor scale, is prompt-stuffing — brand tone, a few FAQs and some guardrails pasted into the context window per tenant. It’s fine for making the conversation sound like you. It does nothing to get you above the ceiling, because a persona has no authority; it’s a costume, not a capability.

The real version of “configured for your brand” is a compiled per-client knowledge layer — your policies, entitlements, edge cases and the answers to your hard calls, built offline into a structure the fast lane can read in a millisecond. That’s the same object context arbitrage produces, and it’s the thing the prefetch bridge is loading. A persona is read into the model’s prompt and forgotten at the end of the call. A compiled knowledge layer is an asset that sits beside the call and compounds. If a vendor’s answer to “how do you know my business” is a prompt, you’re on the conversation side of the fork. If it’s a compiled, versioned, per-client artifact, you’re looking at the real thing.

The question that sorts everyone

All of this reduces to one interview question — the one I’d ask a vendor, a founder, or myself before writing a line of code or a cheque:

Are you building voice-enabled conversation, or agents with authority that happen to speak?

There’s no wrong answer, only an honest one and a confused one. Plenty of real value lives on the conversation side (deflection, triage, after-hours coverage), and if that’s the business, price it as conversation, integrate lightly, and don’t promise resolutions you can’t write. But know that the ground is moving. Sounding human is depreciating to zero: native, expressive, low-latency speech-to-speech now ships straight from the model providers as a first-party feature.[5] If your entire moat was that your voice sounded real, your moat is now a default setting on someone else’s API. The business that survives the commoditization is the one holding the authority (the verification, the write-access, the receipts), because that’s the part a model provider won’t hand you for free, and the part your customers can’t get anywhere cheaper. The demo will keep looking the same on both sides of the fork. The economics won’t.

The tells, distilled

  1. Sort by authority, not by voice. The human-sounding axis is commoditizing. Which business is it — conversation, or authorized action?
  2. The pricing unit is the confession. Per-minute prices the conversation; per-resolution prices the outcome. The unit tells you what the vendor actually built.
  3. “No deep integrations” means no agency. Deep integration is the price of doing anything real. Advertised absence of it is a scope statement, not a convenience.
  4. 70% is a wall, not a knob. The ceiling is the edge of the shallow-action pool. Past it lives write-authority: verify, authorize, escalate, audit.
  5. Containment is not resolution. Measure whether the problem got fixed, not whether a human was avoided. Pricing that rewards containment will hide the gap.
  6. Latency bans the frontier model, so compile ahead of time. Fast utility model in the hot path, expensive model offline, compiled knowledge read at call time. Context arbitrage isn’t optional here — it’s the only door.
  7. Two lanes and a bridge. Fast lane holds the phone, slow lane holds the authority, prefetch loads the caller’s world before the turn that needs it.

Run the checklist on the next voice-AI page you read

Pick one voice vendor — one you’re evaluating, or your own product’s homepage — and answer three questions from its own marketing: what’s the pricing unit, how does it talk about integrations, and does it sell script fidelity or exception handling? Then place it on the fork. If it’s a conversation company charging resolution prices, or an authority company hiding behind “no integrations,” you’ll have found it in about a minute. Tell me what you found — I read every reply.

References

  [1] ETS Labs. “Voice Agent Automation in Contact Centers: What Works at Scale.” (“Well-configured deployments on transactional workflows commonly land 50–70%” containment; initial deployments in the right categories 60–80%, maturing to 75–85% with tuning on real call data.) etslabs.ai/blog/voice-agent-automation-contact-center · CloudTalk, “Key Metrics & KPIs for AI Voice Agents.” cloudtalk.io/blog/ai-voice-agent-kpis
  [2] Moveo.ai. “Containment Rate: the metric that reveals your AI’s performance.” (Containment, meaning no human touched the call, is distinct from resolution; a call can be “contained but unresolved.” The quality metric is Automated Resolution Rate: contained and satisfactorily resolved.) moveo.ai/blog/containment-rate
  [3] Tavus. “Latency in conversational AI: a testing guide for sub-second response.” (“Human turn-taking involves extremely short gaps — typically ~200 ms”; the natural window is ~200–500 ms, and cascaded ASR→LLM→TTS pipelines “imposed 2–4 second delays, far outside the natural window.”) tavus.io/blog/latency-of-response-in-conversational-ai
  [4] Time-to-first-audio budget: endpointing + ASR ≈ 150–300 ms and TTS ≈ 100–200 ms before the first audio frame; native speech-to-speech (OpenAI GPT-4o / gpt-realtime) reaches median ~320 ms voice-to-voice. OpenAI, “Introducing gpt-realtime.” openai.com/index/introducing-gpt-realtime · Salesforce Engineering, “How AI-Driven Testing Enabled Sub-Second Latency for Agentforce Voice.” engineering.salesforce.com
  [5] Native speech-to-speech now ships as a first-party model feature: OpenAI gpt-realtime (speech-to-speech, tool calls, SIP phone calling) and Google Gemini Live (native-audio, 90+ languages). “Sounding human” is collapsing into the model provider’s default. openai.com/index/introducing-gpt-realtime · ai.google.dev/gemini-api/docs/live-api

© 2026 Leverage AI, Scott Farrell. All rights reserved. This content is made available on a limited, revocable, read-only basis only. No licence or right is granted to copy, reproduce, republish, scrape, store, adapt, summarise, index, embed, or use this content to create derivative works, work product, deliverables, methodologies, training materials, prompts, templates, software, services, research, or commercial outputs, whether by humans or machines, without prior written permission. This restriction includes internal business use, client work, consulting, advisory, implementation, and any use in or for artificial intelligence, machine learning, data extraction, retrieval, evaluation, fine-tuning, or knowledge-base construction.