Engineering Field Note

Text Is the Model’s Home Turf

A field note on the pendulum between code and judgment — and what a year of building a self-repurposing content pipeline taught us about when to trust the model and when to trust the check.

A build retrospective from the LeverageAI engineering bench. First-person, expansive on purpose — the interesting part is the evolution, not any single component.

TL;DR

Use AI for judgment, deterministic code for ground truth — and be willing to swing a component either direction as the evidence dictates. The pendulum is the method.
Text is the model’s home turf. Every time we converted an image-shaped problem into a text-shaped one, the system got cheaper and more accurate at the same time.
The loop helped build itself. AI did the analysis, wrote the deterministic code, proposed the architecture, and tested its own choices — while the human held the few lines that mattered.

Where we started

It began with a small delight. We rendered a pull-quote from an article as a clean, branded image card (a screenshot of styled HTML) and the reaction was immediate: “woah, that’s cool, beautiful images straight out of the article.” The image was the product. Everything else was scaffolding around making more of them.

That framing — the image is the thing — quietly shaped the next year of decisions. Most of the journey is the story of unlearning it.

Act I — Ask the model: “is this a good image?”

The first instinct was the obvious one. We had images; we wanted the good ones. So we asked a multimodal LLM to look at each card and score it: is this a good image for social media? One to ten. Keep the high scorers.

It worked well enough to be encouraging and badly enough to be instructive. The model would look at a branded card and rate it — but it was rating a picture, and a picture of text is a strange thing to ask a language model to judge. We didn’t see the problem yet. We just saw numbers that were mostly sensible, and moved on.

The tweet came next, almost as an afterthought: alongside the image, generate a short post to carry it. An aside. A caption. That aside turned out to be the real product. We just didn’t know it for several more moves.

Act II — The meaning jump: give the model the why

The captions were flat at first, because the model only had the quote. It would paraphrase the card — dead weight, since the card already shows the words. The leap came from changing what context we fed it: instead of the quote alone, we gave it the article’s outline and the chapter the quote came from. Suddenly the model could see the quote as one move inside a larger argument. It stopped echoing and started surfacing the idea underneath — the consequence, the second-order effect, the uncomfortable implication the quote only hinted at.

That was the first big jump, and it set a pattern we’d repeat:

Quality came from giving the model the right representation to reason over — not from a cleverer instruction.

Act III — Generate-and-judge: score by the post it could become

Then a reframe that, in hindsight, dissolved the original problem. We stopped asking “is this quote good?” and started asking the model to do the task and judge its own work: draft the tweet you would actually post, then rate how well that tweet — alongside the card — would land.

The score became a byproduct of attempting the work. This is “generate-and-judge,” and it beats one-shot rating because the model can’t anchor on a number and rationalise backward; it commits to a real artifact first, reasons about it, and scores last. We enforced exactly that order in the output schema — draft_tweet first, final_rating last — so the model’s “thinking” genuinely precedes its verdict. (The instinct rhymes with self-consistency, where sampling and reconciling actual reasoning paths beats a single greedy guess.¹)
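A minimal sketch of that ordering constraint, assuming the model replies with a JSON object. Only draft_tweet and final_rating are named above; the middle `reasoning` field, the 1 to 10 range, and the function name `parse_judgement` are illustrative assumptions, not the pipeline's actual schema.

```python
import json

# Field order the reply must follow: artifact first, verdict last.
# Only draft_tweet and final_rating come from the text; "reasoning"
# is a hypothetical middle field added for illustration.
EXPECTED_ORDER = ["draft_tweet", "reasoning", "final_rating"]


def parse_judgement(raw: str) -> dict:
    """Parse a model reply and reject it unless the keys arrive in order."""
    # json.loads builds a dict in document order, so key order is preserved.
    reply = json.loads(raw)
    if list(reply) != EXPECTED_ORDER:
        raise ValueError(f"expected keys {EXPECTED_ORDER}, got {list(reply)}")
    rating = reply["final_rating"]
    # The 1-10 range is an assumption, borrowed from the Act I scorer.
    if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 10:
        raise ValueError(f"final_rating must be an integer from 1 to 10, got {rating!r}")
    return reply
```

Rejecting a reply whose keys arrive out of order is one way to make the "work first, verdict last" rule checkable rather than a hope expressed in the prompt.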

Something subtle had happened: the scorer and the writer became the same act. The “assessment” was now a real post, and we read its self-rating off the side. The text — the thing we’d treated as an aside — was now doing the heavy lifting.

Act IV — The image was the wrong thing to show the model

Here’s the realisation that turned the whole pipeline inside out.

We were still sending the rendered card image to the model so it could judge the “combo.” But a language model reads overlapping, clipped, or mangled text just fine: it parses the characters and never notices that, visually, the card is broken. So the model was blind to render defects. Worse, in head-to-head tests the image added almost nothing to the ranking (rank correlation was nearly identical with and without it), and when present it actually inflated the scores of junk: it would rescue a citation card or a half-rendered box that a human eye would reject at a glance.
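For readers who want to run the same kind of head-to-head, here is a small sketch of a rank-correlation comparison. The text does not say which statistic was used; Spearman's rho is assumed here, and the function names and example scores are invented for illustration.

```python
from statistics import mean


def average_ranks(scores: list[float]) -> list[float]:
    """Rank scores from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    # Undefined (division by zero) if either list is entirely tied.
    return cov / (var_a * var_b) ** 0.5


# Made-up scores for the same five cards, scored with and without the image.
with_image = [8, 6, 9, 3, 7]
text_only = [7, 6, 9, 2, 8]
rho = spearman(with_image, text_only)
```

With these invented scores rho comes out at 0.9. In a study like the one described, each scoring variant would be correlated against the same reference ranking and the two coefficients compared.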

There’s a known shape to this failure: language models tend to over-trust text that reads fluently — they reward low-perplexity, familiar-looking output regardless of whether it’s actually correct.² A mangled card whose characters still parse is exactly the kind of thing such a judge waves through. The same literature is why we never let the model be the sole arbiter of its own output and instead gate the render against ground truth.

The conclusion was uncomfortable but clean: for judging a card, the image is the wrong representation. Text is the model’s home turf. The image’s only real job was to tell us whether the render was any good — and a language model is precisely the wrong tool for that, because it’s too forgiving of visual garbage.

So we did the thing that felt backwards and was right: we stopped sending the image to the scorer at all. Scoring went text-only. The image became a render artifact to be validated separately, by something that actually cares about pixels.

The proof, later, vividly

A card had its attribution line overlapping the quote, the result of a stray negative margin in the source HTML. The vision model read “— Diana Hu, Y Combinator” perfectly and never flagged it. A deterministic OCR pass read the same card, and its word coverage against the known text dropped to 63%: caught. The “dumber” tool was the more accurate one, because it was the right tool.

The meta-theme: the pendulum between deterministic and AI

This is the through-line of the entire project, and it’s worth stating plainly, because we swung the pendulum both directions, more than once:

AI → deterministic. Judging whether a card rendered correctly started as an LLM vision task and moved to deterministic checks: a DOM probe (is it styled? does anything overflow?) plus OCR-coverage against the ground-truth text. Reliability and ground truth beat judgment here.
Deterministic → AI. Deciding which passages are interesting started as fourteen hand-written CSS-heuristic extractors and moved to an LLM. Judgment beats rules here, and the rules had become an endless game of whack-a-mole: every new article’s styling needed a new heuristic for coloured left-bars, grey callout boxes, reference-chapter exclusions, dedup, and on, and on.

The principle that emerged:

Use AI for nuance and judgment; use deterministic code for reliability and ground truth. The skill isn’t picking a side once — it’s knowing, for each sub-problem, which kind of problem it actually is, and being willing to move a component the other way when the evidence says so.

A corollary we kept relearning: a deterministic check is more trustworthy when it compares against something known. OCR alone is noisy. OCR compared to the source text we already have is robust — we’re not asking “what does this say?”, we’re asking “is the text we expect actually present, and is there foreign text that shouldn’t be?” That reframing turns a flaky tool into a reliable gate.

Image → text, wherever we can

The deeper pattern under “don’t send the image” is this: modern AI can do images, but it is not as nuanced with images as it is with text. Its reasoning is sharpest over text representations. So a recurring, high-leverage move was to convert image-shaped problems into text-shaped problems:

Judge the card → judge the quote text + the tweet about it (text).
Validate the render → OCR the card to text and compare to the source text (text vs text, deterministic).
Even the OCR choice followed this logic: we put the heavy lifting on Apple’s Vision framework running locally on an M4 — a deterministic, Neural-Engine OCR that reads clean rendered text (including the italic serif that lighter OCR engines miss entirely) and hands back text, which is the representation everything downstream actually wants.

Every time we turned a pixel question into a text question, the system got both cheaper and more accurate. That’s not a coincidence — it’s the model meeting us on its own ground.

The render-quality detour: fix it at the source

Validation taught us to detect bad renders. But detection invites a better question: why are they bad? The failures were almost all the source HTML’s responsive styling misfiring when we screenshotted a single card out of a full-page layout:

An attribution with a negative top margin that looks fine in the flowing page but overlaps the quote on an isolated card.
A callout set to float right at ~45% width on large screens that, at our wide screenshot viewport, floated out of its container and got clipped — the screenshot “clipping the wrong piece.”

The first fixes were CSS overrides (kill the negative margin, kill floats). The real fix was upstream and almost embarrassingly simple: render the card at a narrow ~700px viewport, below the framework’s responsive breakpoints. At that width the page renders in its single-column layout — none of the multi-column/float machinery fires, nothing escapes, and the screenshot clips exactly the right region. We stopped fighting the symptom and removed the cause. Same spirit as the rest of the project: don’t detect-and-patch what you can prevent.

And the validation gate itself got a second axis, because “is all the text there?” isn’t the whole question. “Is there extra text too?” matters just as much — a mis-clipped card can contain the right words plus a neighbour’s content. So the check became bidirectional: recall (is the quote present?) and precision (is anything foreign present?), both measured against the known card text. Two cheap, deterministic numbers that together describe “did this render correctly.”
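The two numbers can be sketched as a multiset comparison of words. This is an illustration under assumptions, not the pipeline's code: the tokenisation rule, the function names, and the example strings are all hypothetical, and no pass/fail thresholds are implied.

```python
import re
from collections import Counter


def words(text: str) -> Counter:
    """Lowercased word counts; punctuation and case are ignored."""
    normalised = text.lower().replace("\u2019", "'")  # fold curly apostrophes
    return Counter(re.findall(r"[a-z0-9']+", normalised))


def render_check(expected_text: str, ocr_text: str) -> tuple[float, float]:
    """Return (recall, precision) of the OCR words against the known card text."""
    expected, seen = words(expected_text), words(ocr_text)
    overlap = sum((expected & seen).values())  # multiset intersection
    recall = overlap / max(sum(expected.values()), 1)   # is the quote present?
    precision = overlap / max(sum(seen.values()), 1)    # is anything foreign present?
    return recall, precision


card = "Design the loop, not the prompt."
# A mis-clipped render that also captured a neighbour's text.
recall, precision = render_check(card, "design the loop not the prompt subscribe now")
```

In this invented case recall is 1.0 and precision is 0.75: every expected word is present, but two foreign words leaked in. A card clipped to "design the loop" would show the opposite pattern, recall 0.5 and precision 1.0.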

Act V — Let the AI pick the quotes (and the combinations)

With scoring text-only and rendering validated deterministically, the last bastion of brittle heuristics was selection itself — the fourteen extractors deciding what counts as a quotable passage. They were the original sin: deterministic code trying to find “interesting,” which is exactly the judgment that AI should own.

So we moved selection to the model. But the crucial design choice was what to show it. Not the rendered images (we’d learned that lesson). Not raw HTML (utility-class noise drowns the signal). Not markdown (it loses the visual signal — a coloured callout box becomes plain text, but the box is part of why a passage is interesting). The answer was a denoised semantic DOM:

strip the noise (scripts, styles, utility-class clutter, nav/footer),
give every content scope a stable #id,
annotate emphasis as a hint (blockquote, coloured box, left-bar, grey background),
and send the whole chapter — drop nothing.

This is the idea we kept coming back to: a “text vision” of the document. Not a picture for the model to look at, but a clean, structured map for the model to read — the way a text model actually perceives a chapter at its best. The denoiser’s only jobs are to clean and to label scopes; it makes no judgment about what’s interesting. We measured this: the denoised block list preserves 99.4–100% of the chapter’s words. It curates nothing. The model does all the discrimination.
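A rough sketch of such a denoiser using only Python's standard-library HTML parser. Everything specific here is an assumption: the set of scope tags, the `b1`, `b2` id scheme, and the class-name-to-hint table are invented for illustration. It is also deliberately simplistic: it reads emphasis only from a scope's own class attribute, and text outside the listed scope tags is lost, which a real "drop nothing" denoiser would have to handle.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer"}            # noise dropped entirely
SCOPES = {"p", "blockquote", "h1", "h2", "h3", "li"}   # assumed content scopes
# Hypothetical class-name fragments mapped to emphasis hints.
HINTS = {"callout": "coloured-box", "border-l": "left-bar", "bg-gray": "grey-background"}


class Denoiser(HTMLParser):
    """Collect content scopes as {id, tag, hints, text}; make no quality judgment."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in SCOPES and not self._skip_depth and self._current is None:
            classes = dict(attrs).get("class") or ""
            hints = [hint for key, hint in HINTS.items() if key in classes]
            if tag == "blockquote":
                hints.append("blockquote")
            self._current = {"id": f"b{len(self.blocks) + 1}", "tag": tag,
                             "hints": hints, "text": []}

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._current and tag == self._current["tag"]:
            # Collapse whitespace runs, then store the finished scope.
            self._current["text"] = " ".join("".join(self._current["text"]).split())
            self.blocks.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and not self._skip_depth:
            self._current["text"].append(data)


def denoise(html: str) -> list[dict]:
    """Return the chapter as an ordered list of id-tagged, hint-annotated blocks."""
    parser = Denoiser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

The output is a flat, ordered block list that can be serialised into the prompt. Note what is absent: no scoring, no filtering, no "is this interesting?" logic anywhere in the denoiser.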

Two more principles made it sing:

North-star prompting, not a checklist. We stopped enumerating “types” of good quotes and gave the model a single goal — “anything interesting enough that we could build a compelling tweet or post around it” — plus permission to favour recall, combine blocks into one card (a heading + a box, two related boxes), and overlap freely (the same line alone, and again inside a larger arc). Less prescription, more judgment.
The model returns references, not text. It selects by #id. We then extract the text and HTML from the DOM by those ids. This keeps the scored text, the rendered card, and the validation gate referencing the exact same bytes — they can’t drift, and the model never has to transcribe (the one thing it’s worse at than judging). Crucially, a selection is one or more whole scopes — never a fragment of a paragraph, because you can’t screenshot half a paragraph. Judgment lives in the model; the renderable truth lives in the DOM.
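The "references, not text" rule can be sketched as a small resolver. It assumes the denoised chapter is held as an ordered mapping of block id to source text; the function name, the id format, and the choice to emit blocks in document order are assumptions made for this example.

```python
def resolve_selection(block_ids: list[str], blocks: dict[str, str]) -> str:
    """Join the source text of the selected scopes, in document order.

    `blocks` maps block id -> source text and is assumed to be in document
    order (Python dicts preserve insertion order).
    """
    unknown = [b for b in block_ids if b not in blocks]
    if not block_ids or unknown:
        # The model may only point at scopes that exist; it never supplies text.
        raise ValueError(f"selection must name known block ids; unknown: {unknown}")
    position = {block_id: pos for pos, block_id in enumerate(blocks)}
    ordered = sorted(set(block_ids), key=position.__getitem__)
    return "\n".join(blocks[b] for b in ordered)


chapter = {"b1": "Heading", "b2": "Box text", "b3": "Another paragraph"}
card_text = resolve_selection(["b2", "b1"], chapter)  # a heading + box combination
```

Because the text is always looked up rather than returned by the model, a hallucinated or mistyped id fails loudly instead of producing a card whose words differ subtly from the source.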

The results were striking. On a single chapter the model returned ~20 thoughtful selections (singles, combinations, overlaps, with or without the heading), each justified against the north star in language that was genuinely social-media-literate (“the secretary line is standalone tweet gold,” “if either agent dies the state survives — the one-sentence proof of the second axis”). And the regression test we cared most about: against the existing high-rated quotes, the new selector independently rediscovered 98% of them, while adding all the combinations and framings the heuristics could never produce. The single “miss” was a transitional setup sentence that arguably shouldn’t be a card. The model wasn’t just keeping up with the rules; it was exercising better taste than them.
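A regression check of this kind can be sketched in a few lines. The matching rule used here (a known quote counts as rediscovered if it appears inside any new selection after lowercasing and whitespace normalisation) is an assumption, as are the function names; the text does not describe how matches were actually decided.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def rediscovery_rate(known_quotes: list[str], selections: list[str]) -> float:
    """Fraction of known-good quotes contained in at least one new selection."""
    picked = [normalise(s) for s in selections]
    found = sum(any(normalise(q) in p for p in picked) for q in known_quotes)
    return found / len(known_quotes)
```

Containment rather than equality matters here, because the new selector is allowed to return a known quote inside a larger combination.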

We also learned the context could be lean: outline + the chapter’s blocks was as good as outline + the entire article, at half the tokens. The chapter is already in front of the model; the outline gives it the frame; the rest was redundant.

The economics, because they shaped the design

None of this is free, and cost pressure repeatedly improved the architecture:

Scoring on a premium model quietly consumed our plan quota, which forced a real cost/quality study: a benchmark harness — with a persistent, keyed result store so we never re-pay for a test we’ve already run — comparing models on rank-correlation, cache behaviour, and tweet quality.
That study taught us the cost lives in input tokens and caching, not in the model’s cleverness — a large, constant context prefix reused across hundreds of quotes should be cached, and the cards’ images are a rounding error next to it.
It also pushed work onto the right machines: plan-funded routing for the heavy judgment, a free text-only model for the recall-favouring selection pass, a local Mac for OCR. The cheapest configuration was usually also the most accurate one, because cheapness came from using each tool on its home ground.
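The "persistent, keyed result store" idea can be sketched as a JSON file keyed by a hash of everything that defines a test. The choice of key fields (model, prompt, item), the file format, and the class and method names are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path


class ResultStore:
    """JSON-file cache keyed by a hash of (model, prompt, item)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if the file exists, so reruns cost nothing.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def key(model: str, prompt: str, item: str) -> str:
        payload = json.dumps([model, prompt, item], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_run(self, model, prompt, item, run):
        """Return the stored result, calling run(model, prompt, item) only on a miss."""
        k = self.key(model, prompt, item)
        if k not in self.data:
            self.data[k] = run(model, prompt, item)  # the only paid call
            self.path.write_text(json.dumps(self.data, indent=2))
        return self.data[k]
```

Here `run` stands in for whatever function actually calls the model and must return something JSON-serialisable. Changing the model or the prompt changes the key, so a new configuration is measured fresh while old results stay available for comparison.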

Cost wasn’t a constraint we suffered; it was a design force that kept pointing us toward “use the right representation on the right substrate.”

The recursive twist

Step back and the whole thing has a pleasing recursion. The source articles are themselves written with heavy AI assistance — drafted, researched, structured by models. And now the pipeline repurposes them aggressively, with AI, into other media — pull-quotes, cards, tweets, LinkedIn posts. AI-authored content, AI-repurposed, with deterministic rails where ground truth matters.

There’s a fitting resonance in the test article we kept using — “Designing Loops, Not Prompts.” Its thesis is that the unit of AI work has moved from the prompt to the loop: small systems that find the work, hand it out, check it, record what was done, and decide the next thing — while you watch instead of type. That is almost exactly what this pipeline became. We stopped writing prompts to extract quotes and started designing a loop that selects, renders, validates, scores, and posts — with the human holding the north star and the deterministic rails, and the AI holding the judgment. We built the thing the article was about, to repurpose the article itself.

The other recursion: AI built this, too

The product uses AI. But so did the building of it — and that turns out to be the deeper recursion. The AI in this project wore four hats, not one:

1. Analysis (LLM over text)

Scoring quotes, drafting the tweets and LinkedIn posts, selecting passages, judging which framings land — the model reasoning over text, which is its sweet spot. This is the part everyone pictures when they say “AI in the loop.”

2. Writing the deterministic code

The denoiser, the render check, the 700px-viewport fix, the Apple Vision service, the CSV plumbing, the auto-poster’s selection math — the AI wrote the non-AI parts too. The boring, reliable rails that the judgment runs on were themselves AI-authored. The system isn’t “AI where it’s smart, hand-written where it’s safe”; it’s AI on both sides of that line, used differently on each.

3. Recommendations — i.e. design

This is the one people underweight. Several of the bigger architectural moves came from the AI as a suggestion, not from a human spec. The “don’t let the model judge the image — OCR it and compare to the known text” reframe was an AI recommendation. So was a lot of the “convert the pixel problem to a text problem” instinct. Other big moves came from the human: “move quote selection to an LLM that breaks a chapter into interesting passages” was a human call. And some of the most important decisions were the human holding a line against the AI’s natural drift — most of all “send the whole chapter to the model; don’t pre-select what’s interesting in code” — which kept the design honest when the AI kept wanting to be helpful by filtering first. Design, it turns out, is a conversation: the AI proposes, the human disposes, and either can originate the idea that changes everything.

4. Testing and verification

Smoke tests before integration; wiring a new component in and running it end-to-end; checking the new approach back against the last run (the coverage test that proved LLM selection rediscovered 98% of the known-good quotes before we trusted it); benchmark harnesses with persistent result stores; head-to-head comparisons of a deterministic check versus an LLM decision to decide which tool owned which job; retries and back-offs around flaky model calls; A/B-ing the context to see what actually mattered. The AI didn’t just produce the system — it produced the evidence that each choice was right, and was willing to be proven wrong by it.

The honest summary of the collaboration: analysis, code, and recommendations — which together are just “design” — plus the testing that keeps design accountable. The human’s irreplaceable contributions were taste (what “interesting” and “lands” actually mean), direction (the north star, the lines held), and judgment about the judgment (deciding when to trust the model and when to trust the deterministic gate). Everything downstream of those calls, the AI could draft, build, test, and argue for.

AI helped design and build the loop — including the parts that aren’t AI — that repurposes the AI-written articles, with a human holding the few lines that mattered.

The unit of engineering really has moved from the prompt to the loop. This is what it looks like when the loop helps build itself.

Principles, distilled

A working list — the things we’d tell ourselves at the start:

1. Use AI for judgment, deterministic code for ground truth. And be willing to move a component either direction when the evidence says so. The pendulum is the method.
2. Text is the model’s home turf. Convert image-shaped problems into text-shaped problems; you’ll get cheaper and more accurate at the same time.
3. The image isn’t the product. It’s a render artifact. The text (the post) is the product. Validate the artifact deterministically; judge the product with the model.
4. Validate against what you know. A check that compares output to ground truth (OCR vs source text, both directions: present and not-foreign) beats a check that judges in a vacuum.
5. Prevent at the source over detect-and-patch. The narrow-viewport render fix removed a whole class of failures the validator had been catching.
6. Show the model a map, not a picture. A denoised, id-tagged, emphasis-annotated DOM is “text vision”: how a language model best perceives a document. Clean and label; don’t curate.
7. Generate-and-judge. Make the model do the task and rate its own work; the score is a byproduct of real effort, and the ordering (work first, verdict last) matters.
8. North-star prompts over checklists. Give the goal and the permissions (combine, overlap, recall-favour); trust the judgment. Prescription is the heuristic trap in prose form.
9. Reference, don’t transcribe. Let the model select by id; extract the bytes from the DOM. Judgment in the model, truth in the source, and they never drift.
10. Cost is a design force. Cheapness usually coincides with correctness, because both come from using the right tool on its home ground (cache the constant prefix; free model for the cheap pass; local OCR for pixels).
11. AI is a design collaborator, not just a runtime component. It does the analysis, writes the deterministic code, and proposes architecture, which together is design, then tests its own choices (smoke tests, regression against the last run, deterministic-vs-LLM bake-offs, back-offs on flaky calls). The human’s irreplaceable part is taste, direction, and the few lines worth holding (“send the whole chapter; don’t pre-select”). Either side can originate the idea that changes everything; keep the conversation honest with evidence.

The interesting part was never any single component. It was the evolution — the repeated, evidence-driven swing between code and model, between image and text, until each part of the system was finally doing the kind of work it was actually good at.

Your turn

Take one model-in-the-loop step you run today. Is its job judgment or ground truth? Are you handing the model an image of something whose text you already have? Could you replace its verdict with a comparison against known text — both directions, present and nothing foreign? If a rule list is doing judgment, or a model is guarding ground truth, you’ve found a component to swing.

If you’re building these loops, I’d love to compare notes — leave a comment or reach out.

References

This is a first-person build retrospective; the numbers above are our own measurements, not external research. The two references below are external work that genuinely strengthens specific claims — included lightly, and only where they belong.

[1] Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Sampling diverse reasoning paths and reconciling them beats a single greedy guess; the instinct behind “do the work first, score last.” arxiv.org/abs/2203.11171
[2] Wataoka, K., Takahashi, T. & Ri, R. “Self-Preference Bias in LLM-as-a-Judge.” Quoted finding: “LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated.” Why a model that parses mangled text waves it through, and why we gate the render against ground truth rather than letting the model be the sole judge. arxiv.org/html/2410.21819v1

A LeverageAI engineering field note. Related reading: Designing Loops, Not Prompts — on why the unit of AI work moved from the prompt to the loop.
