Field Note · Build Retrospective

The Skeleton of a Visual

Judging and generating images through their structure, not their pixels. A flattened, denoised DOM is to a rendered image what a tree-sitter skeleton is to code — structure without the bytes — and a language model reasons better over that skeleton than over the picture itself. I found this out twice: once by building it, and once by discovering I’d already published it.

By Scott Farrell · LeverageAI

TL;DR

Give the model the skeleton, not the picture. Render a card, extract its DOM, deterministically denoise and flatten it, and you get a text representation of a visual asset that still carries its structure. Judge and generate against that.
The shape of a thought is a first-class representation. Distinct from its prose and from its rendering — and models reason better over shape than over either extreme. Feed a generator the original structure (bullets, a table, a hierarchy), not a slab of prose, and both the image generator and the HTML/CSS generator get markedly better.
Deterministic checks first, AI judge second. On the flattened skeleton you can ask two cheap, exact questions — did the words survive verbatim, did anything clip — before any model is allowed an opinion.
The best argument for the tool is how I found it: I rediscovered, mid-build, a finding I’d already written up. The system whose whole job is “where did I have this thought” would have answered in one lookup.

The tell: I cited myself without noticing

I was deep in the quote-card pipeline — the machine that breaks an ebook into quotable fragments, renders each as a branded image, quality-checks it, and hands it to a social writer — when I hit a result that felt like a small discovery. I’d been sending the rendered card image to a model to ask “is this any good?”, and it kept being oddly generous about cards a human would reject at a glance. What worked far better was not sending the image at all: send a text proxy of what the image looked like, and the judgement got sharper and cheaper at once.

I said, out loud and pleased with myself, “I think I’ve read an article about this somewhere.” I had. It was mine. Text Is the Model’s Home Turf, published a few weeks earlier, is a whole field note about converting image problems into text problems and validating renders deterministically instead of asking a vision model to eyeball them. I had independently rediscovered my own published finding.

Which is funny, and also the single most useful thing that happened that week — because it is the exact argument for the system I was building around all of this: a personal wiki whose one job is “where did I already have this thought.” A working version would have answered in one lookup and saved me the re-derivation. Instead I got the lesson the expensive way, so let me at least bank the interest. Because the re-derivation didn’t just repeat the old finding — it promoted it. What had been a handy trick turned out to be a named tool with a second, larger use I hadn’t seen the first time.

What the skeleton actually is

Start with the move itself, because it’s mechanical and boring in the good way. A quote card is styled HTML that we screenshot with Playwright. To judge it, the naive path is to reason over the screenshot. The better path:

Render the HTML/CSS in a real browser (Playwright), so layout, wrapping and emphasis are resolved exactly as they will appear.
Extract the DOM of the rendered result — not the source HTML, the resolved tree. (Playwright will even hand you the accessibility tree of a page as structured text off the shelf; we go further.)³
Denoise and flatten it with deterministic code: strip scripts, styles, utility-class clutter and wrapper <div> soup; keep the content scopes; annotate the emphasis that survived rendering (this is a heading, this is a quote, this was a coloured callout).

What falls out is a compact, ordered, text representation of what the visual asset actually is — accurate, and, crucially, structured. It is not a description of the picture and it is not the source markup. It is the picture’s skeleton.
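A minimal sketch of the denoise-and-flatten step, under loud assumptions. The real pipeline reads the resolved DOM out of a Playwright-rendered page; here the standard-library html.parser walks a markup string instead, so the example runs on its own. The class markers (sc-kicker, sc-quote, sc-attr), the role names and the flatten helper are illustrative, not the pipeline's actual code.

```python
from html.parser import HTMLParser

# Hypothetical mapping from class-name markers to skeleton roles.
ROLE_BY_CLASS = {"sc-kicker": "KICKER", "sc-quote": "QUOTE", "sc-attr": "ATTRIB"}
SKIPPED_TAGS = {"script", "style"}          # content dropped entirely
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SkeletonBuilder(HTMLParser):
    """Collects ordered (role, fragments) scopes and ignores wrapper elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # one entry per open tag: the role it opened, or None
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.scopes = []      # ordered (role, [text fragments])

    def _in_scope(self):
        return any(self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        classes = (dict(attrs).get("class") or "").split()
        role = next((ROLE_BY_CLASS[c] for c in classes if c in ROLE_BY_CLASS), None)
        if role:
            self.scopes.append((role, []))
        elif tag == "em" and self._in_scope():
            self.scopes[-1][1].append("*")   # keep emphasis as a text hint
        self.stack.append(role)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if tag in SKIPPED_TAGS:
            self.skip_depth -= 1
        self.stack.pop()
        if tag == "em" and self._in_scope():
            self.scopes[-1][1].append("*")

    def handle_data(self, data):
        if self.skip_depth == 0 and self._in_scope():
            self.scopes[-1][1].append(data)


def flatten(html: str) -> list[tuple[str, str]]:
    """Return ordered (role, text) pairs with whitespace collapsed."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    return [(role, " ".join("".join(parts).split())) for role, parts in builder.scopes]


RAW = """
<div class="card flex flex-col gap-3">
  <style>.card { color: white }</style>
  <div class="sc-kicker uppercase"><span>Text Is the Model's Home Turf</span></div>
  <div class="sc-quote"><p>Convert an image
    problem into a text problem and you get cheaper
    <em>and</em> more accurate at the same time.</p></div>
  <div class="sc-attr"><span>Scott Farrell, LeverageAI</span></div>
</div>
"""

for role, text in flatten(RAW):
    print(f"{role:<8}{text}")
```

The design choice worth noting is that nothing here is probabilistic: the same markup always yields the same skeleton, which is what lets later checks be exact.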

The analogy that made it click is from the code world, because the code tools got here first. When an agent needs to understand a codebase, the trick isn’t to feed it every line — it’s to run tree-sitter over the source and emit a skeleton: signatures and structure, not full implementations, with the most-referenced symbols surfaced first so the model gets the shape before the detail.¹ The skeleton is smaller than the code, throws away the bytes, and is better to reason over than the raw file because attention lands on structure instead of drowning in tokens. The flattened DOM is precisely that for a rendered visual. A tree-sitter skeleton is to code what the denoised DOM is to an image: structure without the bytes.
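To make the code side of the analogy concrete, here is a small sketch that emits signatures and drops bodies. It swaps in Python's standard-library ast module for tree-sitter and does no ranking of symbols, so it illustrates the idea rather than reproducing the repository-map tool cited above. The function name code_skeleton and the sample source are made up for the example.

```python
import ast


def code_skeleton(source: str) -> list[str]:
    """Return class and function signatures only, in source order."""
    lines = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            indent = "    " * depth
            if isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}:")
                visit(child, depth + 1)          # descend to list methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in child.args.args)
                lines.append(f"{indent}def {child.name}({args}): ...")
                # Function bodies are deliberately not visited.

    visit(ast.parse(source), 0)
    return lines


SOURCE = '''
class Card:
    def render(self, html):
        return screenshot(html)

def score(card, source_text):
    total = 0
    return total
'''

print("\n".join(code_skeleton(SOURCE)))
```

The source is parsed, never executed, so the skeleton can be taken from code that would not even run.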

One card, three ways

Here is a single asset in its three representations, so the abstraction has something to stand on. First the rendered card — the pixels a human sees:

1 · Rendered (the pixels)

Text Is the Model’s Home Turf

“Convert an image problem into a text problem and you get cheaper and more accurate at the same time.”

— Scott Farrell, LeverageAI

Second, the raw resolved DOM — faithful, complete, and almost useless to reason over. Every framework wrapper, every utility class, every layout artefact is still in the way:

2 · Raw DOM (faithful, noisy)

<div class="card sc-1x9 flex flex-col gap-3 rounded-2xl">
  <div class="sc-kicker text-amber-400 tracking-widest uppercase">
    <span>Text Is the Model's Home Turf</span></div>
  <div class="sc-quote text-white"><p>“Convert an image
    problem into a text problem and you get cheaper
    <em>and</em> more accurate at the same time.”</p></div>
  <div class="sc-attr text-slate-400" style="margin-top:-6px">
    <span>— Scott Farrell, LeverageAI</span></div>
</div>

Third, the flattened skeleton — the same asset, denoised, with stable ids and the emphasis that survived rendering kept as hints, and nothing else:

3 · Skeleton (structure without pixels)

#kicker      KICKER   "Text Is the Model's Home Turf"
#quote       QUOTE    "Convert an image problem into a text
                       problem and you get cheaper *and* more
                       accurate at the same time."
#attr        ATTRIB   "— Scott Farrell, LeverageAI"
             emphasis: quote(role=headline), attr(role=byline)

The skeleton is the smallest of the three and the one a language model handles best. It kept the words, the order, the emphasis and the roles; it dropped the pixels and the plumbing. Everything that follows is just two things you can now do with it.

Use one: judge the render deterministically, then judge the meaning

The first payoff is quality assurance, and it inverts the obvious order. Text Is the Model’s Home Turf made the case that a vision model is the wrong tool to check your own rendered output: it reads mangled, overlapping, clipped text perfectly and never notices the card is visually broken. The canonical failure from that build still says it best. A card had its attribution line overlapping the quote (a stray negative margin, like the one left in representation two above), and the vision model read “— Diana Hu, Y Combinator” flawlessly and flagged nothing. A deterministic pass measured word coverage against the known text, saw it drop to 63%, and caught the break. The dumber tool was the more accurate one, because it was the right one.

The skeleton is what makes that gate clean, because it turns a picture question into two exact text questions you can answer before any model is consulted:

Did the words survive, verbatim? Recall — is every word of the source quote present in the skeleton, in order?
Did anything foreign creep in, or clip out? Precision — is there text that shouldn’t be there (a neighbour’s content the screenshot caught), or a scope that got truncated?

Both are string comparisons against text we already hold. This is the “can’t beats shouldn’t” discipline: we don’t ask the model whether the card is fine, we measure the card against the truth. Prompts are manners; a coverage check is physics. Only once a card passes the deterministic gate does an AI judge get to weigh in — and it judges the meaning (is this a compelling thing to say?), over the skeleton’s text, which is the job it’s actually good at. Ground truth to the deterministic layer; judgement to the model; pixels to neither.
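The two questions above can be sketched as plain string checks. This is one possible shape for the gate, not the pipeline's actual implementation: the function names, the tokenisation rule and the pass threshold are assumptions, and the in-order match is a simple greedy scan rather than an optimal alignment.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase word tokens; punctuation and emphasis markers are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("\u2019", "'"))


def in_order_recall(source: str, rendered: str) -> float:
    """Fraction of source words found in the rendered text, in order (greedy)."""
    source_words = words(source)
    rendered_words = words(rendered)
    position = 0
    found = 0
    for word in source_words:
        try:
            position = rendered_words.index(word, position) + 1
            found += 1
        except ValueError:
            continue  # missing or out of order; keep scanning from the same place
    return found / len(source_words) if source_words else 1.0


def foreign_words(source: str, rendered: str) -> list[str]:
    """Rendered words that the source does not account for (multiset difference)."""
    extra = Counter(words(rendered)) - Counter(words(source))
    return sorted(extra.elements())


def gate(source: str, rendered: str, min_recall: float = 1.0) -> dict:
    """Recall and precision checks; min_recall is an assumed default."""
    recall = in_order_recall(source, rendered)
    extra = foreign_words(source, rendered)
    return {"recall": recall, "foreign": extra,
            "passed": recall >= min_recall and not extra}


QUOTE = ("Convert an image problem into a text problem and you get "
         "cheaper and more accurate at the same time.")
CLIPPED = "Convert an image problem into a text problem and you get"
LEAKED = QUOTE + " Subscribe now"

print(gate(QUOTE, QUOTE)["passed"])
print(gate(QUOTE, CLIPPED)["passed"])
print(gate(QUOTE, LEAKED)["foreign"])
```

Running the gate per scope (kicker, quote, attribution) rather than over the whole card keeps a failure attributable to the element that broke.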

The promotion: from trick to named tool

This is where the re-derivation earned its keep. The first time around, “flatten the DOM to text” was a tactic buried inside a scoring pipeline. Rebuilding it, the tactic wanted a name, because it was clearly the same shape as an idea we’d already written up for a different medium.

We have a piece about a resolution ladder — the idea that you should read (and reason) at the coarsest resolution that still answers the question, and only descend when you must. The rendered image is the bottom rung: maximum detail, maximum faithfulness, and the worst place for a language model to think. The skeleton is a rung up: less faithful, more legible, the resolution at which the model’s judgement is sharpest. Faithfulness is not usefulness. The most complete representation of a visual is the last one you want the model reasoning over.

So the DOM-denoise move stops being a trick and becomes a tool with a name — the skeleton of a visual — and the name pays off immediately, because once you have it you notice it applies to something the first article never touched: generation, not just review.

Use two: generate from the shape of the thought

The quote cards are made by two generators running in parallel. One is a direct image generator — a multimodal model asked to produce the card as an image. The other goes text → HTML/CSS → Playwright screenshot: a model writes styled markup, we render and photograph it. The second path makes an image that’s almost as good as the direct one — and I’ll come back to why “almost” is the wrong thing to optimise.

Early on, both generators were mediocre in the same way, and for a reason I didn’t see until I named the skeleton. I was handing each of them the quote as a slab of prose — the words, run together, stripped of whatever gave rise to them. But that quote hadn’t been born as prose. In the ebook it might have been the punchline under a heading, or one row of a table, or the third bullet in an escalating list. When you get a slab of text, you can’t see the shape of the thought. And the shape is most of what makes a visual good: a table wants to be laid out as a table, an escalation wants to build, a heading-plus-line wants the heading.

So I gave both generators the original structure of the quote — the bullets, the table, the hierarchy it came from — alongside the words. The same skeleton idea, pointed upstream: not “here is what the finished card looks like” but “here is the shape the content had.” Both generators got markedly better. The direct image generator composed more like the thought was actually built; the HTML/CSS generator produced layouts that echoed the source structure instead of flattening everything into a centred quote. Neither needed a cleverer prompt. Both needed a better representation to reason over — the same lesson as judging, running the other direction.
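As an illustration of handing a generator the shape rather than a slab, the sketch below builds two briefs from the same quote. Everything in it is hypothetical: the QuoteSource fields, the shape labels and the brief layout are invented for the example, and the actual prompt format used in the pipeline is not shown in this note.

```python
from dataclasses import dataclass, field


@dataclass
class QuoteSource:
    quote: str                      # the words that must appear verbatim
    shape: str                      # e.g. "heading", "bullets", "table" (assumed labels)
    heading: str = ""               # heading the quote sat under, if any
    items: list[str] = field(default_factory=list)  # sibling bullets or rows, in order


def slab_brief(src: QuoteSource) -> str:
    """The structure-free input: just the words."""
    return f'Make a quote card for: "{src.quote}"'


def structured_brief(src: QuoteSource) -> str:
    """The same quote, with the structure it came from carried alongside."""
    lines = [f"SHAPE    {src.shape}"]
    if src.heading:
        lines.append(f"HEADING  {src.heading}")
    for index, item in enumerate(src.items, start=1):
        marker = ">>" if item == src.quote else "  "   # mark the quoted item
        lines.append(f"ITEM {index} {marker} {item}")
    lines.append(f'QUOTE    "{src.quote}"')
    return "\n".join(lines)


example = QuoteSource(
    quote="Only then ask the model.",
    shape="bullets",
    heading="Deterministic checks first",
    items=["Did the words survive?", "Did anything clip?", "Only then ask the model."],
)

print(slab_brief(example))
print(structured_brief(example))
```

The structured brief tells the generator that the quote is the third step of an escalation, which the slab version cannot express.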

That lands the claim this whole field note is really about: the shape of a thought is a first-class representation, distinct from both its prose and its rendering. Prose is one projection of it; the rendered card is another; the skeleton is the thing they’re both projections of. Models reason better over the shape than over either extreme, and that holds whether you’re asking them to judge a visual (flatten the render down to the shape) or to make one (feed the shape forward instead of a slab). It is not a coincidence that structure-preserving representations keep winning: in code retrieval, chunking that respects the AST reaches about 70% Recall@5 (the right context lands in the top five results), against about 42% for chopping the file into fixed-size blocks.² Respect the structure and everything downstream gets more accurate. Destroy it into a slab and you’re asking the model to reconstruct what you threw away.

Why “almost as good” is the wrong scoreboard

Back to the two generators. On raw visual appeal, the direct image generator often edges it. But the HTML/CSS-and-screenshot path wins the thing that actually matters, because you own its DOM. You can flatten what it produced back into a skeleton and run the exact deterministic gate from Use One on it (did the words survive, did anything clip), because it’s the same representation coming and going. A directly generated image is a dead end: to QA it you’re back to pixels and vision, the very trap we climbed out of. The screenshot path closes the loop: generate, flatten, verify, and when a card fails, regenerate rather than patch. It can do that because the skeleton is the common currency at both ends. “Almost as good” but fully inspectable beats “slightly better” but opaque whenever you have to run the pipeline a thousand times unattended.
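The generate, flatten, verify, regenerate loop can be sketched with the three stages passed in as functions. The signatures below are assumptions made for the example, and the stand-in stages (a generator that clips on its first attempt, a tag stripper, a word-equality check) exist only so the loop can run without a model or a browser.

```python
import re
from typing import Callable, Optional


def generate_verified(
    source_text: str,
    generate: Callable[[str, int], str],   # hypothetical: (text, attempt) -> HTML
    flatten: Callable[[str], str],         # hypothetical: HTML -> skeleton text
    verify: Callable[[str, str], bool],    # hypothetical: (source, skeleton) -> pass?
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate from scratch until the skeleton passes the gate, or give up."""
    for attempt in range(1, max_attempts + 1):
        html = generate(source_text, attempt)
        if verify(source_text, flatten(html)):
            return html
    return None  # nothing passed; the caller decides what to do with the card


# Stand-in stages so the sketch runs on its own.
attempts_seen = []


def fake_generate(text: str, attempt: int) -> str:
    attempts_seen.append(attempt)
    body = text[:20] if attempt == 1 else text   # the first attempt clips the quote
    return f"<div class='sc-quote'><p>{body}</p></div>"


def strip_tags(html: str) -> str:
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def same_words(source: str, skeleton: str) -> bool:
    return source.split() == skeleton.split()


QUOTE_TEXT = "Give the model the skeleton, not the picture."
card = generate_verified(QUOTE_TEXT, fake_generate, strip_tags, same_words)
print(card)
```

Note that a failed card is thrown away and regenerated, never patched, so every card that leaves the loop has passed the same check end to end.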

That inspectable loop is the whole point of designing a loop rather than a prompt: the part that makes it trustworthy is the verify phase, an adversarial check against something real. The skeleton is what gives the verify phase something real and cheap to check against — on both the images you judge and the images you generate.

Principles, distilled

Show the model the skeleton, not the picture. Render, extract the DOM, denoise and flatten to structured text. Judgement is sharpest one rung up from the pixels.
The shape of a thought is first-class. Prose and rendering are both projections of it. Reason over the shape — to judge and to generate.
Generate from structure, not from a slab. Feed generators the bullets, the table, the hierarchy the content came from. A prose slab hides the shape that made it good.
Deterministic gate before AI judge. On the skeleton, ask the exact questions first — words verbatim? anything clipped? — then let the model judge meaning.
Faithfulness is not usefulness. The most complete representation of a visual is the worst one to think over. Pick the resolution that answers the question.
Prefer the path whose output you can flatten back. Text → HTML/CSS → screenshot is “almost as good” and fully inspectable; a directly generated image is a dead end for QA. Close the loop with the skeleton at both ends.
Build the “where did I have this thought” system. The best evidence you need it is that you’ll rediscover your own findings without it. Ask me how I know.

If your pipeline produces or judges visual assets: take the one place you currently hand a model an image of your own output — to score it, to QA it, or as the thing you generate — and ask whether you could hand it the structure instead. Flatten one render to a skeleton and check the words against the source. Then tell me what it caught — I read every reply.

References

[1] Aider. “Building a better repository map with tree-sitter.” tree-sitter parses source into an AST; a PageRank-ranked map surfaces “the structural skeleton before the details,” rendering “just signatures and structure, not full implementations.” aider.chat/docs/repomap.html · aider.chat/2023/10/22/repomap.html
[2] Code-retrieval chunking benchmarks (Cursor / Aider / Roo indexing research). “AST chunking achieves 70.1% Recall@5 vs 42.4% for fixed-size”; retrieval quality peaks at 5–10 structure-aware chunks and degrades on large flattened blocks. Structure-preserving representations beat flattened ones.
[3] Playwright. “Aria snapshots” / accessibility tree. locator.ariaSnapshot() and the older page.accessibility.snapshot() (deprecated in current Playwright releases) produce a deterministic text (tree) representation of a rendered page from its accessibility tree; turning a render into structured text is off-the-shelf. playwright.dev/docs/aria-snapshots
