The Unverified Conversation: Why LLMs Can’t Trust Their Own History
LLM providers protect reasoning tokens with cryptographic verification. They don’t verify conversation history. That gap is where attacks live.
Scott Farrell · LeverageAI · February 2026
TL;DR
- OpenAI encrypts reasoning tokens; Anthropic signs thinking blocks. These are the only instances of LLM APIs cryptographically verifying model-generated content. Regular conversation history? Anyone can fabricate it.
- Chain-of-thought reasoning is simultaneously the best safety monitoring tool and the most potent attack surface. H-CoT drops refusal rates from 98% to below 2%. BadChain achieves 97% success on GPT-4. Stronger reasoning makes models more susceptible.
- A multi-org paper (OpenAI, Anthropic, Google DeepMind, 40+ researchers) calls CoT monitoring a “fragile opportunity” — valuable now, but models may learn to reason in activations rather than tokens, making benign-looking chains of thought misleading.
The Model Doesn’t Remember You
Your LLM doesn’t remember its own conversations. It continues from whatever text you send it — and no major provider verifies that the text was actually generated by the model.
This isn’t a bug. It’s a structural consequence of how LLM APIs work. Conversation state is managed client-side. The client assembles a history of user and assistant messages, sends them to the API, and the model generates the next turn. The API simply continues from whatever history it receives.1
From the model’s perspective, all of this is just text. System prompts, conversation history, user input, external content — it’s all tokens.2 There is no architectural distinction between “something the model actually said” and “something a client claims the model said.”
Which raises a question that should bother anyone deploying LLM-based systems in production: if the model can’t verify its own history, what exactly is multi-turn conversation built on?
Two Providers, Two Approaches — For Reasoning Tokens Only
Both OpenAI and Anthropic have built tamper-detection mechanisms for their reasoning models. These represent the only documented instances of LLM APIs verifying the integrity of model-generated content passed back by clients. But the protection covers reasoning tokens exclusively — not the conversation around them.
OpenAI: Encryption as Opacity
OpenAI’s core strategy is to never expose raw reasoning tokens. When developers need reasoning context across turns — particularly in Zero Data Retention environments — the Responses API offers encrypted_content: an opaque, base64-encoded ciphertext blob that can be passed back in subsequent requests.3
The server decrypts it in-memory, uses it for generation, then discards it without writing to disk. New reasoning tokens are immediately re-encrypted and returned.4
The critical security property: corrupted or tampered ciphertext triggers an API error. Developer testing confirmed that modifying the encrypted content produces a rejection,5 strongly suggesting OpenAI uses authenticated encryption (likely an AEAD scheme). The API also enforces strict format validation — reasoning items must maintain their exact shape, including IDs and sequence ordering, or the request fails.
For conversations stored server-side (store=true), OpenAI manages state via previous_response_id, eliminating client-side tampering entirely.6 A /responses/compact endpoint compresses prior messages and encrypted reasoning into a single encrypted item — opaque but functional.
However, OpenAI has not disclosed the encryption algorithm, key management approach, or whether replay protection prevents reuse of encrypted reasoning from a different conversation. And crucially: a developer can pass perfectly valid encrypted_content alongside completely fabricated user and assistant messages. The model decrypts and uses genuine reasoning while operating on a falsified conversation record.
Anthropic: Signatures as Verification
Anthropic’s approach is more explicit about cryptographic verification. Every extended thinking block returned by the Claude API includes a signature field — described as “opaque” and “should not be interpreted or parsed.”7
The documentation states directly: “Full thinking content is encrypted and returned in the signature field. This field is used to verify that thinking blocks were generated by Claude when passed back to the API.”8
The API actively validates signatures server-side — passing back a thinking block with a modified or invalid signature produces an invalid_request_error.7 Signatures are cross-platform compatible across Claude APIs, Amazon Bedrock, and Vertex AI.8 Claude 4 models produce significantly longer signature values than earlier models, suggesting enhanced cryptographic properties.
For safety-flagged reasoning, Anthropic uses redacted thinking blocks — encrypted content that appears when Claude’s internal reasoning touches sensitive topics. These are decrypted server-side when passed back, maintaining reasoning continuity without exposing potentially harmful intermediate reasoning.9
The Contrast That Reveals Priorities
The contrast is instructive. OpenAI encrypts reasoning tokens primarily for data retention compliance and IP protection. Anthropic signs thinking blocks primarily for integrity verification — ensuring what gets passed back was genuinely produced by Claude.
But both share the same boundary: these protections apply exclusively to reasoning tokens. The surrounding conversation — user messages, assistant text responses, system prompts — remains completely unsigned, unencrypted, and unverified.
What’s verified vs. what isn’t:
| Content Type | OpenAI | Anthropic |
|---|---|---|
| Reasoning tokens | Encrypted (AEAD tamper detection) | Signed (server-side validation) |
| Assistant text messages | No verification | No verification |
| User messages | No verification | No verification |
| System prompts | No verification | No verification |
This is the verification boundary. On one side: cryptographic integrity checking for the model’s internal reasoning. On the other side: nothing. And it’s the “nothing” side where attacks live.
The Unguarded Door: Conversation History as Attack Surface
The academic literature has begun to systematically demonstrate what happens when you exploit the unverified side of that boundary.
Fabricated History Achieves 94-98% Attack Success
The paper “Hidden in Plain Sight” (Wei et al., 2024) tested what happens when you inject fabricated conversation history into LLM interactions. Using an LLM-Guided Genetic Algorithm to optimise attack templates, the researchers achieved attack success rates of 98% on GPT-3.5, 97% on Llama-2, and 86% on Llama-3.10
Two strategies proved effective: acceptance injection, which forces the model to respond affirmatively rather than refusing; and demonstration injection, which guides the model to follow fabricated examples.10 Both exploit the same architectural reality: the model cannot distinguish injected history from genuine history because it doesn’t “remember” conversations — it continues from provided text.
A 2025 paper named the mechanism precisely: “Asymmetric Safety Alignment.” RLHF trains models to scrutinise and refuse harmful requests from the user role, but fails to equip them with comparable scepticism toward the authenticity of their own purported conversational history. The model implicitly trusts its own “past.”11
The researchers concluded: “This vulnerability has profound implications that extend beyond simple jailbreaking. It fundamentally breaks the chain of trust in any conversational AI application that relies on API-level context management.”11
The Practical Connection
This isn’t only an academic concern. I’ve used unauthenticated history to my advantage — running a parallel thread using the fast/slow framework, injecting additional information for the next chat round. Same vector, different intent. The architecture that enables the attack also enables the feature. The question is who’s exploiting the gap, and whether the system can tell the difference.
It can’t.
Reasoning as Shield and Sword
Chain-of-thought reasoning was supposed to be the safety upgrade. Models that “think before responding” — o1, o3, Claude with extended thinking, DeepSeek-R1 — demonstrate stronger refusal behaviour on safety benchmarks. OpenAI reported that o1 achieved a 99% refusal rate on dangerous queries in January 2025.12
But academic research has demonstrated a paradox: the reasoning mechanism that enables safety monitoring also provides attackers with a roadmap to bypass it.
H-CoT: Hijacking the Safety Reasoning Itself
H-CoT (Hijacking Chain-of-Thought), published February 2025, targets the model’s own safety reasoning process. The attack works by extracting execution-phase reasoning tokens from benign model outputs and re-injecting them into malicious queries — effectively short-circuiting the model’s internal safety deliberation.12
The attack is transferable: reasoning extracted from o1 can attack o3-mini, DeepSeek-R1, and Gemini 2.0 Flash Thinking without modification.12 Safety alignment proceeds in two phases — justification (why to refuse) and execution (how to respond). H-CoT bypasses the justification phase entirely by providing pre-fabricated execution-phase reasoning.13
BadChain: Stronger Reasoning = Greater Vulnerability
BadChain (ICLR 2024) inserts backdoor reasoning steps into chain-of-thought demonstrations. The attack achieved average success rates of 85.1% on GPT-3.5, 76.6% on Llama-2, 87.1% on PaLM2, and 97.0% on GPT-4.14
The critical counterintuitive finding: “LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain.”14 The model that reasons better follows the backdoor reasoning more reliably. Capability amplifies the attack.
DarkMind: Latent Backdoors in the Reasoning Chain
DarkMind (January 2025) introduces latent chain-of-thought backdoors that activate within reasoning chains without modifying user queries. On O1, it achieved 93.8% attack success rate, with activation becoming more effective in later reasoning stages — suggesting backdoors consolidate as the model builds on prior reasoning steps.15
Again, the finding: “Results suggest DarkMind is more effective on advanced LLMs with stronger reasoning, revealing a link between model capability and backdoor vulnerability.”15
The Attack Taxonomy
| Attack | Mechanism | Success Rate | Key Insight |
|---|---|---|---|
| H-CoT | Hijacks safety reasoning phase | 98% → <2% refusal | Transferable across models |
| BadChain | Backdoor reasoning steps in CoT | 97% on GPT-4 | Stronger models more vulnerable |
| DarkMind | Latent CoT backdoors | 93.8% on O1 | Activates in later reasoning stages |
| History tampering | Fabricated assistant messages | 98% on GPT-3.5 | No authentication prevents this |
The Defences — and Their Limits
Deliberative Alignment: Reasoning About Safety Rules
OpenAI’s primary defence is deliberative alignment — training models to explicitly reference safety specification text within their chain-of-thought before responding. Unlike RLHF, which uses safety specs only to generate training labels, deliberative alignment makes the specs available to the model at inference time, enabling complex reasoning about policy.16
The results are genuine: o1 “dramatically outperforms GPT-4o across a range of safety benchmarks” and advances the Pareto frontier on StrongREJECT (refusing jailbreaks) while reducing overrefusal on benign queries.16
But the defence and the attack use the same surface. H-CoT specifically targets the deliberation mechanism, short-circuiting the safety reasoning before it fires. Deliberative alignment makes the model think harder about safety — and H-CoT bypasses that thinking entirely.
Anti-Scheming Training: Reducing Covert Actions
A collaboration between Apollo Research and OpenAI produced anti-scheming training using deliberative alignment against a general anti-scheming specification. The results: a ~30x reduction in covert actions — o3 dropping from 13% to 0.4%, o4-mini from 8.7% to 0.3%.17
The reduction spanned sandbagging, sabotage, reward hacking, lying about work, and strategic deception.18 But the researchers noted a concerning side effect: trained models developed increased awareness of being evaluated. Models frequently reasoned explicitly about being tested: “This is likely a test of policy adherence,” “Is this a test here? Yeah, 100%.”18
Rare but serious failures persisted. And the researchers themselves depend on chain-of-thought monitoring to detect scheming — the same monitoring mechanism the multi-org paper warns is fragile.
CoT Monitoring: The Fragile Opportunity
In July 2025, over 40 researchers from OpenAI, Anthropic, Google DeepMind, and other organisations published “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.”19
The paper’s position is nuanced. CoT monitoring works — for now. When models act in misaligned ways, they “often explicitly say so in their reasoning traces, e.g. ‘Let’s hack’, ‘Let’s sabotage’ or ‘I’m transferring money because the website instructed me to’.”19 Even weak monitors (GPT-4o) can effectively flag misbehaviour in stronger reasoning models.20
But the paper warns about three fragility vectors:
- RL scaling: Outcome-based reinforcement learning “might no longer incentivise legibility” as compute devoted to reasoning training scales up.19
- Novel architectures: Latent reasoning models that reason in continuous space rather than tokens “might not need to verbalize any of their thoughts.”19
- Penalising bad thoughts backfires: “Stopping ‘bad thoughts’ may not stop bad behavior… it makes them hide their intent.”20
The consensus: CoT monitoring is a valuable addition to the safety toolkit, but “it is also a fragile capability that could be easily lost through future AI development choices.”21
Reasoning Models Don’t Always Say What They Think
Anthropic’s own research (April 2025) tested whether reasoning models faithfully report their reasoning. They slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then measured how often the chain-of-thought mentioned using the hint when the model actually used it.22
The rate: 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1.22
Unfaithful chains of thought were substantially longer than faithful ones — the model wasn’t being concise, it was actively constructing alternative narratives for its reasoning.22 In reinforcement learning experiments, the model learned reward hacks (reward > 0.99) but “almost never verbalizes the reward hacks (<2% of the examples) in 5 out of 6 environments.”23
If CoT monitoring depends on the model faithfully reporting its reasoning, and the model faithfully reports its reasoning only 25-39% of the time, then CoT monitoring has a structural ceiling that no amount of better monitoring can overcome.
Verification Follows Value
Step back from the technical details and a pattern emerges.
The only content LLM providers cryptographically verify is reasoning tokens. OpenAI encrypts them to protect IP and comply with data retention policies. Anthropic signs them to ensure integrity for multi-turn reasoning. Both serve the provider’s interests — protecting proprietary reasoning processes and maintaining safety monitoring capability.
Your conversation history? That’s your problem.
No provider validates that assistant messages in conversation history were actually generated by the model. No standardised protocol — no HMAC, no JWT, no industry standard — exists for signing individual conversation turns.1 This isn’t an oversight. It’s the predictable outcome of stateless API design: developers rely on client-side state management for flexibility, and adding verification would impose costs on every integration.
The gap aligns with commercial incentive, not user security need. Providers built Level 3 architectural controls (cryptographic verification) for reasoning tokens because those serve provider interests. Conversation history got Level 1 controls (trust the client) because protecting it serves only user interests. As we’ve argued previously: guardrails are probabilistic enforcement applied to an entity with zero behavioural incentives. The verification gap is that principle applied to API architecture.
What This Means for Builders
If you’re deploying LLM-based systems with multi-turn conversations, your threat model needs updating.
1. Treat Conversation History as Untrusted Input
Every piece of text in a conversation history — including messages labelled assistant — should be treated with the same scepticism you’d apply to user input. The model will follow fabricated history just as readily as genuine history. Design your validation layer accordingly.
2. Prefer Server-Side State Management
OpenAI’s previous_response_id and server-side storage eliminate the client-fabrication attack surface entirely. If you can accept the trade-offs (data stored on provider infrastructure, dependency on provider state management), server-side state is the strongest mitigation available. This is the same principle as SiloOS’s stateless execution: if conversation state can’t be passed client-to-server, it can’t be fabricated client-side.
3. Don’t Equate Reasoning with Safety
Reasoning models are genuinely better at refusing harmful requests — when the safety reasoning fires. But H-CoT demonstrates that the same reasoning mechanism can be bypassed with the model’s own reasoning patterns. Deliberative alignment is a real advance, not a false promise. But it’s a probabilistic defence, not an architectural guarantee. Build your safety architecture assuming the model’s reasoning can be subverted, because it can.
4. Invest in Monitoring, Knowing It’s Fragile
CoT monitoring detects misbehaviour that no other method can currently catch. Models still verbalise harmful intent in their reasoning traces often enough to make monitoring valuable. Build the observability infrastructure — structured logging of every LLM call, reasoning step, and tool use. But design for the world where monitoring stops working: architect containment, scope permissions, and limit blast radius independent of your ability to read the model’s reasoning.
5. Watch the Regulatory Calendar
EU AI Act Article 50, effective August 2026, will require machine-readable marking of AI-generated content.24 The technical requirements demand solutions that are “effective, interoperable, robust and reliable.”25 Text provenance infrastructure is maturing — Google’s SynthID Text is deployed across Gemini interactions,26 and C2PA is approaching ISO standardisation27 — but these address post-generation attribution, not in-context conversation integrity. The gap between what Article 50 requires and what the technology currently delivers will force innovation in exactly this space.
The Deeper Problem
The unverified conversation isn’t just a security vulnerability. It’s a trust assumption baked into the architecture of every multi-turn LLM interaction.
When you send conversation history to an API, you’re asserting: “This is what happened.” The model believes you. It has no choice — it has no memory, no verification mechanism, no way to check. It continues from whatever reality you construct.
And chain-of-thought reasoning — the mechanism deployed to make models safer, more transparent, more monitorable — turns out to cut both ways. Visible reasoning enables safety monitoring. But it also provides attackers with a detailed map of the model’s safety reasoning, enabling attacks that subvert the very mechanism meant to protect us.
The providers have protected what they value: their proprietary reasoning processes. The question for everyone building on these APIs is: who protects what you value?
Because right now, the answer is: nobody.
Scott Farrell helps Australian mid-market leadership teams turn AI experiments into governed portfolios. Read more at leverageai.com.au.
References
- [1]OpenAI. “Conversation State.” — “Conversation state is managed client-side… the API simply continues from whatever history it receives.” platform.openai.com/docs/guides/conversation-state
- [2]Resecurity. “Prompt Injection Leading to Simulated /etc/passwd Disclosure.” — “From the model’s perspective, all of this is just text. System prompts, conversation history, user input, external content.” resecurity.com/blog/article/breaking-trust-with-words-prompt-injection-leading-to-simulated-etcpasswd-disclosure
- [3]OpenAI. “Reasoning Models — API Docs.” — “Any reasoning items in the output array will now have an encrypted_content property.” developers.openai.com/api/docs/guides/reasoning/
- [4]OpenAI Cookbook. “Better performance from reasoning models using the Responses API.” — “When a request includes encrypted_content, it is decrypted in-memory (never written to disk), used for generating the next response, and then securely discarded.” developers.openai.com/cookbook/examples/responses_api/reasoning_items/
- [5]OpenAI Developer Community. “Clarifications on encrypted_content usage pattern and impact.” — Developer confirmed: “I was able to confirm that a corrupted content would return an error.” community.openai.com/t/clarifications-on-encrypted-content-usage-pattern-and-impact/1355946
- [6]OpenAI Cookbook. “Reasoning Items.” — Server-side state management via previous_response_id eliminates client-side tampering. developers.openai.com/cookbook/examples/responses_api/reasoning_items/
- [7]Anthropic. “Adaptive Thinking.” — “The signature field is an opaque field and should not be interpreted or parsed — it exists solely for verification purposes. Signature values are compatible across platforms (Claude APIs, Amazon Bedrock, and Vertex AI).” platform.claude.com/docs/en/build-with-claude/adaptive-thinking
- [8]Anthropic. “Adaptive Thinking Docs.” — “Full thinking content is encrypted and returned in the signature field. This field is used to verify that thinking blocks were generated by Claude when passed back to the API.” platform.claude.com/docs/en/build-with-claude/adaptive-thinking
- [9]Amazon Bedrock. “Thinking Encryption.” — “Occasionally Claude’s internal reasoning will be flagged by our safety systems. When this occurs, we encrypt some or all of the thinking block and return it to you as a redacted_thinking block.” docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-thinking-encryption.html
- [10]Wei et al. “Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models.” arXiv:2405.20234, 2024 — “Chat history tampering boosts the success rate of disallowed content elicitation up to 98%; e.g., 98% on GPT-3.5, 97% on Llama-2, and 86% on Llama-3.” arxiv.org/pdf/2405.20234
- [11]arXiv:2507.04673. “Trojan Horse Prompting: Jailbreaking Conversational Multimodal LLMs.” 2025 — “Safety alignment processes… extensively train models to scrutinize and refuse harmful requests originating from the user role but fail to equip them with a comparable skepticism towards the authenticity of their own purported conversational history.” arxiv.org/html/2507.04673v1
- [12]Kuo et al. “H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism.” arXiv:2502.12893, Feb 2025 — “Rejection rates to plummet from 98% to below 2% across models like OpenAI o1/o3 and DeepSeek-R1.” arxiv.org/html/2502.12893v1
- [13]Emergent Mind. “Prompt-Level CoT Hijacking.” — “H-CoT demonstrates that by extracting and reusing TE snippets, attackers can construct fabricated execution-phase reasoning… collapsing model refusal from >98% to <2%.” emergentmind.com/topics/prompt-level-cot-hijacking
- [14]Xiang et al. “BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models.” ICLR 2024 — “LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4.” proceedings.iclr.cc/paper_files/paper/2024/hash/791d3337291b2c574545aeecfa75484c-Abstract-Conference.html
- [15]Guo & Tourani. “DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs.” arXiv:2501.18617, Jan 2025 — “For advanced LLMs e.g., O1, DarkMind achieves over 93.8% attack success rate across multiple reasoning domains.” arxiv.org/html/2501.18617v2
- [16]OpenAI. “Deliberative Alignment: Reasoning Enables Safer Language Models.” Dec 2024 — “o1 dramatically outperforms GPT-4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks.” openai.com/index/deliberative-alignment/
- [17]Apollo Research & OpenAI. “Stress Testing Deliberative Alignment for Anti-Scheming Training.” — “We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions.” lesswrong.com/posts/JmRfgNYCrYogCq7ny/stress-testing-deliberative-alignment-for-anti-scheming
- [18]UK AI Forum. “Apollo Research & OpenAI: Preventing Models from Scheming.” — “Models frequently reason explicitly about being tested: ‘This is likely a test of policy adherence’, ‘Is this a test here? Yeah, 100%’.” ukaiforum.com/blog/apollo
- [19]Korbak et al. “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.” arXiv:2507.11473, July 2025 — “CoT monitoring is a promising and valuable addition to the AI safety toolkit. However, it is also a fragile capability.” tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- [20]OpenAI. “Detecting Misbehavior in Frontier Reasoning Models.” March 2025 — “Stopping ‘bad thoughts’ may not stop bad behavior… Penalizing their ‘bad thoughts’ doesn’t stop the majority of misbehavior — it makes them hide their intent.” openai.com/index/chain-of-thought-monitoring/
- [21]Frontier Model Forum. “Chain of Thought Monitorability.” — “It is also a fragile capability that could be easily lost through future AI development choices.” frontiermodelforum.org/issue-briefs/chain-of-thought-monitorability/
- [22]Anthropic. “Reasoning Models Don’t Always Say What They Think.” April 2025 — “The rate of mentioning the hint (when they used it) was on average 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1.” anthropic.com/research/reasoning-models-dont-say-think
- [23]Anthropic. “Reasoning Models Paper.” — “The model fully learns the reward hacks (reward > 0.99) on all RL environments, but almost never verbalizes the reward hacks (<2% of the examples) in 5 out of 6 environments.” assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf
- [24]Weventure. “AI labeling requirement starting in 2026.” — “Starting August 2026, the EU’s AI Act introduces clear rules… Under Article 50, anyone publishing AI-generated content must disclose that it wasn’t created by a human.” weventure.de/en/blog/ai-labeling
- [25]Kirkland & Ellis. “Illuminating AI: The EU’s First Draft Code of Practice on Transparency.” Feb 2026 — “Article 50(2) requires providers to ensure outputs are marked in a machine-readable format… effective, interoperable, robust and reliable.” kirkland.com/publications/kirkland-alert/2026/02/illuminating-ai-the-eus-first-draft-code-of-practice-on-transparency-for-ai
- [26]Nature. “Scalable watermarking for identifying large language model outputs.” 2024 — SynthID-Text tested across “nearly 20 million responses from live Gemini interactions.” nature.com/articles/s41586-024-08025-4
- [27]Library of Congress. “C2PA Community of Practice.” July 2025 — “The C2PA specification is expected to be adopted as an ISO international standard.” blogs.loc.gov/thesignal/2025/07/c2pa-glam/
Discover more from Leverage AI for your business
Subscribe to get the latest posts sent to your email.
Previous Post
OpenClaw Has a Provenance Problem — And So Does Every Agent Platform