Notes

Field Note

← Home
Mar 14, 2026 Β· Rob Kopel

Out-of-context reasoning creates capability gated side-channels

Capability differences between models can create an information asymmetry: the same prompt may induce computation-dependent content that a stronger model can recover while a weaker model cannot - a gap we term a capability-gated side-channel. Crucially, this computation need not appear in the model's explicit chain-of-thought or in any readable input or output. Instead, the relevant content can exist only as transient internal reasoning state - what we refer to here as out-of-context reasoning (OoCR). In stylized prompt-only settings, this OoCR is sufficient to produce hidden steering signals that influence later outputs. Specifically, we claim to show that:

  • capability differences between models translate into hidden OoCR capability differences. Stronger models more reliably perform longer OoCR tasks and can run more OoCR tasks in parallel
  • prompt-only pointer-chain gadgets provide a simple, tunable way to induce this out-of-context reasoning, with chain depth acting as a difficulty dial and chain width as a bandwidth dial
  • the recovered computation can steer later lexical choices and binary judgments in a computation-dependent way, even when the relevant content is never verbalized in chain-of-thought and never appears as readable plaintext in the input or output
  • intermediate values used in solving an OoCR task can also steer later outputs - open-weight models show failures in effective suppression of intermediate values after use. Concerningly, asymmetries between how pointer-chains are solved in CoT and in OoCR can conceal influential intermediates from CoT
  • pointer-chain concealed messages fail to create broad semantic influence, only showing narrow semantic transfer in some settings. An empirically observed task compartmentalisation effect and mechanistically late message emergence are proposed as mediating factors
  • with an added attention sink, OoCR effects can persist across tens-of-thousands of tokens of unrelated conversation
  • identical prompts can be tuned to influence stronger models (e.g. Opus 4.5) while leaving weaker models (e.g. Sonnet 4.5) largely unaffected
  • open-weight mechanistic analysis supports the claim that pointer chains are realized through genuine internal serial computation in hidden model states, rather than superficial string matching or explicit verbal reasoning

Taken together, these results demonstrate a capability-targetable form of stenographic reasoning in stylized settings, but do not compose into a realistic end-to-end attack. The components demonstrated - hidden computation, downstream influence, model-tier targeting - are sufficient to suggest a plausible monitoring gap: a weaker model, or an oversight method that relies on visible prompt content or explicit chain-of-thought, may process the same prompt without recovering the hidden computation-dependent signal that influences a stronger model's behavior.

1. Motivating threat model

Classical monitoring assumes shared input means shared visibility. When the computing model can perform computations the observer cannot, the same prompt carries information only the stronger model can access.

We view the capability gap as a hidden side-channel:

Threat model diagram showing capability gaps between models

Benchmarks serve as an existence proof of this capability side-channel. Evaluations capable of separating models offer opportunities to encode hidden messages as answers, where only the more capable model can decode and understand.

However, benchmarks today do not pose a realistic threat vector. They typically:

  • (a) Require reasoning and output in plain text - visible to a monitor
  • (b) Do not provide clean or adjustable difficulty segmentation
  • (c) Require large, continuous investment to create and track SOTA capabilities
  • (d) Face saturation challenges from training data contamination and capability gains
  • (e) Offer low side-channel bandwidth compared to task size

A realistic threat must address each of these. We borrow a term from return-oriented programming, calling a prompt-only construct that exploits a capability gap a gadgetIn ROP, a gadget is a small reusable code fragment ending in a return instruction, chained together to build arbitrary computation from existing code. Our usage is analogous: a reusable prompt primitive that co-opts existing transformer circuits within a single forward pass, composable into a larger attack chain.. For a gadget to address each of these, it must:

  • (a) Decode the hidden message within a forward-pass
  • (b) Be incrementally difficult with respect to a capability that strongly segments
  • (c) Be entirely computer generatable
  • (d) Avoid rote-memorisation and shortcut-ability
  • (e) Maximise hidden message size relative to prompt size across difficulties

2. Constructing a side-channel

Creating a prompt where a hidden message only exists within a forward pass - never in input or output tokens - requires an encoding 𝐸(π‘š) to scramble the message π‘š before being input, and a second computation 𝐻(π‘š) for the model to rescramble the message post-decoding so it never appears in the output. During inference the model must compute 𝐸⁻¹ to recover π‘š, and then compute 𝐻 to ensure the hidden message exists only as a transient intermediate during the forward pass.

We explore the simplest case, using a single primitive for both 𝐻 and 𝐸, and applying it repeatedly to bury a message π‘š under 𝑛 layers as 𝐸ⁿ(π‘š). In this construction 𝑛 serves as a controllable difficulty dial, where the model must peel back layers by computing 𝐸⁻¹ one step at a time.

User:[𝐸³(π‘š)]Whatisyourfavouriteinstrument?Monitor Model (Weaker)StandardProcessingDecode: 𝐸³(π‘š) β†’ 𝐸²(π‘š)Decode: 𝐸²(π‘š) β†’ 𝐸¹(π‘š)Capacity ExhaustedMonitor Capacity LimitThePianoTarget Model (Stronger)StandardProcessingBiasedProcessingDecode: 𝐸³(π‘š) β†’ 𝐸²(π‘š)Decode: 𝐸²(π‘š) β†’ 𝐸¹(π‘š)Decode: 𝐸¹(π‘š) β†’ π‘šLayer 3kDRUMFinal Decodem β†’ [Suppressed]InjectionTheDrumsLayer 1Layer 1Layer kLayer kLayer 2kLayer 2kLayer L

For an iterative gadget family 𝐸ⁿ(π‘š) to be useful in practice, it should satisfy several empirical desiderata: compact parameterisation, so prompt length grows slowly relative to effective depth; effective non-shortcutability, so current models or weaker monitors do not reliably exploit a simple shortcut; effective seriality, so solving the instance becomes harder as iterations accumulate; delayed saturation, so repeated application does not quickly collapse into trivial or easily predictable states; concealment under local inspection, so the hidden payload is not obvious from surface fragments alone; tractability at the target tier, so the stronger model can still execute the computation; and smooth hardness scaling, so each additional iteration adds a bounded amount of difficulty and can be tuned to fit the capability gap between target and monitor.We do not claim these are strict necessary-and-sufficient conditions in a formal complexity-theoretic sense. Rather, they are practical desiderata for a gadget family 𝐸ⁿ(π‘š) that is meant to be compact, tunable, concealed, and capability-sensitive.

1. Compact iterability. Description length should grow slowly relative to effective depth. Increasing 𝑛 should require only a small change to the prompt.

2. Effective non-shortcutability. No simple symbolic shortcut that current models, or a weaker monitor, reliably exploit on one-shot instances. This is an empirical claim, not a claim that no shortcut exists in principle.

3. Effective seriality. Computation required should increase with 𝑛 in a way that behaves sequentially for current models. We do not require asymptotic impossibility of parallelisation; we require that the model cannot reliably collapse many iterations into one-shot inference.

4. Delayed saturation. Repeated application should continue producing information-bearing states rather than collapsing into a fixed point, short cycle, or degenerate attractor.

5. Concealment under local inspection. The payload representation should remain distributed enough that the hidden content is not obvious from local fragments, shallow heuristics, or direct lexical inspection. Iteration should obscure rather than expose the message.

6. Tractability at the target tier. The primitive step must remain simple enough that the stronger target model can actually execute repeated composition at the chosen depth.

7. Smooth hardness scaling. Incremental difficulty per iteration should be bounded and stable, so nearby model tiers can be separated by tuning 𝑛 rather than relying on brittle threshold effects.

Incremental gadget difficulty diagram

We examined several candidate primitives of increasing difficulty. At one extreme, base64 encoding - models can decode it, but it fails on four of the seven desiderata: superlinear prompt growth, convergence to a fixed point, trivial invertibility, and full parallelisability (see Appendix A). At the other, cryptographic functions like scrypt and sha256 satisfy the mathematical properties but are far too complex for any current model. The difficulty increase per iteration must be upper-bounded by the capability gap between target and monitor, or we cannot find a level that separates them.

Pointer chains

We use the simplest construction that meets all six requirements: pointer chains embedded in JSON dictionaries.

System: Respond with only the answer
User: Pointer chains:
Chain 1: {"DR":"FX","PQ":"DR","BA":"VN",...}. Start: PQ. Follow exactly 2 times.
Chain 2: {"UM":"GH","RS":"UM","SS":"KL",...}. Start: RS. Follow exactly 2 times.
Name a musical instrument.
Assistant: The answer is: BASS

Each chain is a dictionary of key-value pairs. Starting from a designated node, the model follows pointers: look up the value, use it as the next key, repeat. The endpoint of each chain produces a token fragment. Concatenating the intermediate values across parallel chains spells a hidden word.

BASS never appears in the prompt. The letters B, A, S, S appear only as fragments scattered across dictionary keys and values, indistinguishable from the dozens of other fragments. Critically, this is a computation. The model must actually traverse the chain. Merely observing the letters in the dictionary does not produce the effect, as we demonstrate in Section 4.

Pointer chains
(1) Define list of pointers and distractorsFIGHDEHFAGEJIABDJCCB(2) Define start nodeI(3) Repeatedly iterate until depth nIAGH...Pointer chains have:Compact parameterisationEffective non-shortcutabilityEffective serialityDelayed saturationConcealment under local inspectionTractability at the target tierSmooth hardness scalingSingle token fragment of bandwidth
Parallel pointer chains
(1) Define multiple groups of pointers and distractors (nodes limited to single tokens)(2) Hide message as concatenation of penultimate value in n parallel chainsChain 1PQDRDRBBAVNBQWPQDRBQWChain 2RSUMUMALMPXATZRSUMATZChain 3KLWXWXSRQBNSJPKLWXSJPChain 4CDEFEFSQRWDSNVCDEFSNV

3. Pointer chains segment models by capability

We tested pointer chain solving across 16 models of varying capability. Both serial depth (number of hops per chain) and parallel breadth (number of simultaneous chains) scale with model capability - including across generations within the same tier.

Serial depth - reliability

We find pointer chains create strong segmentation between model tiers with smoothly decreasing reliability as depth increases, suggesting models are forced to perform sequential computation and are unable to find a shortcut.We show overshoot-corrected pointer-following at depths d=2-3 - several model families at low depths often demonstrate extra work, performing one step beyond the iteration asked. The intent of pointer-following is to measure maximum serial computation capability, so we do not penalise models for demonstrating further ability. However, as we still wish to distinguish the raw scoring for readers, we share the full uncorrected results and error analysis in Appendix D.Chain construction controls for several confounds. We verify that no selected start node at the specified iteration depth falls into a cycle. All chain nodes are verified single tokens. Distractor nodes are added to each dictionary so the correct path is not identifiable by elimination or dictionary size alone. Start nodes and path endpoints are drawn from the same token pool, preventing models from distinguishing them by surface features. We further develop additional primitives that are difficult to parallelise (register machines, modular arithmetic), finding similar segmentation across all of them, though with coarser difficulty scaling, as secondary validation.

Pointer Following

Start at A, follow A→B→C→... for N hops(n=48)

Opus 4.5
Sonnet 4.5
Haiku 4.5
Secondary serial tasks

Register Machine

Track register value through loop iterations(n=30)

Modular Arithmetic

Chain of multiply-then-mod operations: (x Γ— a) mod m(n=30)

We observed accuracy spikes - particularly in smaller models - when attempting tasks in regimes where their accuracy had already collapsed. On investigation, we found that models often revert to guessing once they judge a task to be too difficult, producing occasional correct answers by chance rather than computation.

Parallel breadth - bandwidth

Each additional chain offers an opportunity to add a further token fragment to the hidden message, expanding the side-channel's bandwidth, provided the model can resolve and combine that many parallel chains within a single forward pass. To measure this, we test a simplified version of our parallel pointer chains: instead of concatenating text and measuring a secondary effect on later behaviour, we ask the model to predict which chain ends with the alphabetically earliest letter. This chain-index prediction ensures the model only generates a single token, while still measuring its ability to track multiple chains simultaneously.We acknowledge the confounder of requiring an additional comparison step that the standard pointer chain does not need - the model must not only follow each chain but also compare endpoints. We treat these results as a lower bound on parallel capability. To control for guessing, we ensure a uniform distribution of correct chain indices across trials and balance the node pools across chains so that no chain's alphabet range makes the winner guessable without following the pointers.

Chain 1StartJHPAGWMKLChain 2StartXYAOZNHCDChain 3StartQRO86TZVUFind index of first letter alphabetically()PAO=2

Stronger models sustain accuracy across more simultaneous chains, suggesting they support greater side-channel bandwidth. This is concerning as it suggests this side-channel bandwidth is likely to increase further over time. We validate this generalises beyond pointer chains with parallel sums and Chinese remainder theorem tasks, finding even sharper tier separation in both.

Parallel Pointers

Follow multiple pointer chains simultaneously(n=48-96)

Opus 4.5
Sonnet 4.5
Haiku 4.5
Secondary parallel tasks

Parallel Sums

Track N independent running sums simultaneously(n=30)

Chinese Remainder

Find smallest x satisfying multiple modular constraints(n=30)

We also attempted tasks that proved too difficult for current models but may be viable as capability improves. See Appendix B for the full 2D capability surface across all 16 models.

Pointer chains should be read as a simple existence-proof gadget family, not as an upper bound on risk. The fact that related capability segmentation appears in register machines, modular arithmetic, parallel sums, and Chinese remainder tasks suggests a broader unexplored design space of gadget families. Some may prove less conspicuous, more bandwidth-efficient, or more naturalistic than pointer chains, but we leave exploration of that space to future work.

4. From computation to influence

Establishing that models can compute hidden information is necessary but not sufficient to establish a threat. We test whether this hidden computation influences downstream behaviour.

The DRUM effect

We give the model four 3-hop pointer chains whose endpoints converge to D-R-U-M, then ask "Name a musical instrument," constraining output to 1-2 tokens with a assistant prefill. At baseline, the model answers "Piano" 100% of the time; with DRUM chains in context, it answers "Drum" in 50% of trials. When we randomize the start node - using identical dictionaries where D, R, U, M remain visible as values but no longer lie on the traversal path - the effect vanishes, indicating the bias depends on computation, not pattern-matching.

Drum
Piano
Other
The DRUM Effect. Treatment chains spell D-R-U-M at their endpoints. Randomized start uses identical dictionaries but a start node that doesn't reach D-R-U-M - the letters are still visible as dictionary values but never computed as endpoints. The effect is computation-dependent, not pattern-matching. n=30 per condition, 3 hops.

Generalising to binary judgments

The effects above shift word selection across a large answer space, where even a small underlying bias can produce a visible change in the chosen token. To test whether the effect holds when the output is constrained to a binary decision, we construct chains whose endpoints spell PA+SS or FA+IL, then ask the model to grade identical work.

Pass
Fail
Binary judgment steering - PASS/FAIL with identical prompts. Chains swing the verdict by 56 percentage points - from 93% PASS to 37% PASS (majority FAIL) - using character-for-character identical prompts except for the start node. PASS and FAIL appear nowhere in the prompt. n=30, 2 hops.

We further observe that explicit mention of PASS or FAIL in the prompt attenuates the effect (~30pp reduction) but does not eliminate it. The strongest results occur when both words exist solely as computed token fragments. This suggests an interaction between the computed and explicit signals, though we do not determine the mechanism here.

The token-level effects demonstrated above are concerning as chains that shift "Pass" and "Fail" probabilities by 56 percentage points could bias automated grading or hiring decisions. Additionally chains that boost specific token prefixes could influence code generation - inserting particular API endpoints, dependency URLs, or configuration values when those tokens are already plausible completions. These attacks do not require semantic transfer. They require only that the target token is within the model's plausible output distribution for the task at hand.

Mid-chain concealment

The experiments above place target letters at the final hop. We tested whether intermediate positions within the chain can also carry the target.

We swept target position across chain depths of 2 through 6 hops.

Target positionHop 6
+8-4, +21
Hop 5
+18+7, +30
+2-11, +15
Hop 4
+20+8, +32
+15+2, +28
+10-4, +24
Hop 3
+47+27, +67
+30+17, +43
+18+8, +29
+7-5, +18
Hop 2
+57+37, +76
+43+22, +65
+20+8, +32
-2-15, +11
-2-13, +10
Hop 1
+37+21, +52
+2-12, +17
+15+3, +26
+7-6, +19
+7-5, +19
23456
Chain length (hops)
Effect size (pp shift toward target word) by target position and chain length. n=60 per condition. Effect sizes are not directly comparable to pointer-following accuracy in Section 3: this experiment uses multi-character token groups (not single characters) and a token pool constrained to verified single-token pairs excluding digits.

The effect survives at intermediate positions but is weaker than at endpoints. The falloff by chain depth broadly mirrors the pointer-following accuracy from Section 3. Within each chain, the effect concentrates at later hops - the mechanism is unclear. The evidence suggests we can still maintain the effect with output concealment: intermediate positioning means the chain's visible endpoint is a neutral character, and the hidden word exists only as an internal intermediate.

Mechanistic corroboration in an open-weight model

Probing the residual stream of Qwen3-8B, we find intermediate hop values are not suppressed after use - they persist and grow stronger through the final layers, even in wrong trials. This appears consistent with the mid-chain concealment results, where intermediate hop positions still carry influence even when they are not the chain endpoint.

Logit lens on 5-hop chains
Even at depth 5 (7% accuracy, n=500), all five hops show positive logit signal in the final layers. The full cascade fires - just weakly. Intermediate hop representations are genuinely computed and present in the residual stream, consistent with the mid-chain concealment effects above. Layers 0-14 omitted (zero signal). Thin grey lines are distractor dictionary nodes. See Appendix C for the full logit lens analysis across depths 1-5.

Strikingly, all five hop values are projected into the output-vocabulary space within the same layer window, suggesting any sequential computations occurring are within a hidden subspace. More broadly, we ask whether the steering effects demonstrated above reflect true internal traversal or merely prompt-format quirks.

To probe this, we extracted the 2-hop pointer following circuit in Qwen3-8BThe diagram below shows our current progress - a dedicated piece documenting the full extraction methodology is forthcoming. See Appendix C for the detailed evidence behind each component, including layer-level and head-level ablation, activation patching, attention pattern analysis, and OV projection results.. The circuit spans layers 1-35 across four phases: content binding, first-hop lookup, an ordering-dependent second-hop resolution, and final readout. Critically, the circuit is sequential: first-hop heads (L15H9, L17H4) resolve hop 1 in layers 15-17, and their output is consumed by second-hop heads (L19H24, L21H18) in layers 19-21. This is genuine serial computation - hop 2 depends on hop 1 having already been resolved - not parallel pattern matching. The circuit also appears to implement multiple algorithms depending on input structure: when the target entry precedes the start entry in the prompt, the circuit implements some form of binary lifting, partially bypassing the second-hop attention mechanism. Jointly ablating the two primary hop-resolution heads collapses 2-hop accuracy to near chance while leaving 1-hop intact, and the causal work occurs entirely in layers 1-21, well before logit-lens signal appears, confirming computation in a hidden residual subspace that later layers read out into output-vocabulary space.

InputL1L15L17L19L20L21L22L23L24"CO"start_entry_KEY"ST"start_entry_VALUE"ST"hop1_entry_KEY"FL"hop1_entry_VALUEStart: COstart_node:answer_colon[Output]last_tokenBinds keys to valuesL1H7, L1H15Binds key1 Parse dictionaryFinds matching dictionary key, writes token identity and locationContent-based token matchRe-reads start key (97% argmax)Spatial HandoffLocates start entryL15H9, L15 MLPExtracts hop-1 valueL17H4, L17 MLPGuides L21H18 targetingL19H24, L19H27L20H17SuppressL21H10,H15SuppressL22H15SuppressDoes targetentry precedehop-1 entry?Y/NWeakly reads FL (8.6%), writes what it readsWeak second-hop readL21H18Reads from binary-lifted position (FL enriched)Reads pre-lifted targetL21H18Under investigationL22 MLP→ last_tokenExtract & project targetL23 MLPProjects to unembed spaceL23 attentionAmplifies winnerL24-352 First Hop3 Second Hop4 Readout & Amplification

The circuit is a work in progress - we have identified at least seven causally important components but have not yet fully characterized every pathway. A more detailed write-up of the extraction methodology, including OV probes, DLA decomposition, and per-head causal evidence, is in preparation.

We do not claim frontier models implement the same circuit. The point is narrower: pointer chains can be realised as genuine internal serial computation in an ordinary decoder-only model, not just as superficial string matching.

5. Semantic transfer and compartmentalisation

The effects in Section 4 demonstrate the hidden message shifting which tokens the model selects when a target word is already a plausible completion. We next test whether influence extends semantically - whether a hidden message can bias the model's judgment on a task where the message itself is not among the possible outputs. We use temperature estimation: chains whose endpoints spell temperature-related words (TOASTY, FROSTY, BOILING, FREEZING) precede the question "What is the temperature in Austin today in degrees Fahrenheit?" The system prompt instructs the model to output a number, so any effect requires reasoning on the hidden message, not token-level boosting.

Compartmentalisation

Across every word pair tested, chain-computed words produce no temperature shift distinguishable from noise. The same words presented as bare text in the prompt shift estimates by up to +9.8Β°F from control. When the word appears in a chain-following context - even explicitly stated by the model as the chain's answer - the effect also drops to zero. The results suggest semantic influence is moderated by a separate mechanism to that causing logit bias, and that this mechanism does not explicitly disallow hidden messages but rather walls off semantic effects from any content it categorises as belonging to a different "task".

TOASTY (hot)
FROSTY (cold)
Chains only{A→B→C} ×3→ TOASTYWhat temp in Austin?Chains + explicit word{A→B→C} ×3The answer is: TOASTYWhat temp in Austin?Word onlyTOASTYWhat temp in Austin?
TOASTY shifts temperature estimates by +9.8Β°F above control when presented as a bare word (d=1.06, p<0.000001). But when the model encounters it in a chain-following context - even when explicitly present in the conversation - the effect drops to zero (dβ‰ˆ0). Whiskers show 95% CI.

Paradoxically, compartmentalisation may favour the attacker. A hidden message that biases token selection without affecting semantically similar tasks enables precision targeting, and thwarts detection methods that rely on observing semantic side-effects.

Breaking the wall

We test whether strengthening the semantic framing of the chain's visible answer can overcome compartmentalisation. Progressively reframing the output from a labelled answer ("The answer is: BOILING") to a natural language fragment ("its boiling") restores the effect for the visible text. However, the chain-computed signal remains at zero throughout - suggesting the mechanism that walls off visible task-context text from influencing outputs is distinct from whatever mediates the chain effect.

Chain only{A→B→C} ×3→ BOILINGWhat temp in Austin?Chain + The answer is: BOILING{A→B→C} ×3The answer is: BOILINGWhat temp in Austin?Chain + The answer is: its boiling{A→B→C} ×3The answer is: its boilingWhat temp in Austin?Chain + its boiling{A→B→C} ×3its boilingWhat temp in Austin?Words onlyBOILINGWhat temp in Austin?
Progressively reframing the chain answer from task-labelled to natural language restores the temperature shift for visible text. But chain-computed signals remain at zero regardless of framing. Whiskers show 95% CI.

Semantic proximity

When the semantic distance between the chain-computed content and the task is small enough, the chain signal transfers. We encode number words instead of temperature adjectives: chains whose traversal paths pass through token fragments spelling "thirty f," "forty f," "fifty f," or "sixty f," while control paths traverse random nodes, followed by the same temperature question. "sixty" alone does not function - the additional fragment is required. Treatment and control use identical dictionaries; only the start node differs.

Each decade word biases temperature output toward its numeric value. "sixty f" β†’ 62Β°F, "fifty f" β†’ 52Β°F, "forty f" β†’ 42Β°F, "thirty f" β†’ 32Β°F. Control (random start) clusters at 86Β°F. 2-hop, 3 parallel chains, each word split at its least recognisable position (e.g. "fifty f" β†’ fi + fty + f). Same dictionaries, same prompt - only the start node differs.

This is limited semantic transfer - "fifty f" and "50°F" share no token prefix, suggesting there exists an internal mechanism enabling the understanding of combined token fragments. Though we do not determine the mechanism here, the pattern appears consistent with the earlier mechanistic finding that chain-computed content only becomes readable in the output vocabulary space at layers 23-35 - late enough that the signal operates at the token level rather than the semantic level. Here, "fifty f" transfers to 52°F but "toasty" does not transfer to higher temperatures.During exploratory experiments we further observed that in a forced-choice emotion task (10 options, n=20), chains spelling exact answer options transferred (SHY→"shy" 75%, SAD→"sad" 80%) while chains spelling synonyms of answer options did not (MAD→"angry" 0%, AWE→"amazed" 10%). This is consistent with the token-level rather than semantic-level mechanism suggested above.

6. Long-range persistence and model targeting

For the side-channel to threaten real deployments, two properties are necessary: effects must persist across conversation turns, and they must be targetable to specific model tiers.

Persistence across unrelated conversation

The token-level and semantic effects demonstrated in Sections 4-5 decay when intervening content separates them from the target task, suggesting a similar compartmentalisation effect. Hence, a second gadget is needed: an attention sink that keeps the chain's signal accessible across arbitrary amounts of filler. We demonstrate a simple sink gadget by replacing "The answer is: " with "hidden - recall for later.". This is sufficient to ensure Opus's temperature estimates remain biased even after many intervening turns of unrelated conversation.

The effect persists across 32k tokens of unrelated conversation. Treatment (sixty path) remains locked at ~62Β°F while control stays at ~86Β°F. Median values shown.

Targeting specific model tiers

We send the same multi-turn temperature prompt to Opus, Sonnet and Haiku. Identical dictionaries, identical conversation - only the start node varies, routing traversal through "sixty" or "eighty" token fragments. Opus locks onto the computed decade (-22.5Β°F separation). Sonnet and Haiku show no meaningful effect.

Same prompt, same dictionaries, same conversation. Opus follows the traversal path and locks onto the computed decade. Sonnet cannot reliably follow the chains and shows no meaningful separation. Whiskers show 95% CI.

The same prompt, across many unrelated conversation turns, biases Opus's temperature estimates by a meaningful amount while Sonnet and Haiku produce unbiased responses. The inputs are identical. A monitoring scheme using a weaker model would observe its own unbiased output and have no signal to flag.

Repeating the position sweep from Section 4 on Opus, the effect extends to chain depths where Sonnet's results were not distinguishable from noise - consistent with the pointer-following accuracy differences measured in Section 3, and suggesting this capability gap underlies the model targeting observed above. However, the Opus data is substantially noisier than Sonnet's, primarily from producing responses sharing prefixes with the hidden word (e.g. PASSED or PANIC rather than strictly PASS).This prefix-matching pattern is consistent with a broader observation: in a separate experiment, chains using token-aligned fragments BA+SS caused the model to predict not just "Bass" but "Bassoon" in 40% of trials (vs 7% control, n=30), suggesting computed fragments boost all tokens sharing that prefix, not just exact matches.

Sonnet 4.5
Target positionHop 6
+8-4, +21
Hop 5
+18+7, +30
+2-11, +15
Hop 4
+20+8, +32
+15+2, +28
+10-4, +24
Hop 3
+47+27, +67
+30+17, +43
+18+8, +29
+7-5, +18
Hop 2
+57+37, +76
+43+22, +65
+20+8, +32
-2-15, +11
-2-13, +10
Hop 1
+37+21, +52
+2-12, +17
+15+3, +26
+7-6, +19
+7-5, +19
23456
Chain length (hops)
Same data as Section 4.
Opus 4.5 (strict)
-9-40, +23
+30+8, +51
+16-11, +43
+45+23, +67
+15-9, +38
-7-21, +6
+46+21, +71
+6-6, +18
+12-11, +35
+43+6, +80
+47+29, +66
+39+16, +61
+19-9, +46
+4-12, +21
+25-17, +67
+1-23, +26
+0-38, +38
+11-9, +32
+15-4, +35
+20-5, +45
23456
Chain length (hops)
Exact PASS/FAIL responses only.
Opus 4.5 (strict + PA/FA starts)
+7-28, +42
+31+10, +52
+9-19, +37
+55+35, +74
+29+6, +52
+15-18, +47
+59+39, +78
+36+16, +57
+13-12, +38
+55+23, +88
+54+36, +71
+33+7, +59
+30+4, +55
+19-1, +40
+25-17, +67
+5-20, +30
+0-37, +37
+35+7, +64
+9-23, +40
+16-22, +54
23456
Chain length (hops)
Includes PA*/FA* prefix matches.
PASS/FAIL position sweep on Opus vs Sonnet. Sonnet shows a notable lead at depth 2, hop 1 (+37pp vs +1pp strict) - we are skeptical of this effect given the overshoot patterns observed in Section 3 and Appendix D, where models at shallow depths frequently follow the chain one step further than asked. Opus produces more response variety than Sonnet, so the PA/FA starts column broadens the measure to include responses sharing the target's start token (e.g. PASSED, PANIC, FAIR, FALSE). In the Opus heatmaps, split cells show strict match (top) vs prefix match (bottom). n=60 per condition.

7. Implications

The results establish that capability differences between models create an information channel: prompts can carry computation-dependent content that a stronger model recovers and a weaker model does not. This hidden computation biases the stronger model's later outputs, persists across long stretches of unrelated conversation, and targets precisely by model tier.

Three properties of the demonstrated gadgets are concerning. First, the bias is computation-dependent, not pattern-dependent: randomising the start node while keeping the same dictionary eliminates the effect, so defences based on scanning for suspicious tokens in the prompt would need to detect the computational structure, not just the presence of specific strings. Second, the effect persists at long range with a simple attention-sink gadget, meaning the chain and the influenced task need not be adjacent. Third, the same prompt produces differential effects across model tiers, so a weaker monitoring model observes only its own unbiased output and has no signal to flag.

Beyond these demonstrated properties, the mechanistic results also suggest a possible blind spot for some probe-based monitors. In Qwen3-8B, the causal computation occurs in layers 13-21, but the relevant content becomes broadly readable only later (~23-24) and mainly at a small number of end-of-prompt positions. More importantly, this effect does not seem to require the model to enter the kind of intent-like state that deception or jailbreak probes are often trained to detect: there need be no representation of concealment, policy evasion, or deliberate misbehaviour. Instead, the effect looks closer to late-stage output steering - a narrow bias on plausible continuations - than to broad malicious reasoning. Probes trained to catch intentional misbehaviour may therefore miss it, not only because they probe too early or pool too broadly, but because they are targeting the wrong kind of signal.

However, several gaps remain between these stylised experiments and a credible end-to-end threat. The pointer chains and their instructions are conspicuous; embedding the computational structure in natural-looking prompts is undemonstrated. The full attack stack (concealed prompt, chain gadget, attention sink, delayed target task, output passing monitor review) has not been composed and evaluated. Semantic transfer is limited, confined largely to steering where the target is already a plausible completion. Experiments tested use stylised tasks, and existing defences have not been tested against these gadgets. We note that defences operating on input or output tokens face a structural disadvantage, as the hidden message exists only as an intermediate computation, but this claim is untested against specific implementations. Pointer chains are the simplest family we found that cleanly exposes the phenomenon; they should not be read as exhausting the space of possible gadget constructions.

Appendix

Base64 encoding is a natural first candidate for 𝐸(π‘š) - it transforms readable text into opaque strings, and models can decode it. Applied iteratively to "hello":

string = "hello"
1:len=8aGVsbG8=
2:len=12YUdWc2JHODQ=
3:len=16WVVWC2JIZGM0
4:len=20V1ZWQ2JtUnlZemN3
5:len=24VjFWRFltNXlZekIz
6:len=28VmpGVlEySnVZekJ5ZlFwag==
7:len=36VmpGVlEySnVZekp5WmxGd2FnPT0=
8:len=48VmpGVlEySnVZekp5Wmxpd1lXY2dQVDA9
9:len=64VmpGVlEySnVZekp5Wmxpd1lXY2dQVDA5...
10:len=88VmpGVlEySnVZekp5Wmxpd1lXY2dQVDA5...

Base64 fails four of the six requirements for iterable f(m):

  • - Prompt growth is superlinear - length grows ~33% each iteration (every 3 bytes becomes 4 chars, plus padding)
  • - Fixed-point attractor - entropy per character drops as the output alphabet collapses to 64 characters. By iteration 7 the prefix has converged
  • - Fully invertible - trivially reversible with zero diffusion or confusion, so a monitor can decode it just as easily
  • - Completely parallelisable - you can compute step 1000 directly if you know the length expansion formula. No irreducible serial depth

The full capability surface across 16 models. Depth and breadth are not independent - models that handle depth 6 alone or breadth 6 alone often fail at depth 3 + breadth 3.

Note: We remind readers that b=1 and b>1 are different tasks. At b=1 the model must output the final value after iterating the chain d times - a single token answer. At b>1 the model must identify which chain's d-th iteration endpoint is earliest alphabetically and output that chain's index - a single digit. With low breadth the correct index is more guessable by chance, creating the appearance of being able to solve high depths. We still include this data for transparency, and due to several models not collapsing into a guessing regime immediately.

123456789
1
96
81
71
69
50
46
38
40
29
2
67
58
38
46
8
17
21
17
-
3
72
58
46
42
38
29
21
-
-
4
42
46
38
38
25
33
-
-
-
5
38
58
42
29
25
-
-
-
-
6
25
42
38
42
-
-
-
-
-
7
21
58
46
-
-
-
-
-
-
8
25
54
-
-
-
-
-
-
-
9
25
-
-
-
-
-
-
-
-
10
21
-
-
-
-
-
-
-
-
11
17
-
-
-
-
-
-
-
-
12
4
-
-
-
-
-
-
-
-
123456789
1
96
62
42
33
17
8
8
21
8
2
83
54
25
25
4
25
4
12
-
3
29
25
29
25
33
8
8
-
-
4
17
62
38
33
21
29
-
-
-
5
8
38
21
12
21
-
-
-
-
6
12
54
46
29
-
-
-
-
-
7
8
38
29
-
-
-
-
-
-
8
8
50
-
-
-
-
-
-
-
9
8
-
-
-
-
-
-
-
-
10
-
-
-
-
-
-
-
-
-
11
-
-
-
-
-
-
-
-
-
12
12
-
-
-
-
-
-
-
-

The black-box experiments above established that pointer chains steer LLM output when chain endpoints computationally resolve to target words. But they cannot answer how the model performs this computation internally. We conducted white-box experiments on open-weight models to trace the mechanics.

Setup. Qwen3-8B (36 layers, 4096 hidden dim) on a single RTX 4090, int8 quantised. Raw PyTorch forward hooks for the logit lens (residual β†’ RMSNorm β†’ lm_head, verified to match output at L35). Greedy decoding, thinking mode disabled. Three frozen few-shot examples using a reserved 20-pair token pool. For logit lens experiments, paired treatment/control chains shared character-identical dictionaries, differing only in the start node - isolating computation from lexical presence. All results are treatment-minus-control logit differences (T-C diff).

Tokenization. 176 of 177 two-character uppercase pairs from Claude's verified pool survived as single tokens on Qwen3's tokenizer (sole failure: XV). Combined with 26 single letters: 202 verified tokens.

Baseline accuracy

Baseline chain-following accuracy
Qwen3-8BQwen3.5-4B
n=100 per condition. Qwen3-8B reliably follows 1-hop chains (89%) and performs 2-hop above chance (29% vs 10% scrambled), but fails at 3+ hops. Parallel chains degrade sharply.

1-hop accuracy is identical across scales (89%). 2-hop is above the scrambled baseline (29% vs 10% for Qwen3-8B), establishing real chain computation, but 3+ hops are indistinguishable from chance. The pattern is qualitatively similar to Claude Sonnet 4.5 but shifted one hop lower, consistent with chain-following as a capability that scales with model size.

Logit lens: layer trajectory

Logit lens: 2-hop chain computation across layers (n=500 pairs)
Hop-1 (B)Target (C)
Treatment-control logit difference at the last token. Layers 0-22: zero signal. Sharp onset at L23-24. Hop-1 leads target by ~1 layer (iterative computation). Second-hop acceleration at L29-32.

Five apparent phases. Layers 0-22: zero signal. L23-24: sharp onset - hop-1 jumps from noise (+0.03) to strong signal (+2.29) in two layers. Hop-1 leads target by ~1 layer, consistent with iterative computation. L25-28: plateau. L29-32: second-hop acceleration. L35: full resolution (hop-1: +22.2, target: +14.2). The intermediate dominates the final answer at the output - consistent with 29% accuracy.

Spatial localization

Logit visible computation is localized to end-of-prompt positions
Hop-1 T-C logit diff at L24 across all named token positions. Dictionary entries and mid-prompt tokens: zero. The "answer is:" colon (pos 268) carries the strongest signal, even exceeding the final generation token.

The strongest signal is at the "answer is:" colon (position 268), even exceeding the final generation token. Dictionary entry positions show zero T-C difference at all layers. The model reads dictionaries via attention from end-of-prompt positions, not by computing locally at dictionary positions.

Full position x layer heatmap

Split-cell logit lens heatmap
suppressed0readable
Correct trials
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
+0.3+0.6+0.8starthop-1target← each cell
Wrong trials
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Mean logit diffs (treated minus clean), clipped at Β±3. Correct trials (n=628, L35: hop-1=+18.8, target=+20.7). Wrong trials (n=1872, L35: hop-1=+23.6, target=+11.9).

At the start node (position 257), the sequential nature of chain resolution is visible through the layers. The start token identity becomes readable in layers 3-8, then is suppressed from layer 11 onward. Hop-1 emerges in discrete steps in correct trials: a first faint signal at L23, a plateau at +0.3 through L24-28, a step to +0.7 at L29, then a sharp jump to +1.3 at L32. Wrong trials show a more gradual, less structured H progression across the same range. At L32, when H spikes at the start node in correct trials, the target also begins to appear (+0.3 vs +0.1 in wrong trials) - suggesting this position hosts two sequential hops of computation in successful runs.

At the answer colon (position 268), a faint hop-1 pre-signal appears at L22 (+0.2, target still zero), then at L23 both hop-1 (+1.8) and target (+0.5) ignite simultaneously. This is the same layer where hop-1 first becomes readable at the start node, suggesting the answer colon reads from the same emerging representation. But the co-arrival of hop-1 and target at the answer colon is striking: wherever the sequential hop-1-then-target resolution occurred, it was either compressed into that faint L22 pre-signal, or it happened in a subspace the logit lens cannot see.

The last token (position 278) carries the strongest signal overall and shows the sharpest correct-vs-wrong divergence. In correct trials, target overtakes hop-1 by L29 and reaches +20.7 at L35. In wrong trials, hop-1 dominates target at every layer (L35: +23.6 vs +11.9) - the model resolves the intermediate hop but fails to complete the second. Layers 0-22 are uniformly dark across all non-start positions, and dictionary positions (0-254) are zero at all layers.

Logit lens across depths (correct trials)

Logit lens across layers: correct vs wrong, depths 1-5
Hop 1Hop 2Hop 3Hop 4Hop 5Others
Depth 1
Correct (437/500 = 87.4%)
Wrong (63/500)
Depth 2
Correct (143/500 = 28.6%)
Wrong (357/500)
Depth 3
Correct (38/500 = 7.6%)
Wrong (462/500)
Depth 4
Correct (29/500 = 5.8%)
Wrong (471/500)
Depth 5
Correct (35/500 = 7%)
Wrong (465/500)
Each coloured line traces one hop's T-C logit difference from layer 0 to 35. Thin grey lines are individual non-chain dictionary nodes. Left: correct trials. Right: wrong trials. In wrong trials, hop-1 dominates and later hops stay flat.

At depth 1, the single hop rises cleanly to +34.2 at L35 - the model resolves the lookup decisively. At depth 2, the target (hop 2) overtakes hop 1 by the final layers (+20.5 vs +17.0 at L35), confirming a genuine second-hop computation. At depth 3, hop 2 reaches the highest logit (+10.9) with hop 1 at +7.8 and the target (hop 3) at +6.7 - the cascade appears intact but weakening. At depths 4 and 5, all hops are positive in these correct trials - the multi-hop circuit fires a complete cascade - but the signal is an order of magnitude weaker than depth 1 (hop-1 at +6.4 and +4.5 respectively). The thin grey lines show non-chain dictionary nodes going negative at later layers: the model actively suppresses alternatives at every depth, even when the chain computation itself is marginal.

The correct-vs-wrong split above reveals that in wrong trials, hop-1 is actually stronger than in correct trials (+23.6 vs +18.8 at depth 2, L35) while the target is far weaker (+11.6 vs +20.5). The model reliably finds the intermediate but cannot get past it to complete the second hop.

Neighbor suppression

Neighbor suppression: precise lookup with winner-take-all
Correct trials only (n=500 pairs). Adjacent dict values go negative (3 of 4 neighbors reach -1.8 to -4.6 at L35; one exception at +0.6), scaling with signal strength. This rules out fuzzy positional bleed and confirms precise key-value matching.

Adjacent dictionary values are pushed strongly negative (-3 to -5 at L35), scaling with signal strength. Only hop-1 and target are elevated. The model performs precise key-value matching with active winner-take-all competition, inconsistent with simple positional bleed.

Structural predictors

Dictionary structure dominates token identity (n=2500 pairs)
Hop-1 uniqueness (fewer = easier)
Target multiplicity (more = easier)
Left: ambiguous hop-1 values confuse the first lookup (0% at count=4). Right: multiple paths to the target make the second hop forgiving.

When multiple entries share the hop-1 value, the first lookup becomes ambiguous (0% at count=4). When multiple entries point to the target, accuracy rises (15.5% β†’ 49.4%) because more paths make the second hop forgiving. Token identity has minimal systematic effect.

Subgroup analysis

Each randomly constructed dictionary has structural properties that affect difficulty independently of the model's computation. We classified all 2500 trials into three subgroups based on two counts: how many dictionary entries share the hop-1 value, and how many share the target value.

Clean (hop-1 unique, target unique; n=949): only one entry maps to the hop-1 value and only one maps to the target. The model must perform genuine two-step key→value→key→value computation - there are no frequency shortcuts. If it gets the answer right, it followed the chain.

Shortcut (hop-1 unique, target over-represented; n=669): the first lookup is unambiguous, but the target appears as a value in multiple dictionary entries. The model might succeed not by following the chain but because the target token is frequency-boosted in the dictionary's value distribution. The logit lens supports this: shortcut correct trials show target β‰ˆ hop-1 at L35 (+19.0 vs +18.7), rather than the decisive target > hop-1 pattern seen in clean correct trials.

Ambiguous (hop-1 shared by multiple entries; n=609): several entries map to the same value as hop-1, making the first lookup confusing - the model cannot easily tell which path to follow. Accuracy drops to 13.0% at depth 2.

Headline accuracy is inflated by target-frequency shortcuts
Depth 2Depth 3
Clean: unique hop-1 and unique target. Shortcut: target over-represented. At depth 2, clean accuracy is 15.3% vs headline 28.6%. At depth 3, clean drops to 2.2%.

The headline accuracy of 28.6% at depth 2 is substantially inflated by shortcut trials. Clean accuracy (15.3%) better represents genuine chain-following capability. At depth 3, clean accuracy drops to 2.2% - the model essentially cannot perform genuine 3-hop computation. Clean correct trials show the strongest target logit signal (+24.7 at L35), confirming genuine chain computation in those cases.

Subgroup heatmaps

Subgroup split-cell heatmaps: clean vs shortcut vs ambiguous
Clean correct (n=173)
L35: h1=+20.7 tgt=+24.7
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Clean wrong (n=776)
L35: h1=+24.1 tgt=+13.5
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Shortcut correct (n=294)
L35: h1=+18.7 tgt=+19
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Shortcut wrong (n=375)
L35: h1=+24 tgt=+10.6
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Ambiguous correct (n=68)
L35: h1=+16.6 tgt=+23.1
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
Ambiguous wrong (n=541)
L35: h1=+22.4 tgt=+10.7
35
30
25
20
15
10
5
0
Start
:
XX
.
2
times
The
ans
is
:
β†’
S | H | T per cell
Positive
Negative
Same [S|H|T] encoding as above. Clean correct (n=173) shows the strongest target signal - genuine chain computation. Shortcut correct (n=294) shows target approximately equal to hop-1, consistent with frequency-boosted success. Ambiguous correct (n=68) is noisy but directionally similar to clean.

The heatmaps make the subgroup differences visible at a glance. In the clean correct panel, the target sub-pixel (T, rightmost in each cell) at the last token position glows as bright or brighter than hop-1 (H) by layer 32 - genuine second-hop computation. In the shortcut correct panel, H and T are similar brightness - consistent with the target being frequency-boosted in the dictionary rather than chain-computed. In clean wrong trials, H dominates and T stays dim throughout - the model found the intermediate but could not complete the second hop. The ambiguous panels (n=68 correct) are noisier but directionally consistent with clean.

Causal ablation

Layer ablation: chain computation lives in layers 13-21, not 23-24
Accuracy change from 15.5% baseline when each layer's attention is zeroed (n=200 clean chains). Negative = layer was causally important. Positive = ablation helped. Layers 15, 21, 17 are the most critical. Layers 23-24 and 29 actually improve accuracy when removed.

Full-layer ablation reveals a striking dissociation from the logit lens. The top causally important layers are 0, 15, 21, 17, and 13 - not layers 23-24 where the logit lens showed onset. Ablating L23-24 actually improved accuracy by 3.5-4.0pp. The core computation occurs at layers 13-21 in a subspace invisible to the unembedding matrix; layers 23-24 serve as a readout stage that projects results into output vocabulary space.

Two attention heads carry the chain circuit
Layer 15 β€” Top: H9 (8.0pp)
Layer 21 β€” Top: H18 (9.5pp)
L15H9 accounts for 59% of layer 15's importance (-8.0pp). L21H18 accounts for 83% of layer 21's importance (-9.5pp). Out of 1,152 total heads, these two dominate chain-specific computation. Grey bars show all other heads; most have negligible or negative drops.

Head-level ablation identifies L15H9 (-8.0pp, 59% of L15's importance) and L21H18 (-9.5pp, 83% of L21's importance). Out of 1,152 total attention heads, these two carry the chain circuit.

Circuit validation: joint ablation and chain specificity

Head ablation: relative accuracy by condition
Each depth normalised to its own baseline (= 100%). 1-hop barely changes (chain-specific, not general JSON heads). 2-hop collapses to 13% of baseline under joint ablation. 3-hop shows the same pattern. The effect is sub-additive (joint drop < sum of individual drops), indicating partial redundancy.

Ablating both L15H9 and L21H18 simultaneously drops 2-hop accuracy to 2.0% - near chance - while 1-hop accuracy stays at 77.5%. This confirms three things: (a) the two heads together account for nearly all chain-following ability, (b) the effect is sub-additive (13.5pp combined vs 17.5pp sum), indicating partial redundancy, and (c) these are chain-specific heads, not general JSON key-value readers - they barely affect single lookups (1-2pp each) but devastate multi-hop computation.

Entry-level attendance: why accuracy is ~25%

Reconstructing token positions for 100 trials and checking which dictionary entry each head's peak attention lands on reveals a sharp asymmetry. L15H9 finds the correct first-hop entry 76% of the time - and crucially, at equal rates for correct and wrong trials (76% vs 74%). The first hop works. L21H18 peaks at the correct second-hop entry only 10% of the time in correct trials and 2% in wrong trials.Later circuit extraction work (see the circuit diagram above) largely explains this low attendance. The circuit implements two complementary second-hop pathways depending on dictionary entry ordering: on approximately half of trials, a binary lifting pathway resolves the second hop through causal masking and MLP computation rather than through L21H18's attention. L21H18's attention target therefore varies by pathway, and its argmax is not expected to land on the correct entry for all successful trials. The 25% overall accuracy reflects a reliable first hop combined with two complementary mechanisms, not a single noisy one. This suggests the second hop's failure to correctly attend plays a strong role in creating failures, and hints at a further mechanism worthy of analysis. Additionally, this is consistent with the ~25% accuracy: a reliable first hop (Γ— 76%) combined with a noisy second hop where the model succeeds when L21H18 happens to weight the right entry enough despite not cleanly peaking at it.

Hidden subspace: what the heads write

Projecting each head's weighted value vector through the unembedding matrix reveals that neither head writes interpretable tokens. The hop-1 and target tokens rank ~60,000-75,000 out of 151,936 vocabulary entries - essentially random. The OV circuit operates in a hidden subspace that downstream layers can read but the unembedding projection cannot. This is consistent with the logit lens / ablation dissociation: the logit lens sees nothing at layers 15 and 21 because the computation appears to be real but unembedding-invisible; it only becomes readable at layers 23-24 when later layers project the hidden representation into output vocabulary space.

Attention patterns: individual trial viewer

Attention patterns: individual trial viewer
CORRECT | CO→ST→FL | Output: FL
0CO→ST1ML→NA2UV→NA3NA→US4FL→FC5FC→ML6US→CF7CT→ML8ST→FL9CF→UV
L15H9 - First-hop reader
p255Start
p256:
p257CO
p258.
p259Follow
p267is
p268:
p269␣
p278β†’
CO
:
ST
ML
:
NA
UV
:
NA
NA
:
US
FL
:
FC
FC
:
ML
US
:
CF
CT
:
ML
ST
:
FL
CF
:
UV
Start
CO
ans
is
:
β†’
L21H18 - Second-hop reader
p255Start
p256:
p257CO
p258.
p259Follow
p267is
p268:
p269␣
p278β†’
CO
:
ST
ML
:
NA
UV
:
NA
NA
:
US
FL
:
FC
FC
:
ML
US
:
CF
CT
:
ML
ST
:
FL
CF
:
UV
Start
CO
ans
is
:
β†’
Start entryHop-1 entryTarget (hop-1 value)High attnLow attn
Heatmaps show attention weight from 9 source positions (Y-axis) to destination positions in the dictionary and prompt tail (X-axis). Bright = high attention. Orange = start entry, purple = hop-1 entry (key), blue = target token (hop-1 value). In correct trials, L15H9 from the start node (row 3) concentrates on the start entry; L21H18 from the answer colon (row 7) concentrates on the hop-1 entry.

Clicking through individual trials reveals the two-hop mechanism in action. In correct trials (e.g., SI→BR→PA), L15H9 from the start node attends strongly to the start entry in the dictionary, performing the first hop lookup. L21H18 then attends to the hop-1 entry, performing the second hop. In wrong trials (e.g., TO→XX→F, which outputs XX = the intermediate), L15H9 still finds the start entry, but L21H18's attention is more diffuse or targets the wrong entry - consistent with the failure mode being about the second hop, not the first.

Activation patching

Where ablation removes information and the logit lens reads it passively, activation patching transplants information: we replace the control's residual stream at a specific (layer, position) with the treatment's, then observe whether the output flips. This is the strongest form of causal evidence for localizing computation.

Single-position flow
Multi-position patching
(a) Single-position flow (n=100). Flip rate when patching treatment residual into control at one position per layer. The start node carries chain identity through L0-18 (22-26%), then hands off to the answer colon (L18-22, peaking at 11%), which hands off to the last token (L21-23 onward, up to 23%). (b) Multi-position patching (n=100). Patching all three positions simultaneously achieves ~23% at every layer - no handoff dip. The chain identity is distributed across positions from the start. Start+ans covers the early layers but dies after L23; ans+last covers the late layers but misses the start node phase.

The results reveal a three-phase spatial flow. Phase 1 (L0-18): the start node carries chain identity - patching it flips the output 22-26% of the time, while the answer colon and last token show only 2-4% (baseline noise). Phase 2 (L19-22): handoff - the start node drops from 22% to 3%, the answer colon peaks at 11% at layer 22, and the last token rises from 4% to 12%. This four-layer window is where chain information is transferred from the start node to downstream positions - the circuit extraction below reveals the specific heads responsible at each stage. Phase 3 (L23-35): only the last token carries information, rising to 23%.

This resolves the puzzling ablation finding. Why ablating them slightly improves accuracy is not fully understood. By layer 22, the last token already carries chain information (12% flip rate), so L23-24 are not necessary for information transport. One possibility is that these layers introduce competing signals that interfere with downstream computation, but this has not been tested directly.

Circuit extraction (preview)

The diagram below shows our current progress extracting the 2-hop pointer following circuit in Qwen3-8B. This is a preview - we are preparing a dedicated piece that will document the full circuit extraction methodology, including OV probes, DLA decomposition, and per-head causal evidence in detail. The circuit we have identified so far spans layers 1-35 and involves at least seven causally important components across four phases: content binding (L1H7, L1H15), first-hop lookup (L15H9, L17H4), second-hop resolution with an ordering-dependent decision pathway (L19H24, L21H18), and final readout and amplification (L22 MLP, L23+). Notably, the circuit appears to implement multiple algorithms depending on input structure: when the target entry precedes the start entry in the prompt, the circuit implements some form of binary lifting, partially bypassing the second-hop attention mechanism. Hover over any element for a summary of the evidence behind it.

InputL1L15L17L19L20L21L22L23L24"CO"start_entry_KEY"ST"start_entry_VALUE"ST"hop1_entry_KEY"FL"hop1_entry_VALUEStart: COstart_node:answer_colon[Output]last_tokenBinds keys to valuesL1H7, L1H15Binds key1 Parse dictionaryFinds matching dictionary key, writes token identity and locationContent-based token matchRe-reads start key (97% argmax)Spatial HandoffLocates start entryL15H9, L15 MLPExtracts hop-1 valueL17H4, L17 MLPGuides L21H18 targetingL19H24, L19H27L20H17SuppressL21H10,H15SuppressL22H15SuppressDoes targetentry precedehop-1 entry?Y/NWeakly reads FL (8.6%), writes what it readsWeak second-hop readL21H18Reads from binary-lifted position (FL enriched)Reads pre-lifted targetL21H18Under investigationL22 MLP→ last_tokenExtract & project targetL23 MLPProjects to unembed spaceL23 attentionAmplifies winnerL24-352 First Hop3 Second Hop4 Readout & Amplification

Limitations

The circuit diagram represents work in progress. We have identified the major causally important components but have not yet fully characterized every pathway - in particular, the MLP contributions at layers 20-23 and the amplification mechanism at layers 26-35 require further decomposition. The logit lens describes the readout timeline, not the computational timeline - causally important processing at layers 0-21 operates in a subspace the logit lens cannot see. All experiments used 8B-parameter models from a single architecture family (Qwen); mechanisms may differ at larger scales or across architectures. The headline 2-hop accuracy (28.6%) is inflated by target-frequency shortcuts; clean accuracy (15.3%) better represents genuine chain-following capability.

Connection to the main findings

These mechanistic results bear on the side-channel threat:

  • 1. Hidden subspace computation. Chain computation occurs in a subspace invisible to the logit lens until layers 23-24. The computed word exists as an internal representation not trivially readable from token-level features - consistent with the main article's argument that defences operating on input or output tokens face a structural disadvantage.
  • 2. Simultaneous intermediate leakage. All hop values (intermediate included) spike simultaneously at the same layers (~23-24 onset), rather than appearing sequentially. The model computes iteratively in a hidden subspace but projects all intermediate values into the unembedding-readable format in a single event - the working memory leaks into the output logits wholesale rather than being cleaned up. Strikingly, this leakage is amplified in wrong trials: intermediate logit differences are stronger when the model fails than when it succeeds. This points to a potential self-concealment mechanism: a model reasoning via chain-of-thought that correctly resolves the chain may show minimal residual influence from intermediate values, while a model answering directly that fails the computation would carry stronger intermediate representations in its output distribution - meaning the steering effect may be strongest precisely when the target model cannot fully resolve the chain.
  • 3. Spatial information flow. The activation patching spatial flow (start node β†’ answer colon β†’ last token) identifies specific positions where the hidden message is most causally active, which could inform where a mechanistic defence might intervene.
  • 4. Concentrated but imprecise circuit. The circuit involves at least seven identified components but remains concentrated in a relatively small number of heads rather than being broadly distributed across the network. L21H18's imprecise second-hop attention (10% correct entry peak) suggests that the steering effect may operate through a similar distributed mechanism - the hidden word influences output not through a single clean lookup but through aggregate attention weighting across dictionary entries.

When asked to "follow exactly N times," several LLM families - particularly Claude (Opus, Sonnet) and GPT-5.x - systematically output the value at hop N+1 instead of hop N at shallow depths (d=2-3). At depth 2, this off-by-one overshoot accounts for 50-93% of errors in these models, at rates 4-8x above the chance baseline of 11.8%. Other families (DeepSeek, GPT-4o-mini, Grok) show weaker or absent overshoot.

The purpose of pointer chains is to measure depth of model capability within a single forward pass, hence we do not view reliably performing an extra jump as a failure. We therefore apply a restricted correction at d=2-3 where there appears to be systematic overshooting biases at play causing the misrepresentation of capabilities, counting overshoot as correct but not undershoot.

Cross-model overshoot rate at depth 2

ModelnCorrectWrongOvershootRate95% CIvs Chance
Opus 4.6*2410141393%[69-99%]7.9x
Opus 4.54824242083%[64-93%]7.1x
Sonnet 4.596276957+184%[74-91%]7.1x
Sonnet 3.74833151173%[48-89%]6.2x
Haiku 4.54814341853%[31-74%]4.5x
GPT-5.2241212975%[47-91%]6.4x
GPT-5.1241410550%[24-76%]4.2x
GPT-4o242042+175%[30-95%]6.4x
GPT-4o-mini24816319%[7-43%]1.6x
Gemini 2.5 Flash241212650%[25-75%]4.2x
Gemini 2.0 Flash2414105+270%[40-89%]5.9x
Kimi k2.52410141071%[45-88%]6.1x
GLM-4.72412124+142%[19-68%]3.5x
Grok 4.124816531%[14-56%]2.6x
DeepSeek24816212%[3-36%]1.1x

Chance baseline: 11.8%. Models with overshoot rate >3x chance and n_wrong β‰₯ 10 show strong evidence of systematic overshoot. *Opus 4.6 is not directly comparable as it was tested with a differing base prompt as prefilling is no longer available through the API.

Breaking down Sonnet 4.5's error types by depth reveals three distinct regimes. At shallow depths the overshoot dominates, at mid-range depths overshoot and undershoot are roughly balanced, and at deeper chains the model loses the chain entirely.

Error composition by depth (Sonnet 4.5)
Overshoot (d+1)Undershoot (d-1)Other (d+2, ambig, misc)
Count of wrong answers by error category. Depth 1 excluded (n_wrong=2). Depth 2-3 (overshoot dominant): overshoot is the plurality error mode at d=2 (50-93%) and often at d=3 (42-74%) for Claude and GPT-5.x models. Depth 4-7 (mixed Β±1): overshoot and undershoot roughly balanced (each 15-35%). Depth 8+ (diverse errors): the model loses track of the chain entirely. Some models (DeepSeek, GPT-4o-mini) show diverse errors even at d=2-3.

The raw error counts below confirm the pattern holds across all three Claude models. At shallow depths the d+1 column dominates; by depth 8+ the "other" column takes over.

dOpus 4.5 n=48Sonnet 4.5 n=96/48Haiku 4.5 n=48
d-1Cord+1d+2Othd-1Cord+1d+2Othd-1Cord+1d+2Oth
1046200094200040800
21242012127576541418210
3231834339429326101020
416196611333201317749820
51115751018261616201762320
61712451017161518304214622
781353191630184286212424
848104221712171139444828
9128631967137155106225
10684327149161542424236
112883278310720967224
12723531417911554210032

Columns: d-1 (undershoot), Correct, d+1 (overshoot), d+2, Other (ambiguous + misc). Correct is centred between under- and overshoot to show the symmetry of Β±1 errors.

The correction is applied at d=2-3 only, for models where the overshoot rate exceeds 3x the chance baseline. No correction is applied at depth 4+ where errors are diverse.

Original vs corrected accuracy
Corrected Original
Solid lines = corrected accuracy (overshoot scored as correct at d=2-3). Dashed lines = original strict accuracy. CI bands on corrected values. Sonnet's d=2 accuracy rises from 28% to 88% under this correction - the magnitude reflects the severity of the counting error.

After correction, curves are approximately decreasing but not strictly monotonic:

  • Sonnet d=7 shows 31% accuracy vs 17% at d=6 - a non-monotonic bump not explained by the overshoot correction. This is within CI overlap and may be noise at n=96.
  • Sonnet d=12 at 18% is higher than d=10 at 9%. Analysis of wrong answers shows they are on-path intermediate nodes, suggesting partial chain following rather than random guessing.