
Long context is a control problem

2026-05-03 · ai · rlm · long-context

Three runtimes (RLM, λ-RLM, Claude Code) tested head-to-head on the same long-context tasks on Haiku 4.5. The architectural question underneath them all is where the recursion lives. In the first cut, λ-RLM was the only runtime to name the first ship the Pequod meets; its cost-bound math was broken in the implementation; RLM without RL training wastes its own affordance. What Prime Intellect is actually betting on, and the open research question.

Every long-context architecture is a bet about which actor in the system gets to decide how to recurse over the document. The decision can live in five places: inside the model's window, inside a fixed combinator chain, inside an LLM-driven Python loop, inside a tool-using agent's runtime, or outside the runtime entirely in the task spec. Each position trades one property of the system for another, and the failure shapes that emerge from those trades are the actual subject of this post.

I spent a day testing three of those positions head to head: RLM, λ-RLM, and Claude Code, on the same model and the same tasks. That was the first cut. A reader pointed me at Theodoros Galanos's Recursive by Design, which argues for a fourth position I had missed, and reading him against my own piece I noticed several findings I had under-supported. So I ran the experiment again. The second cut patched a bug in λ-RLM, added Plan-First as a fourth runtime, built an anchor-presence ablation to test what I was claiming about question shape, and tried in-context priming as a cheap proxy for the "RL training would fix this" claim.

The rerun changed three of my findings and surfaced two more I had missed. This is the corrected field log.

The Pequod question

Each runtime got the same prompt against the same document. The question is "What is the first ship the Pequod meets?", the document is a 1.2-million-character copy of Moby-Dick, and the model in every case is Claude Haiku 4.5 with an identical compute budget. The gold answer is Albatross.

RLM           →  "Goney/Albatross"                                            (CORRECT in the rerun,
                                                                                 "gave up" in the original)
Claude Code   →  "The Jungfrau"                                               (wrong; that's Chapter 81)
λ-RLM         →  "the Goney (or Albatross), followed by the Town-Ho..."       (CORRECT; Goney is the
                                                                                ship's name on first
                                                                                introduction in Ch. 52)

Two runtimes land on the right ship. One commits confidently to the wrong one. The interesting question is why this particular distribution showed up, and what it says about where the recursion lives. RLM's flip from "gave up" in the first run to "correct" in the rerun is itself one of the findings, covered below.

flowchart LR
    Spec["task spec<br/>(Plan-First)"] --> Math["fixed combinator chain<br/>(λ-RLM)"]
    Math --> Model["LLM-driven Python loop<br/>(RLM)"]
    Model --> Tools["tool-using agent<br/>(Claude Code)"]
    Spec -.->|most constrained| Tools
    style Spec fill:#0a1018,stroke:#48c6ff,color:#cfd6df
    style Math fill:#0a1018,stroke:#3dffb5,color:#cfd6df
    style Model fill:#0a1018,stroke:#b993ff,color:#cfd6df
    style Tools fill:#0a1018,stroke:#ffb85c,color:#cfd6df

The four bets

The recent literature collapses several different ideas under shared names. "Recursive language model" is the worst offender; "structured long-context system" is the runner-up. Disentangling them is the prerequisite for any comparison.

Four architectures for long context: Plan-First with task-spec-derived decomposition, λ-RLM with typed combinators, RLM with Python REPL and sub-LLMs, Claude Code with file-system tools.

Plan-First is the position Galanos argues for, and which I added as a fourth runtime for this experiment. The decomposition gets computed from the task spec before any leaf LLM call runs. A structured-report task uses the template's dependency tree. Code QA uses the function-definition boundary structure. A needle-in-haystack timeline uses the temporal window structure. The model only runs leaf prompts; everything structural happens in code. The bet: the structure of the work is computable a priori from the task spec, so making the model rediscover it every run wastes both calls and reliability.

λ-RLM comes from a paper where the decomposition is a fixed combinator chain: Split → Filter → Map → Reduce. The plan (k*, d, τ*) is computed by math before any leaf LLM call runs, from a lookup table on document size and task type. The LLM only ever sees one of two prompt templates, the relevance filter or the leaf QA template. The bet: typed structural control delivers proofs about termination, cost bounds, and depth, properties that depend on the control structure being decidable before any model call runs.
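To make the chain concrete, here is a minimal sketch of Split → Filter → Map → Reduce with the model-facing steps passed in as plain callables. The function names, the fixed-width splitter, and the toy lambdas standing in for the model are all assumptions for illustration; this is not the paper's implementation, which derives its plan from a lookup table rather than taking k as an argument.

```python
from typing import Callable, List


def split(text: str, k: int) -> List[str]:
    """Cut text into at most k fixed-width chunks (a simplification)."""
    size = max(1, -(-len(text) // k))  # ceiling division, never zero
    return [text[i:i + size] for i in range(0, len(text), size)]


def run_chain(
    text: str,
    question: str,
    k: int,
    is_relevant: Callable[[str, str], bool],
    answer_leaf: Callable[[str, str], str],
    reduce_answers: Callable[[str, List[str]], str],
) -> str:
    chunks = split(text, k)                                    # Split
    kept = [c for c in chunks if is_relevant(question, c)]     # Filter
    partials = [answer_leaf(question, c) for c in kept]        # Map
    return reduce_answers(question, partials)                  # Reduce


# Toy stand-ins for the relevance-filter prompt, the leaf QA prompt,
# and the reduce step. A real run would call a model in each of them.
document = "x" * 30 + "Albatross" + "y" * 21
result = run_chain(
    document,
    "What is the first ship the Pequod meets?",
    k=3,
    is_relevant=lambda q, chunk: "Albatross" in chunk,
    answer_leaf=lambda q, chunk: "Albatross",
    reduce_answers=lambda q, partials: partials[0] if partials else "no answer",
)
```

The control flow never consults the model about how to decompose; the callables only ever see one chunk or one list of partials at a time.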

RLM is the architecture in Alex Zhang's October 2025 blog post and the one Prime Intellect is championing as the paradigm of 2026. The model gets a Python REPL. The long input lives as a variable in that REPL. The model writes code (peek, slice, grep, summarize, recurse via llm_query) and decides decomposition strategy at runtime. The bet: give the model the affordance, then train it via RL to use the affordance well.
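The loop shape can be sketched in a few lines, under heavy assumptions: `run_repl_loop`, `next_code`, and the convention that the model signals completion by assigning `answer` are hypothetical names chosen here, and the "model" is a scripted list of code strings rather than an LLM. Only `llm_query` and the idea of the input living in a variable come from the description above.

```python
from typing import Callable, List


def run_repl_loop(
    context: str,
    next_code: Callable[[List[str]], str],
    llm_query: Callable[[str, str], str],
    max_iterations: int = 6,
) -> str:
    # The long input lives in the REPL namespace, not in the prompt.
    env = {"context": context, "llm_query": llm_query, "answer": None}
    history: List[str] = []
    for _ in range(max_iterations):
        code = next_code(history)   # the model decides what to run next
        exec(code, env)             # unsafe outside a sandbox; sketch only
        history.append(code)
        if env["answer"] is not None:
            return env["answer"]
    return "gave up"


# Scripted stand-in for the model: peek, grep, then recurse on a slice.
scripted_steps = [
    "peek = context[:20]",
    "hit = context.find('Albatross')",
    "answer = llm_query('Name the ship in this passage.', context[hit:hit + 9])",
]
repl_result = run_repl_loop(
    context="Chapter 52. The Albatross. " + "filler " * 100,
    next_code=lambda history: scripted_steps[len(history)],
    llm_query=lambda instruction, text: text,  # echo stub, not a model
)
```

Everything interesting happens inside `next_code`, which is exactly the part the bet says has to be trained.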

Tool-using agent harness is what Claude Code and Codex are. Model plus a small system prompt plus Read, Grep, Bash. The long context lives on disk. The model navigates rather than ingests. The bet: a frontier model with good tools delivers practical long-context performance today, without inventing a new scaffolding.

These sit on a single axis, despite the surface diversity: how much of the orchestration logic is the model's choice. Plan-First is the most constrained; the model never decides anything structural. λ-RLM is next; the model picks YES/NO on chunks and answers leaf prompts. RLM sits in the middle; the model has freedom inside a single recursive scaffold. Claude Code is the most adaptive; every step is a model decision.

These are the four bets the field is currently making.

What I ran

The setup, both cuts together:

Model. Claude Haiku 4.5 as the leaf model for all four runtimes. No RL training on any of them; they all use the off-the-shelf API model.
Original samples. The largest sample (the 128k bin) from each of oolong (single-doc QA), sniah (sequential needle-in-haystack), and codeqa (code understanding) in LongBench-v2.
Anchor ablation. Each base sample paired with an anchor-helpful and anchor-absent question variant. Same gold per pair, paraphrased question wording.
λ-RLM context window. K = 750,000 chars, close to Haiku's actual 200k-token window. Run after patching _Phi to honor plan.depth as an explicit recursion budget (Finding 2 below).
Claude Code. claude --bare --print --output-format stream-json --allowed-tools=Read,Grep,Glob,Bash. The --bare flag strips hooks, skills, MCP, and CLAUDE.md auto-discovery, so the comparison measures the runtime alone.
Per-call traces plus budget tracker. Every leaf call routes through a TracingAnthropicClient that records exact input/output tokens, real dollar cost, latency, and prompt head/tail to a JSONL. Per-mod and total caps abort the run if exceeded.
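A minimal sketch of the tracing-plus-cap idea, assuming the model call is wrapped as a plain function that returns text and token counts. `TracedCaller`, `BudgetExceeded`, the record fields, and the per-token prices are all hypothetical; this is not the `TracingAnthropicClient` used in the runs, and it does not touch the real SDK.

```python
import io
import json
import time
from typing import Callable, TextIO, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the cap."""


class TracedCaller:
    def __init__(
        self,
        call_model: Callable[[str], Tuple[str, int, int]],
        usd_per_input_token: float,
        usd_per_output_token: float,
        cap_usd: float,
        sink: TextIO,
    ) -> None:
        self.call_model = call_model
        self.usd_in = usd_per_input_token
        self.usd_out = usd_per_output_token
        self.cap_usd = cap_usd
        self.sink = sink
        self.spent_usd = 0.0

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        text, tokens_in, tokens_out = self.call_model(prompt)
        latency_s = time.perf_counter() - start
        cost = tokens_in * self.usd_in + tokens_out * self.usd_out
        self.spent_usd += cost
        record = {
            "input_tokens": tokens_in,
            "output_tokens": tokens_out,
            "cost_usd": cost,
            "latency_s": latency_s,
            "prompt_head": prompt[:80],
            "prompt_tail": prompt[-80:],
        }
        self.sink.write(json.dumps(record) + "\n")  # one JSONL line per call
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.4f}, cap ${self.cap_usd:.4f}")
        return text


# Usage with a fake model and made-up prices.
sink = io.StringIO()
caller = TracedCaller(
    call_model=lambda prompt: ("ok", 1000, 10),
    usd_per_input_token=1e-6,
    usd_per_output_token=5e-6,
    cap_usd=0.002,
    sink=sink,
)
caller("first prompt")
try:
    caller("second prompt")
    aborted = False
except BudgetExceeded:
    aborted = True
records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Writing the record before checking the cap means the call that trips the budget still shows up in the trace.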

The three base samples:

Dataset	Doc chars	Question (anchor-absent variant)	Gold
oolong/128k	1,200,000	"What is the first ship the Pequod meets?"	Albatross
sniah/128k	371,686	"List what Steven Rodriguez did during 11, 2004 in chronological order"	9 events spanning Nov 7–29
codeqa/128k	1,200,000	"Which keyword arguments are recognized by the function that calculates the derivative?"	h, method, direction

oolong and codeqa (≈300k tokens each) exceed Haiku's 200k-token window. sniah fits.

Headline numbers

Four runtimes, three samples, after the patches. The table reports wall time, call counts, and substantive correctness. Correctness is graded manually because the bench's token-F1 metric is broken for free-form output.

                  oolong-128k         sniah-128k          codeqa-128k
                  -----------         ----------          -----------
RLM               10 calls / 49s      ~7 calls / 26s      4 calls / 21s
                  CORRECT             refused             gave up
λ-RLM (patched)   40 calls / 67s      2 calls / 6s        36 calls / 113s
                  CORRECT             refused (off-topic) PARTIAL/CORRECT (was "wrong" pre-patch)
Plan-First        24 calls / 43s      14 calls / 24s      110 calls / 173s
                  gave up             1 of 9 events       CORRECT (2/3 gold + 1 valid)
Claude Code       24 calls / 59s      5 calls / 12s       35 calls / 73s
                  wrong (Jungfrau)    partial (1 of 9)    partial (2 of 3 args)
                                                          ($0.21 total cost
                                                           across all 3 CC runs)

Five things in this picture were surprising enough to be worth a finding each.

Finding 1: the runtime's strategy decides anchor sensitivity

The Pequod's first encounter is Chapter 52, titled "The Albatross." The chapter title carries no mention of the Pequod meeting anything. Chapter 81, much later, is titled "The Pequod Meets The Virgin," and the Virgin is the Jungfrau. Grep on "Pequod meets" lands you in Chapter 81 and commits you to the Jungfrau. That is what Claude Code did, and it produced a confidently wrong answer in 24 grep-and-read calls.

The first version of this post drew a clean conclusion: coverage beats search when the question carries no anchor. λ-RLM's structural coverage was robust to the misleading anchor because the runtime never grepped on the question's words. The rerun's anchor-presence ablation tested whether that pattern generalizes.

For each of the three base tasks I built two question variants. Same gold per pair. One variant carries a literal surface anchor near the answer's location in the document. The other paraphrases the question to strip surface forms.

pair	anchor-helpful	anchor-absent
pequod	"In Chapter 52, the Pequod has its first encounter with another whaling ship. What is its name?"	"What is the first ship the Pequod meets?"
rodriguez	"List what Steven Rodriguez did during November 2004..."	"There is exactly one individual whose activity is documented across the first three weeks of November 2004. List that person's activities..."
derivative	"Which keyword arguments are recognized by the function that calculates the derivative?"	"Locate the function that estimates the rate of change of a single-variable function using finite differences..."

I wrote my predictions down before running anything. I expected the anchor-helpful variants to land more answers, especially for runtimes whose strategy can grep on a literal string.

The actuals broke the prediction in 5 of 9 cases, in the same direction:

pair	runtime	anchor-helpful	anchor-absent
pequod	RLM	"Rose Bud" (wrong)	gave up
pequod	λ-RLM	"Bachelor" with "chapters shown begin at 116"	"Goney/Albatross" (CORRECT)
pequod	Plan-First	"Samuel Enderby" (wrong)	"Rachel" (wrong)
rodriguez	RLM	"physics paper, no Rodriguez"	finds Rodriguez, lists events
rodriguez	λ-RLM	physics-paper summary (off-topic)	physics-paper summary (off-topic)
rodriguez	Plan-First	Nov 29 only (1 of 9)	Nov 11 only (1 of 9)
derivative	RLM	identifies `_zetasum` (wrong fn)	identifies `fd` (wrong fn)
derivative	λ-RLM	partial (one kwarg)	`diff` with correct kwargs (CORRECT)
derivative	Plan-First	"derivative, n, direction, method"	"n, direction, method"

The Pequod helpful question reads "In Chapter 52..." and λ-RLM responded with "the chapters shown begin at 116." Same context, same model, same runtime, two question wordings, two different chunks accepted by the filter. The literal phrase "Chapter 52" carried less weight inside the filter's content model than the question's surrounding rhetoric.

On Rodriguez, the helpful question carrying "Steven Rodriguez" was a worse prompt for both RLM and λ-RLM than the absent version. The document is mostly a physics paper with diary entries scattered through it. Both runtimes summarized the physics. The absent question pushed RLM into a structural search for what kind of person could fit, and that path led to the diary.

The right framing is this: the property that matters is whether the runtime's strategy actually uses the question's surface form. λ-RLM's structural coverage is robust to anchor variation because its filter is content-level rather than surface-level. Claude Code's grep strategy is sensitive to anchor presence, which becomes a liability when the anchor is misleading (the Pequod case). The original framing reached for a more general property than the data supports.

The Pequod sample stays a clean demonstration of the original claim. The generalization shifts to the runtime's strategy rather than the question's wording.

Finding 2: the λ-RLM bug suppressed correctness as well as inflating cost

In the first cut, λ-RLM with K=750k and a 1.2M-char document in both cases showed two very different execution profiles on two datasets:

oolong/128k  (Moby-Dick prose)  →   40 calls /  67s
codeqa/128k  (Python source)    →  563 calls / 1372s   (≈23 min)

The codeqa run made 14× more calls and took about 20× longer than the oolong run despite identical configuration and identical document size. The math said they should be the same. The planner output for codeqa was identical to oolong's:

[λ-RLM] n=1,200,000 chars  |  task=qa  |  k*=20  τ*=60000  d=1

Same plan, dramatically different execution. The defect was in _Phi's recursion structure:

Comparison of planner vs actual execution. The planner says k*=20, τ*=60000, d=1, expecting ~41 calls. But _Phi's recursion bottoms out on len(P) ≤ τ*, not on a depth counter. _Split's word-boundary snapping with ±20% margin can produce chunks larger than τ*, which fall through to recursive splitting. Effective depth becomes 2 for many branches, blowing up to 563 calls and 23 minutes.

The recursion bottomed out on len(P) ≤ τ* alone. _Split does word-boundary snapping with a ±20% margin, which can produce chunks up to τ* × 1.2 = 72,000 chars. Those exceed τ* and fall through into the recursive branch, splitting again into 20 sub-chunks of ~3k chars each.

The fix is mechanical. Thread plan.depth through _build_phi_code as an explicit budget, and have _Phi(P, d) bottom out on d ≤ 0 or len(P) ≤ τ:

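# Before: recursion stops only when the chunk is short enough.
# leaf, _Split, _Reduce, k and τ are defined elsewhere in the runtime.
def _Phi(P):
    if len(P) <= τ:
        return leaf(P)
    else:
        # An oversized chunk from _Split lands here and is split again.
        return _Reduce([_Phi(c) for c in _Split(P, k)])

# After: d is an explicit depth budget seeded from plan.depth.
def _Phi(P, d):
    if d <= 0 or len(P) <= τ:
        return leaf(P)
    else:
        return _Reduce([_Phi(c, d - 1) for c in _Split(P, k)])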

(Full patch at patches/0001-fix-phi-depth-budget.patch; I am sending it upstream.)
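The effect of the depth budget can be reproduced in a toy model that tracks only chunk lengths. Everything here is an assumption made for illustration: `split_lengths` is a stand-in for _Split that alternates chunks 20% under and 20% over the nominal size, and the leaf counts it yields are properties of this toy, not a reconstruction of the 563-call run.

```python
from typing import List


def split_lengths(n: int, k: int) -> List[int]:
    """Toy _Split: k chunk lengths alternating 20% under and over n // k."""
    base = n // k
    return [base * 4 // 5 if i % 2 == 0 else base * 6 // 5 for i in range(k)]


def leaves_unbudgeted(n: int, k: int, tau: int) -> int:
    """Leaf count when recursion stops only on length."""
    if n <= tau:
        return 1
    return sum(leaves_unbudgeted(c, k, tau) for c in split_lengths(n, k))


def leaves_budgeted(n: int, k: int, tau: int, d: int) -> int:
    """Leaf count when recursion also stops once the depth budget is spent."""
    if d <= 0 or n <= tau:
        return 1
    return sum(leaves_budgeted(c, k, tau, d - 1) for c in split_lengths(n, k))


# The plan from the log line: n=1,200,000, k*=20, τ*=60,000, d=1.
before = leaves_unbudgeted(1_200_000, k=20, tau=60_000)
after = leaves_budgeted(1_200_000, k=20, tau=60_000, d=1)
```

In this toy, ten of the twenty first-level chunks come out at 72,000 chars and get split again, so `before` is 210 leaves against 20 for `after`. The mechanism is the same one described above: oversized chunks quietly buy themselves a second level of recursion.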

The patched run on codeqa-128k:

                  baseline   patched   Δ
calls             563        36        15.6× fewer
wall              1372 s     113 s     12.1× faster
answer            wrong      finds diff(), lists 2/3 gold kwargs + 1 valid

A 15× call reduction was the expected outcome. The answer changing was the surprise. The original run reported λ-RLM on codeqa as "wrong (named wrong functions)." With the bug, every chunk was getting recursively split a second time into ~3,000-char fragments. The leaves were starving, working with too little context to identify which function actually answered the question, so they produced plausible-sounding guesses about whichever fragment they got handed. The Reduce step compounded those guesses into a confidently wrong final answer.

With depth honored, the leaves see 60k-char chunks. That is enough context to identify diff(), parse its keyword arguments, and answer correctly. The patched runtime returns n, direction, method. The gold lists h, method, direction. Score: 2 of 3 gold arguments matched, plus one (n) that is also a valid kwarg of diff() and was absent from the gold only because the gold was hand-curated. By any honest manual grading, that is a correct answer.

The framing has to shift. The cost-bound violation was costing more than wall time. It was hiding the runtime's actual capability. Whatever you take away about λ-RLM's adaptivity-vs-guarantees trade-off has to be re-weighted: the guarantees are stronger than the first cut implied (the bound holds when honored), and the correctness on code-shaped inputs is materially better than the original data showed.

This is the cleanest correction from the rerun. It is also the one I would have caught the first time if I had treated the cost blowup as a warning about correctness rather than as wall-time noise.

Finding 3: Haiku's failure shape is per-task, and priming leaves the gap open

The original Finding 3 said off-the-shelf models waste the RLM affordance. The trace from the first cut was characteristic: Haiku printed context[:500], then context[:2000], then called llm_query("summarize this", context[:5000]), then llm_query("analyze entire text", context) sending the whole document back to itself. On the big oolong sample, the model gave up at iter 3 because the "Question:" marker was buried in the giant prompt and Haiku stopped before navigating to find it.

I re-ran the same configuration on the same three samples. Same Haiku model, same RLM scaffold, same max_iterations=6. One result failed to replicate:

                  run-003 (original)         run-004 (rerun)
oolong-128k       "gave up at iter 3"        "Goney/Albatross"  (10 calls / 49s)
                                              CORRECT
sniah-128k        partial (1 of 9)           refused
                                              ("technical difficulties")
codeqa-128k       wrong (named wrong fns)    gives up at iter 1
                                              ("need to identify the query first")

The Pequod flip is the surprise. Same scaffold, same model name, same sample index, same context length, and the model now navigates to the question, builds a chunking strategy, runs sub-queries, and produces "Goney/Albatross" in 10 calls. The behaviour I called characteristic in the first post stayed away this time.

There are three candidate explanations. HuggingFace's THUDM/long_bench-v2 may have repacked the sample between runs. Anthropic may have improved Haiku 4.5 silently. Or the original "gave up at iter 3" was a single noisy trial. Resolving between these would require rerunning the original pinned versions, which I can do but did not.

The useful move is to drop the uniform-failure framing. Across the three samples in the rerun, off-the-shelf Haiku succeeded once, refused once, and got confused once. The failure shape lives at the question-family level, with each task family pulling Haiku into a different stuck pattern.

The "RL training would fix this" claim has been doing serious load-bearing work in the RLM discourse. The cheap version of testing it is in-context priming. If a system prompt that nudges the model toward the right strategy closes most of the gap, then "needs RL training" was partly a prompt-engineering gap. If the prompt has no effect or hurts, RL training is the genuinely hard test and no cheap substitute exists.

I tried two priming designs. The verbose version was an 80-line preamble with a worked example showing the 3-step protocol: locate question marker, plan k chunks of ~60k each, run filter and leaf, then verify.

                  unprimed                   verbose primed
oolong (Pequod)   CORRECT                    wrong, off-task into whale etymology
sniah             refused                    1 of 9 events
codeqa            gives up at iter 1         window overflow (212k > 200k tokens)

Codeqa overflowed even though the 1.2M-char document lives as a Python variable in the REPL rather than inside the prompt. The primed protocol made the model execute step by step, and the accumulated iteration history eventually pushed total prompt tokens past 200k.

The minimal version was one paragraph with no worked example: "If you don't see a clear question, use context.find('Question:') to locate it before answering."

                  unprimed         minimal primed
oolong (Pequod)   CORRECT          window overflow (201,902 > 200,000)
sniah             refused          0 of 9 events ("no record found")
codeqa            gives up         hallucinated ("derivatives, reflect")
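The lookup that the minimal prompt hints at is small enough to write out. This is a sketch under one assumption not stated in the post: that the question occupies the rest of the line the marker sits on. `locate_question` is a hypothetical helper name.

```python
from typing import Optional


def locate_question(context: str, marker: str = "Question:") -> Optional[str]:
    """Return the text after the first marker, up to the end of that line."""
    start = context.find(marker)
    if start == -1:
        return None
    end = context.find("\n", start)
    if end == -1:
        end = len(context)
    return context[start + len(marker):end].strip()


sample_context = (
    "filler text\n" * 3
    + "Question: What is the first ship the Pequod meets?\n"
    + "more filler\n"
)
found = locate_question(sample_context)
```

The helper costs nothing in model tokens, which is the point of the hint; the overflow came from what the model did afterwards, not from the lookup itself.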

The minimal priming adds maybe 600 chars to the system prompt, a few hundred tokens at most. It pushed the run from somewhere near 199k tokens to 201,902. The implication is sharp: unprimed Haiku-RLM on these samples is operating right at the 200k-token limit. The scaffold has already saturated what fits.

Both priming variants were worse than unprimed overall: more confidently wrong, more cleanly off-task, or pushed past the window entirely. The one exception is verbose priming on sniah, which moved a refusal to 1 of 9 events. Across two distinct prompt designs, the cheap RL proxy is gone.

The honest reading: the experiment leaves the actual RL training question open. An RL-trained model might learn to allocate window better, in which case priming was the wrong test entirely. What the experiment closes is the substitution question. To know whether RL training bridges the gap, somebody has to actually train.

Finding 4: Galanos's locality argument is real, narrowly

The original four-runtime axis had no slot for "the model gets no structural say at all because the task spec already decides it." Galanos's Recursive by Design argues for exactly that position. The decomposition lives in code, derived from the task structure, before any leaf call runs. The model's only job is to answer leaf prompts.

I implemented this as a Plan-First runtime with a hard-coded planner per dataset family. Codeqa splits on Python def/class boundaries. Sniah splits into temporal windows. Oolong falls back to uniform 60k-char chunking with a relevance filter. Each planner inspects only (question, dataset) before any model call; once the plan is fixed, leaves run mechanically.
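For the code planner, splitting on definition boundaries can be sketched with the standard-library `ast` module. This is an illustrative version, not the planner used in the runs: `split_on_def_boundaries` is a hypothetical name, and it assumes the source parses as a single Python module. A document that concatenates several files or contains syntax errors would need per-file handling or a regex fallback.

```python
import ast
from typing import List, Tuple


def split_on_def_boundaries(source: str) -> List[Tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    units: List[Tuple[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:
                # Keep decorators attached to the definition they decorate.
                start = min(d.lineno for d in node.decorator_list)
            # lineno and end_lineno are 1-based and inclusive.
            units.append((node.name, "\n".join(lines[start - 1:node.end_lineno])))
    return units


SAMPLE = '''
import math

def slope(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

class Grid:
    def step(self):
        return 1
'''
units = split_on_def_boundaries(SAMPLE)
names = [name for name, _ in units]
```

Each unit is a whole definition, so a leaf prompt sees a complete signature and body rather than a character window that may cut a function in half.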

The headline: Plan-First wins decisively on codeqa, in both anchor states.

                  helpful Q                 absent Q
RLM               wrong (_zetasum)          wrong (fd)
λ-RLM             partial                   correct (diff)
Plan-First        correct                   correct

Splitting on def boundaries is a much better decomposition than uniform character windows for source code, and the planner has no dependency on the question's surface words. The leaves see entire function definitions and reason about each function structurally. The result holds in both anchor states, which is the strongest possible signal that the locality argument generalizes for code-shaped tasks.

Plan-First fails where I predicted it would fail. On the Pequod task, where the spec is generic ("answer this question about this book"), the planner has no structure to lean on and falls back to uniform coverage. With Haiku reducing across three surviving partials (Samuel Enderby, Bachelor, Rachel), the runtime refused to commit. λ-RLM with similar coverage answered correctly because its reduce template was tuned for QA over partials; the Plan-First reduce was more generic.

The cleanest statement of Galanos's argument, validated narrowly: when the task spec computably encodes structure, precomputed decomposition is the cheap baseline that is hard to beat. When the spec is generic, every runtime in the field fights over the same regime, and the choice between λ-RLM, RLM, and Claude Code falls back on the secondary axes. For LongBench-v2 the property is per-task-family. For Galanos's AEC report generation, the property holds globally because the work is always template-shaped.

Finding 5: the filter is doing more work than I credited

λ-RLM and Plan-First both use a relevance filter as the first move on every chunk. Does this section plausibly contain the answer? One LLM call per chunk, YES/NO with a one-line reason. Same abstract operation across both runtimes, and yet the prompt template radically changes recall.
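A sketch of that filter step with the verdict parsing made explicit. The reply format ("YES: reason" or "NO: reason"), the function names, and the keyword-matching `judge` are assumptions for illustration; the actual prompt templates differ between the two runtimes. The sketch also keeps the dropped chunks and their reasons, which makes the filter's decisions auditable after the run.

```python
from typing import Callable, List, Tuple


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Parse an assumed 'YES: reason' / 'NO: reason' reply."""
    text = reply.strip()
    keep = text.upper().startswith("YES")
    _, _, reason = text.partition(":")
    return keep, reason.strip()


def filter_chunks(
    question: str,
    chunks: List[str],
    judge: Callable[[str, str], str],
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """One judge call per chunk; returns (kept, dropped) as (index, reason)."""
    kept: List[Tuple[int, str]] = []
    dropped: List[Tuple[int, str]] = []
    for index, chunk in enumerate(chunks):
        keep, reason = parse_verdict(judge(question, chunk))
        (kept if keep else dropped).append((index, reason))
    return kept, dropped


# Toy judge standing in for the model call.
toy_chunks = [
    "The Pequod sights a ship called the Albatross.",
    "A chapter on whale anatomy.",
]
kept, dropped = filter_chunks(
    "What is the first ship the Pequod meets?",
    toy_chunks,
    judge=lambda q, c: "YES: names a ship" if "ship" in c else "NO: no ship mentioned",
)
```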

On the Rodriguez task, λ-RLM's filter kept the wrong chunks in both anchor states. The physics paper takes up the majority of the document and looks dense and "important." The diary entries are short and look like noise. The filter, asked whether a chunk contains records relevant to a question about Steven Rodriguez, mostly said NO when it saw a diary entry and YES when it saw a physics section. The runtime then ran the QA prompt on the physics chunks and produced an off-topic summary, twice.

Plan-First's filter on sniah had the opposite failure: too conservative. It dropped 12 of 13 temporal windows on the helpful question, leaving only the window containing the very last Rodriguez entry. The runtime found Nov 29 and missed the other eight events.

Both failures point at a hidden axis the original post folded into "Filter+Map+Reduce" as if Filter were a single primitive operation. Filter is its own engineering problem with its own prompts, and the published λ-RLM defaults have failure modes that depend on the document's compositional structure as much as on its size.

The framing that holds after the rerun: λ-RLM's structural robustness to question-form anchors is paid for with a new sensitivity to document-form anchors. If the irrelevant content in the document looks more "relevant" to the filter than the relevant content (physics dense versus diary sparse, formal prose versus dialogue, structured tables versus narrative), the coverage story collapses. The runtime visits every chunk while never seeing the chunks that matter.

Filter recall is the under-discussed lever. The coverage advantage that makes λ-RLM beat Claude Code on Pequod is coverage of the filter's output, which equals coverage of the document only when the filter has high recall. A precise filter pays off. A generic filter prompt fails in a shape the architecture has to fix at the prompt level. The math above the filter operates on whatever survives, with full trust in the survivors and full blindness to whatever the filter quietly dropped.
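Filter recall is cheap to measure once someone has labeled which chunks hold the gold evidence. The sketch below assumes such labels exist, which they did not in these runs; `filter_recall` and the toy indices are hypothetical.

```python
from typing import Iterable, Set


def filter_recall(kept_indices: Iterable[int], gold_indices: Set[int]) -> float:
    """Fraction of gold-bearing chunks that survived the filter."""
    if not gold_indices:
        raise ValueError("gold_indices must be non-empty")
    return len(set(kept_indices) & gold_indices) / len(gold_indices)


# Toy numbers: the evidence sits in chunks 3 and 7, the filter kept 0 to 3.
recall = filter_recall([0, 1, 2, 3], {3, 7})
```

A recall of 0.5 here means half the evidence never reached a leaf, no matter how many chunks the runtime nominally visited.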

What Prime Intellect is actually arguing

The Prime Intellect post is easy to read as one more piece of RLM hype. The structure underneath is more careful than that.

Prime Intellect's argument (paraphrased):

  premise        →  long-horizon agent work is the next frontier
                    (coding agents over weeks, research agents over months)

  problem        →  current approaches break here:
                    · stuffing context: O(n) cost, quality drops past some length
                    · file-system summarization (CC, Codex): lossy and irreversible
                    · hand-coded folding (AgentFold, ACE): rule-based, no transfer

  proposal       →  let the MODEL manage its own context via REPL + sub-LLMs
                    information stays out of the main window unless pulled in

  contingency    →  this requires RL training; off-the-shelf models won't do it well

  contribution   →  RLMEnv (plug-and-play training environment in `verifiers`)

The reported numbers (with GPT-5-mini, rather than Haiku):

DeepDive (multi-doc research)  :  RLM ≈ 2× reward lift
Oolong (long-doc QA, 1.75M)    :  RLM holds, flat-context baseline = 0
Verbatim-copy                  :  RLM consistently better
Math-python                    :  RLM UNDERPERFORMS (decomposition not useful here)

The math-python loss is the honest part. RLM is for tasks where context management is the bottleneck. A workflow that is just "do this math problem" decomposes into independent sub-LLM calls in a way that adds noise rather than signal.

Their frame is infrastructural rather than empirical. They are saying this is the architecture that will scale to long-horizon agents, and they are shipping the training environment to make it work. The value sits in the post-training curve, not in day-one capability on a frontier API model.

Reading their argument against the rerun: the RLM results (task-shaped Haiku, priming makes things worse) are consistent with their precondition. The patched λ-RLM result is a different research direction. The Plan-First result is the practical baseline they are trying to leapfrog with training, which is the test the priming experiment showed has no cheap substitute.

Where does the recursion live?

The thing the rerun made clearer is that the architectural axis underneath all five runtimes (counting the vanilla long-context call) reduces to a single question. Which actor in the system decides how to recurse over the document?

Five recursion paradigms ordered by how much of the orchestration logic the model gets to choose: vanilla long-context (no recursion), Plan-First (spec-level, task structure decides), λ-RLM (structural, math decides), RLM (programmatic, model writes Python), Claude Code (agentic, tool-call loop). Each makes a different trade-off between guarantees, flexibility, and adaptivity.

The choice of where the recursion lives determines what you can prove (proofs about Plan-First and λ-RLM apply; equivalent proofs about Claude Code do not), what you can train (RLM accepts end-to-end training on context use; Plan-First and λ-RLM accept it only at the leaf), and what you can adapt mid-run (Claude Code can swap strategy on the next tool result; Plan-First runs the plan it computed before the first call).

The trade-offs sit asymmetrically across the axis.

Spec-level recursion (Plan-First) buys cost predictability and quality together, in the regime where the task spec computably encodes structure. The model never has to rediscover the decomposition. Outside that regime, the planner has nothing to plan from, and the runtime collapses into λ-RLM with a worse reduce template.

Structural recursion (λ-RLM) buys guarantees: termination, cost, depth. The trade is that the runtime locks in its plan before the first leaf runs. A clever LLM in the leaves has no channel to redirect the planner; that information stays in the leaves and dies there. Filter recall becomes the secondary failure mode the original post missed.

Programmatic recursion (RLM) buys flexibility, the whole space of arbitrary Python code, in exchange for needing the model to actually be good at writing that code. Off-the-shelf Haiku uses the affordance unevenly across task shapes, priming leaves the gap intact, and RL training stays the open and expensive test.

Agentic recursion (CC) buys adaptivity, in the strict sense that every step responds to the last tool result. The bound is the tool surface itself. CC's vocabulary is grep, Read, and Bash; uniform-coverage scans of the kind λ-RLM ran on the Pequod sample live outside that vocabulary. The runtime is sensitive to anchor quality and commits hard when the anchor lies.

The genuinely open research question is whether programmatic recursion subsumes the other three once you train hard enough. Prime Intellect bets yes. The λ-RLM and Plan-First papers bet no: give up the freedom to get the guarantees. Claude Code bets that good tools plus a strong model win without either.

The rerun left that question open. What it sharpened is the failure shape of each position. On Haiku 4.5, off-the-shelf, with no training, and reading across RLM, λ-RLM, Plan-First, and Claude Code in that order, Pequod produces correct-correct-give-up-wrong; codeqa produces give-up-correct-correct-partial; sniah refuses on two of four and recovers one of nine events on the other two. Each shape responds to a different intervention, and priming fixes none of them.

When to actually use what

For long-context tasks on a current frontier model, the decision tree the experiment supports:

Long-context task on a current frontier model →

  1. Does the task spec computably encode structure?
       (e.g. template + dependency tree; codeqa = find function with property;
        multi-doc report; schema-constrained extraction)
       yes → Plan-First / Galanos-style. Hard-coded planner,
              leaf calls only. Cheapest and hardest to beat
              when this regime applies.

  2. If not, does the question have a surface anchor
     that the runtime's strategy would actually use?
       yes → tool-using agent harness (Claude Code).
              Grep/Read finds the anchor fast.
              Audit first: is the anchor misleading? (Pequod case)
       no → λ-RLM (uniform coverage with filter).
             Audit the filter prompt against the
             document's composition. If irrelevant
             content looks "relevant" to the filter,
             the coverage win evaporates (Rodriguez case).

  3. Multi-stage iterative work → Claude Code.

  4. Training a model on context use → RLM with RLMEnv.
       Priming is the experiment that leaves the gap open;
       two prompt designs both made Haiku strictly worse.
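
The tree above can be written down as a small function, which makes the branch order explicit. This is a minimal sketch, not code from the experiment: the `Task` fields and `choose_runtime` are hypothetical names, and checking branches 4 and 3 before 1 and 2 is my assumption, since the tree itself does not fix an order between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical flags, one per question in the decision tree.
    training_a_model: bool = False        # branch 4
    multi_stage_iterative: bool = False   # branch 3
    spec_encodes_structure: bool = False  # branch 1
    has_usable_anchor: bool = False       # branch 2


def choose_runtime(task: Task) -> tuple[str, str]:
    """Return (runtime, audit note) for a long-context task.

    Assumption: the kind of work (training, multi-stage) is checked
    before the document-shape questions.
    """
    if task.training_a_model:
        return ("RLM with RLMEnv", "priming did not close the gap")
    if task.multi_stage_iterative:
        return ("Claude Code", "")
    if task.spec_encodes_structure:
        return ("Plan-First", "")
    if task.has_usable_anchor:
        return ("Claude Code", "audit: is the anchor misleading?")
    return ("λ-RLM", "audit: filter prompt against document composition")


if __name__ == "__main__":
    print(choose_runtime(Task(has_usable_anchor=True)))
```

The audit note is returned alongside the runtime because two of the branches are only safe after a check; dropping it would hide the Pequod and Rodriguez cases.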

The cost picture, on the three samples after the patches:

            calls   wall      est. cost
CC          64      144s      $0.21
RLM         21      96s       <$0.50
λ-RLM       78      186s      ~$1.50  (post-patch)
Plan-First  148     240s      ~$0.50  (codeqa-heavy)

The λ-RLM cost is now driven by call volume rather than the codeqa pathology. Plan-First's total is dominated by codeqa's 110 filter and leaf calls. Both runtimes are within the same order of magnitude as Claude Code for most tasks. The differences live in failure shape rather than raw cost.
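
Dividing the table through by call count makes the comparison easier to read. The sketch below is plain arithmetic on the rows above. It treats "<$0.50" and "~$1.50" as point values, which is an assumption, so the RLM figure is an upper bound and the others are rough.

```python
# (calls, wall seconds, estimated cost in USD), copied from the cost table.
# Assumption: "<$0.50" and "~$1.50" are used as exact values.
ROWS = {
    "CC": (64, 144, 0.21),
    "RLM": (21, 96, 0.50),
    "λ-RLM": (78, 186, 1.50),
    "Plan-First": (148, 240, 0.50),
}


def per_call(calls: int, wall_s: float, cost_usd: float) -> tuple[float, float]:
    """Return (wall seconds per call, USD per call)."""
    return wall_s / calls, cost_usd / calls


for name, row in ROWS.items():
    seconds, usd = per_call(*row)
    print(f"{name:<11} {seconds:5.2f} s/call  ${usd:.4f}/call")
```

Under those assumptions CC and Plan-First both come out near $0.003 per call and λ-RLM near $0.019. Seconds per call is wall time divided by calls, so it is only a latency figure if the calls ran sequentially.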

The case for typing, training, tools, and spec-driven planning

After the second cut, none of the runtimes is obviously right, and where the recursion lives remains genuinely contested.

The case for typing (λ-RLM). A runtime that provably terminates with bounded cost regardless of what the model does at the leaves is deployable without surprises. Patched, the property is real, and on the Pequod sample the typing pays off. The first cost is that the runtime locks in its plan before the first leaf runs, so a clever LLM in the leaves stays clever inside the leaf and silent everywhere else. The rerun added a second cost the original framing folded away: filter quality matters more than the abstraction implied, and filters tuned for one document composition fail on another.
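
What "bounded cost regardless of what the model does" means can be shown with a toy plan. This is a sketch under stated assumptions, not λ-RLM's combinator chain: it assumes one call per fixed-size chunk followed by a tree reduce, and `plan_call_bound`, `chunk_chars`, and `fanout` are hypothetical names and parameters. The call count is known from the document length alone, before any leaf runs.

```python
import math


def plan_call_bound(n_chars: int, chunk_chars: int, fanout: int) -> dict:
    """Call count for a map-then-tree-reduce plan that is fixed up front.

    Assumes one LLM call per chunk, then reduce calls that each merge
    up to `fanout` partial results until a single result remains.
    """
    if n_chars <= 0 or chunk_chars <= 0 or fanout < 2:
        raise ValueError("need n_chars > 0, chunk_chars > 0, fanout >= 2")
    leaves = math.ceil(n_chars / chunk_chars)
    reduce_calls, depth, width = 0, 0, leaves
    # Each level shrinks the width by at least half, so the loop terminates.
    while width > 1:
        width = math.ceil(width / fanout)
        reduce_calls += width
        depth += 1
    return {
        "leaves": leaves,
        "reduce_calls": reduce_calls,
        "depth": depth,
        "total": leaves + reduce_calls,
    }


if __name__ == "__main__":
    print(plan_call_bound(1_200_000, 20_000, 8))
```

For a 1.2-million-character document with hypothetical 20,000-character chunks and a fanout of 8, this gives 60 leaf calls, 9 reduce calls, and depth 2. Nothing a leaf returns can change those numbers, which is both the guarantee and the reason a leaf cannot redirect the plan.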

The case for training (RLM). A model trained to use a Python REPL for context management as fluently as Claude Code uses Read and Grep has a richer affordance than any fixed tool palette offers. The model decides each step, and the steps include arbitrary code. Untrained models use the affordance unevenly, with the failure shape varying per task family, and priming leaves the gap intact. The bet is whether RL training on RLMEnv produces something that generalizes past Claude Code's strategy or just converges on it.

The case for tools (Claude Code). The model is already trained for tool use, the tools are debugged in production, and its cost-versus-correctness trade-off on tasks with usable anchors is currently best-in-class. The runtime is bounded by what the tools express, and sensitive to anchor quality; a misleading anchor commits Claude Code to a wrong answer in 24 grep-and-read calls.

The case for spec-driven planning (Plan-First). When the task spec computably encodes structure, precomputing decomposition delivers cost predictability and correctness together. The bet is Galanos's bet, that the interesting long-context tasks are template-shaped or schema-shaped or otherwise structurally legible to a planner. When they are, this baseline is hard to beat. When they are not, the choice falls back to one of the other three.

The thing I want to see, and currently lack the compute for, is what an RL-trained RLM model actually does on the Pequod question. One answer: it learns to grep first, then read targeted offsets, converging on Claude Code's strategy from the inside. The other answer: it learns recursive sub-LLM call patterns that live outside the stateless-tools vocabulary, opening a region of strategy space only the RLM affordance can reach. The first answer makes Claude Code the asymptote that programmatic recursion approaches. The second answer makes RLM the architecture that absorbs all four positions on the axis.

Prime Intellect's bet is the second answer. The rerun killed the cheap proxy for testing it. The honest version of the bet is still falsifiable, and the question over the next year is whether anyone actually trains the model that decides between the two answers. The locality argument (Plan-First) is the cheap baseline to ship while waiting. The filter recall problem is the engineering problem to solve underneath both.

The control question underneath all of this is the one the post's title points at: long-context performance is a question about who decides what gets attended to. The architectures are five answers to that question, the failure shapes are five different costs of getting the answer slightly wrong, and the next year of work is whether one of those answers eats the others.

Files

Code, traces, patches, and the full reconciled writeup live in ~/dev/experiments/lrlm-vs-cc/:

results/run-003-largest/         ← original three-runtime experiment
results/run-004-mods/            ← rerun with patches, Plan-First, anchor ablation, priming
  COMPARISON.md                  ← full reconciled writeup
  mod1-anchor/                   ← anchor-presence ablation, 18 trials
  mod2-smoke/                    ← _Phi patch verification
  mod3-planfirst/                ← Plan-First runtime, 3 samples
  mod4-priming/                  ← unprimed + verbose primed RLM
  mod4-priming-minimal/          ← minimal priming variant
patches/
  0001-fix-phi-depth-budget.patch  ← upstream-able λ-RLM fix

Two scripts did the original work. extract_sample.py pulls a sample from the lambda-rlm benchmark loaders and writes context, question, and gold to disk. run_with_traces.py instruments AnthropicClient to log every leaf call. The rerun added a harness/ package (budget.py, tracer.py, plan_first.py, rlm_priming.py, anchor_pairs.py) and four mod runners. Total Haiku spend for the rerun: $6.22 across 31 trials.
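
The instrumentation idea is small enough to sketch. The code below is not run_with_traces.py and does not use AnthropicClient's real interface, which is not shown here. It assumes the client can be treated as a plain prompt-to-completion callable, and `traced` is a hypothetical helper name.

```python
import json
import time
from typing import Callable


def traced(call: Callable[[str], str], log_path: str) -> Callable[[str], str]:
    """Wrap a prompt -> completion callable so each call appends one JSON line."""
    counter = {"n": 0}

    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        reply = call(prompt)
        counter["n"] += 1
        record = {
            "call": counter["n"],
            "seconds": round(time.perf_counter() - start, 3),
            "prompt_chars": len(prompt),
            "reply_chars": len(reply),
        }
        # Append mode keeps earlier records if the run is interrupted.
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return reply

    return wrapper
```

Wrapping at the call boundary keeps the runtime under test unchanged, so call counts like the ones in the cost table come from the log rather than from the runtime's own reporting.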

The Claude Code invocation worth remembering for any benchmark on this stack:

claude --bare --print --output-format stream-json \
       --model claude-haiku-4-5-20251001 \
       --no-session-persistence --max-budget-usd 5 \
       --add-dir=. --allowed-tools=Read,Grep,Glob,Bash \
       "Read question.txt and answer it from context.txt"

--bare strips hooks, skills, MCP, and CLAUDE.md auto-discovery so the comparison stays clean of local environment, and forces ANTHROPIC_API_KEY auth.
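
Counting tool calls from the stream-json output takes a few lines. This sketch assumes each output line is a JSON object and that assistant events carry a `message.content` list whose `tool_use` blocks have a `name` field. That layout is an assumption to verify against the CLI version in use, and `count_tool_calls` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_calls(lines: Iterable[str]) -> Counter:
    """Count tool_use blocks per tool name in newline-delimited JSON events.

    Assumed event shape (not confirmed by the post):
    {"type": "assistant", "message": {"content": [{"type": "tool_use", "name": ...}]}}
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue
        content = event.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue  # plain-text content carries no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts
```

Fed the saved output of the command above, it gives a per-tool breakdown of the kind behind the grep-and-read counts quoted in this post.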
