Long context is a control problem
Three runtimes — RLM, λ-RLM, Claude Code — tested head-to-head on the same long-context tasks, all running on Haiku 4.5. The architectural question underneath them all is where the recursion lives. λ-RLM was the only runtime to find the first ship the Pequod meets; its cost-bound math was broken in the implementation; RLM without RL training wastes its own affordance. What Prime Intellect is actually betting on, and the open research question.
Almost every approach to long-context language modeling looks, from far enough away, like the same question asked four different ways: where do we put the recursion? Inside the model's window (just stuff it all in). Inside an LLM-driven loop (let the model write Python that decides what to read next). Inside a fixed combinator chain (let the math decide). Inside a tool harness (let an agent navigate). Each answer is a different bet about what the LLM is good at and what should be taken away from it.
I spent a day testing three of those bets head-to-head — RLM, λ-RLM, and Claude Code — on the same model and the same long-context tasks. The point wasn't to crown a winner. It was to see what each runtime is, operationally, when you actually run it. This post is that field log, with the traces and the architectural reading underneath them.
The Pequod question
I gave each runtime the same question — "What is the first ship the Pequod meets?" — and the same 1.2-million-character copy of Moby-Dick on Claude Haiku 4.5. They had the document, the same compute budget, the same model. The gold answer is Albatross.
RLM         → "I need to first understand what the actual question is..." (gave up at iter 3)
Claude Code → "The Jungfrau" (wrong — that's Chapter 71)
λ-RLM       → "the Goney (or Albatross), followed by the Town-Ho..." (CORRECT — Goney is the ship's name on first introduction in Ch. 52)
Three runtimes, same model, same problem. One got it right. One got it confidently wrong. One gave up. The interesting question isn't which one won — it's why this distribution was the one we got, and what it tells us about where to put the recursion.
The three bets
A confusing thing about the recent literature is that several quite different ideas all get called "recursive language models" or get talked about in the same breath. They are not the same.
RLM — Alex Zhang's October 2025 blog post and the architecture Prime Intellect is championing as the paradigm of 2026. The model gets a Python REPL. The long input lives as a variable in that REPL. The model writes code — peek, slice, grep, summarize, recurse via llm_query — and decides decomposition strategy at runtime. The bet: give the model the affordance, then train it via RL to use the affordance well.
λ-RLM — a different paper, different authors. The decomposition is a fixed combinator chain Split → Filter → Map → Reduce. The plan (k*, d, τ*) is computed by math before any leaf LLM call runs, from a lookup table on document size and task type. The LLM only ever sees one of two prompt templates: the relevance filter, or the leaf QA template. The bet: typed structural control beats LLM-generated control because you can prove things about it — termination, cost bounds, depth.
Tool-using agent harness — Claude Code, Codex. Model + small system prompt + Read / Grep / Bash. The long context lives on disk. The model navigates rather than ingests. The bet: good tools plus a frontier model give you practical long-context performance today, without inventing a new scaffolding.
These are different research directions, not different names for the same thing. They sit on a spectrum that, despite the surface diversity, is really just one axis: how much of the orchestration logic is allowed to be the model's choice.
λ-RLM is the most constrained — the model never decides anything structural; it just answers leaf prompts. Claude Code is the most adaptive — every step is a model decision. RLM is in between: the model has freedom, but inside a single recursive scaffold, and the bet is that you teach it to use that freedom well via RL.
These are the three bets the field is currently making. The experiment was a test of all three at once.
What I ran
- Model: Claude Haiku 4.5, leaf model for all three runtimes. No RL training on any of them — they all use the off-the-shelf API model.
- Datasets: the largest sample (the 128k bin) from each of `oolong` (single-doc QA), `sniah` (sequential needle-in-haystack), and `codeqa` (code understanding) from LongBench-v2.
- λ-RLM context window: `K = 750,000` chars, close to Haiku's actual 200k-token window. (An earlier run with the bench's default `K = 50,000` artificially fragmented even small documents — irrelevant to the runtime's real behavior.)
- Claude Code: `claude --bare --print --output-format stream-json --allowed-tools=Read,Grep,Glob,Bash`. The `--bare` flag strips hooks, skills, MCP, and `CLAUDE.md` auto-discovery so we measure the runtime alone.
- Per-call traces: I wrapped Anthropic's Python client with a `TracingAnthropicClient` that appends every leaf call to a JSONL (a sketch of the wrapper follows this list). The bench's default logger writes summary stats only, and λ-RLM's `logger` field is declared but unused — I checked.
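Here's the shape of that wrapper — a minimal sketch, with illustrative field names rather than the exact ones `run_with_traces.py` logs:

```python
import json
import time

from anthropic import Anthropic

class TracingAnthropicClient:
    """Delegate to a real Anthropic client; append one JSONL record per leaf call."""

    def __init__(self, trace_path: str, **client_kwargs):
        self._inner = Anthropic(**client_kwargs)
        self._trace_path = trace_path

    @property
    def messages(self):
        # Mirror the SDK's client.messages.create(...) shape so the
        # wrapper can be dropped in wherever the bench expects a client.
        return self

    def create(self, **kwargs):
        t0 = time.time()
        resp = self._inner.messages.create(**kwargs)
        record = {
            "ts": t0,
            "wall_s": round(time.time() - t0, 3),
            "model": kwargs.get("model"),
            "prompt_chars": sum(
                len(str(m.get("content", ""))) for m in kwargs.get("messages", [])
            ),
            "output_chars": sum(
                len(b.text) for b in resp.content if b.type == "text"
            ),
        }
        with open(self._trace_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return resp
```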
The three samples:
| Dataset | Doc chars | Question | Gold |
|---|---|---|---|
| oolong/128k | 1,200,000 | "What is the first ship the Pequod meets?" | Albatross |
| sniah/128k | 371,686 | "List what Steven Rodriguez did during 11, 2004 in chronological order" | 9 events spanning Nov 7–29 |
| codeqa/128k | 1,200,000 | "Which keyword arguments are recognized by the function that calculates the derivative?" | h, method, direction |
`oolong` and `codeqa` (≈300k tokens each) exceed Haiku's 200k window; `sniah` fits.
Headline numbers
Three runtimes × three samples = nine trials. Wall time and call counts:
|  | oolong-128k | sniah-128k | codeqa-128k |
|---|---|---|---|
| RLM | 3 calls / 16s | 5 calls / 25s | 21 calls / 63s |
| λ-RLM | 40 calls / 67s | 2 calls / 6s | 563 calls / 1372s (≈23 min) |
| Claude Code | 24 calls / 59s | 5 calls / 12s | 35 calls / 73s |

Claude Code's three runs cost $0.21 total.
Substantive correctness, judged manually because the bench's token-F1 metric is broken for free-form output (gold is 1–2 sentences, runtime outputs are multi-paragraph, F1≈0 even when right):
|  | oolong | sniah | codeqa |
|---|---|---|---|
| RLM | gave up | partial (1 of 9) | wrong (named wrong functions) |
| λ-RLM | CORRECT | refused | wrong (after 23 min) |
| Claude Code | wrong (Jungfrau) | partial (1 of 9) | partial (2 of 3 args) |
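The failure mode is mechanical. A token-level F1 along these lines — my sketch, not necessarily the bench's exact implementation — hands a correct multi-paragraph answer a near-zero score against a short gold:

```python
def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# gold = "Albatross", pred = a 300-word answer that names it once:
# precision ≈ 1/300, recall = 1, so F1 ≈ 0.007 even though the answer is right.
```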
Three things in this picture were not what I expected.
Finding 1 — coverage beats search when the question has no anchor
The Pequod's first encounter is Chapter 52, "The Albatross." But the chapter title is just "The Albatross" — not "The Pequod meets the Albatross". The much later Chapter 71 is titled "The Pequod Meets the Jungfrau."
So if you grep for "Pequod meets" — which is what Claude Code did — the first hit is the Jungfrau. You commit confidently to a wrong answer that grep believed in.
Here's what each runtime actually did. Real traces.
Claude Code: 24 tool calls, 59s, wrong
1. Read("question.txt")
2. Read("context.txt") → blocked: file > 256KB
3. Bash("grep -n 'Pequod meets' context.txt")
→ hit: "Chapter 71: The Pequod Meets the Jungfrau"
4. Read offset=16100 limit=50 → reads Chapter 71 region
5-23. ... 19 more grep+read variations trying to disambiguate ...
24. Final answer: "The Jungfrau"
The grep strategy assumes the question's keywords point at the answer's location. The Pequod question explicitly violates that assumption — and there's no in-loop signal that would tell the model "your search anchor is misleading." CC ran deep down a wrong path and committed.
RLM: 3 calls, 16s, gave up
The model is given a Python REPL where the document lives as a variable, `context`. It can call `llm_query(prompt)` to recurse to itself.
# iter 1, code block 1
print(f"Context length: {len(context)}") # → 1,200,072
print(context[:500]) # → preface text
# iter 1, code block 3 — model decides to ask itself
summary = llm_query(f"What is this context about?\n\n{context[:5000]}")
# iter 1's final response:
"I need to first understand what the actual question or task is that I need
to complete. I've examined the context which contains 1,200,072 characters,
but the specific query or prompt that I need to answer is not clear..."
The agent loop has a 6-iteration cap. Haiku gave up at iter 3 because the "Question:" marker was buried in the giant prompt and Haiku didn't think to navigate to it. This is what off-the-shelf models do with this scaffolding when they haven't been trained on it. More on this in Finding 3.
λ-RLM: 40 calls, 67s, correct
λ-RLM has no agent loop. The runtime's entire control flow is this 9-line Python function (real code, generated as a string and exec'd in the REPL):
def _Phi(P):
if len(P) <= 60000: # base case: chunk fits, leaf call
return llm_query(template.format(text=P, query=query))
else:
_raw = _Split(P, 20) # split into 20 chunks
_pairs = [(_raw[i], _Peek(_raw[i], 0, 6000)) # peek at each chunk's start
for i in range(len(_raw))]
_chunks = _FilterRelevant(query, _pairs) # 1 LLM call per chunk: YES/NO
return _Reduce([_Phi(c) for c in _chunks]) # recurse on survivors, then merge
What the planner computed before any leaf call ran (logged automatically):
[λ-RLM] n=1,200,000 chars | task=qa
[λ-RLM] k*=20 τ*=60000 d=1 ⊕=select_relevant
Translation: split into 20 chunks of ~60k chars, run Filter+Map+Reduce, depth 1.
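The plan arithmetic is simple enough to reconstruct. This is my back-of-envelope version — the actual planner reads its values from a lookup table keyed on document size and task type:

```python
import math

n = 1_200_000        # document chars
tau_star = 60_000    # leaf budget: the largest chunk one leaf call should see

k_star = math.ceil(n / tau_star)               # 20 chunks per split
d = math.ceil(math.log(n / tau_star, k_star))  # 1 level of recursion suffices

print(k_star, tau_star, d)  # 20 60000 1
```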
Only chunk 12 — the one containing chapters 51–55, where the Goney/Albatross is named — passed the filter. The Map step ran on it (one call, 6736-char prompt → 1649-char answer), the Reduce got a single survivor and short-circuited. Final answer:
"the Goney (or Albatross), followed by the Town-Ho, the Jeroboam, the Rosebud, the Enderby..."
This is the case the λ-RLM paper is selling, and it reproduced cleanly: when the answer's location has no keyword anchor in the question, uniform coverage of the document beats targeted search. The Map/Reduce structure visits every chunk, can't be misled by a wrong chapter title, and the relevance filter culls the noise.
There's a real concept hiding here, worth naming. Most real long-context QA systems implicitly assume the question contains a navigation hint — that what you're looking for can be located by searching for its surface form. The Pequod question is a deliberate example of a question that violates that assumption. For tasks of that shape, the runtime that can't be misled by a wrong anchor is the runtime that wins. λ-RLM achieves that by giving up adaptivity. Claude Code can't because adaptivity is its whole strategy. The trade-off is genuine.
Finding 2 — the formal cost bound is broken in the implementation
Same K=750k, same 1.2M-char document, two different datasets:
oolong/128k (Moby-Dick prose) → 40 calls / 67s
codeqa/128k (Python source) → 563 calls / 1372s (≈23 min)
The codeqa run took 14× longer than the oolong run despite identical configuration and identical document size. The math says they should be the same. The planner output for codeqa was identical to oolong:
[λ-RLM] n=1,200,000 chars | task=qa | k*=20 τ*=60000 d=1
Same plan, dramatically different execution. So what broke?
I went into the planner code. The bug is in _Phi's recursion structure:
The recursion bottoms out on `len(P) ≤ τ*`, not on a depth counter. `_Split` does word-boundary snapping with a ±20% margin, which can produce chunks up to τ* × 1.2 = 72,000 chars. Those exceed τ* and fall through into the recursive branch, splitting again into 20 sub-chunks of ~3.6k chars each.
For oolong (Moby-Dick prose, lots of natural word boundaries), most chunks ended up at or under τ* after snapping. d=1 in practice, ~40 calls, ~67s.
For codeqa (Python source with long lines and few good break points), more chunks ended up over τ*. From the trace, 257 leaf calls had average prompt size 4,580 chars — way smaller than the planned 60k. Effective d=2 for many branches. Total: 1 task-detect + 289 filter + 257 leaf + 16 reduce = 563.
This contradicts the paper's closed-form cost analysis. The bound T(n) ≤ k*^d × T(K) + d × C_⊕(k*) assumes d is honored. The implementation lets chunk-size at runtime override d, so the formal guarantee in the paper is not actually a guarantee in this codebase.
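Plugging this run's plan in makes the gap concrete: with k* = 20 and d = 1 the bound reads T(n) ≤ 20 × T(K) + C_⊕(20) — on the order of twenty leaf calls plus one reduce pass. The oolong run's 40 calls sits inside that envelope; codeqa's 563 is an order of magnitude outside it.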
A real fix is small — either truncate to τ* instead of re-splitting, or pass an explicit depth budget into _Phi and bottom out on it. It's the kind of thing that would matter a lot in production and is invisible from the paper. Worth flagging because the formal guarantee is the entire reason to choose this runtime over a tool-using agent. If the bound doesn't hold, you're paying for typing without getting the predictability that justified it.
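For concreteness, a minimal sketch of the depth-budget variant — my reconstruction, not the repo's code, reusing the names and constants from the generated `_Phi` above:

```python
def _Phi(P, depth=1):  # depth starts at the planner's d
    if len(P) <= 60000 or depth == 0:
        # Bottom out on the depth budget even when word-boundary snapping
        # left the chunk over τ*; truncating the oversized leaf (rather
        # than re-splitting it) is what keeps the cost bound honest.
        return llm_query(template.format(text=P[:60000], query=query))
    _raw = _Split(P, 20)
    _pairs = [(c, _Peek(c, 0, 6000)) for c in _raw]
    _chunks = _FilterRelevant(query, _pairs)
    return _Reduce([_Phi(c, depth - 1) for c in _chunks])
```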
Finding 3 — without RL training, the affordance is wasted
I showed Haiku's iter-1 output above. On the small (8k bin) sample where it produced an answer, the strategy was telling. Across two iterations on a 163k-char document:
iter 1: print(context[:500]) ← tool use
iter 1: print(context[:2000]) ← tool use
iter 1: llm_query("summarize this", context[:5000]) ← sub-LLM call
iter 1: print(context[80000:82000]) ← random middle peek
iter 1: llm_query("analyze entire text", context) ← sends WHOLE 163k chars to itself
iter 2: context.find("Intelligent Courts") ← finds position
iter 2: llm_query("extract section", context) ← whole doc AGAIN
iter 2: llm_query("answer the question", context) ← whole doc a THIRD time
The Python REPL is technically a tool-use surface — the model can grep, slice, navigate. What Haiku actually did was use the REPL to format prompts that send the entire document to itself. Three times. The find() calls in between were ornamental. The big sample was even worse — Haiku gave up at iter 3 without producing an answer, because the "Question:" marker was buried in a 1.2M-char system prompt and nothing in the model's training pushed it to navigate to find the question first.
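For contrast, here's roughly what fluent use of the same affordance could look like on the Pequod task — my reconstruction, produced by no model in this experiment, using the REPL-provided `context` and `llm_query`. Note that it's essentially λ-RLM's coverage strategy written as RLM code:

```python
# 1. Navigate to the question instead of assuming it sits at the top.
q_pos = context.find("Question:")
question = context[q_pos:q_pos + 300]

# 2. Uniform coverage with small peeks — never ship the whole document.
CHUNK = 60_000
survivors = []
for start in range(0, len(context), CHUNK):
    peek = context[start:start + 6_000]
    verdict = llm_query(
        "Does this passage describe the Pequod meeting another ship? "
        f"Answer YES or NO.\n\n{peek}"
    )
    if verdict.strip().upper().startswith("YES"):
        survivors.append(start)

# 3. Answer from the earliest surviving window only.
first = survivors[0]
answer = llm_query(f"{question}\n\n{context[first:first + CHUNK]}")
```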
This isn't a refutation of the RLM idea. It's exactly the regime Prime Intellect explicitly says doesn't work yet:
we acknowledge current API-called models underutilize the scaffolding [...] the true potential of RLM and context folding will be unleashed after being trained via RL.
This connects to a broader picture. The empirical phenomenon that justifies the whole "scaffolding for long context" research direction is context rot — the lost-in-the-middle effect where a model's effective recall degrades nonlinearly as the context grows, especially in the middle of the input. If you don't believe in context rot, just stuffing the document in is fine. If you do, you need some mechanism for keeping the working window small. RLM, λ-RLM, and CC are three different ways to get the same property: most of the document never sits inside the model's working window at any given step.
The other piece is the prompt-as-program framing: instead of treating the prompt as a fixed string, treat it as a Python variable that the model itself can manipulate. This is the philosophical move that makes RLM possible. It also implies: the model needs to know how to manipulate the variable. Off-the-shelf Haiku doesn't, and our run shows what happens when it doesn't. Prime Intellect's bet is that an RL-trained model will. We can't test that on API Haiku.
What Prime Intellect is actually arguing
Their post is easy to read superficially as just another "RLM is great" hype piece. It isn't. The argument has structure.
Prime Intellect's argument (paraphrased):
premise → long-horizon agent work is the next frontier
(coding agents over weeks, research agents over months)
problem → current approaches break here:
· stuffing context: O(n) cost, quality drops past some length
· file-system summarization (CC, Codex): lossy and irreversible
· hand-coded folding (AgentFold, ACE): rule-based, no transfer
proposal → let the MODEL manage its own context via REPL + sub-LLMs
information stays out of the main window unless pulled in
contingency → this requires RL training; off-the-shelf models won't do it well
contribution → RLMEnv (plug-and-play training environment in `verifiers`)
Their reported numbers (with GPT-5-mini, not Haiku):
DeepDive (multi-doc research) : RLM ≈ 2× reward lift
Oolong (long-doc QA, 1.75M) : RLM holds, flat-context baseline = 0
Verbatim-copy : RLM consistently better
Math-python : RLM UNDERPERFORMS (decomposition not useful here)
The math-python loss is the honest part. This approach is for tasks where context management is the bottleneck. It's not a panacea. A workflow that's just "do this math problem" doesn't decompose into independent sub-LLM calls in a useful way.
The strategic frame, then, is infrastructural rather than empirical. Prime Intellect is not saying "RLM works today on a frontier API model." They're saying "this is the architecture that will scale to long-horizon agents, and here is the training environment to make it work." The value lives in the post-training curve, not the day-one capability.
Reading their argument against my experiment: my RLM result (Haiku gave up on big oolong) is consistent with their precondition; my λ-RLM result (it solved the Pequod question, but its cost is unpredictable in practice) is a different research direction with a different bug; my Claude Code result is the practical baseline they're trying to leapfrog.
Where does the recursion live?
The thing I keep coming back to, after looking at all four runtimes (counting the vanilla long-context call), is that the architectural axis underneath them all is one question: which actor in the system gets to decide how to recurse over the document?
This isn't a labels exercise. The choice of where the recursion lives determines what you can prove (things you can prove about λ-RLM you can't prove about Claude Code), what you can train (RLM can be trained end-to-end on context use; λ-RLM can't, the math is fixed), and what you can adapt (Claude Code can change strategy mid-run; λ-RLM cannot).
The trade-offs aren't symmetric:
- Structural recursion buys guarantees — termination, cost, depth — at the cost of not reacting to what the leaves return. λ-RLM is doomed on tasks where the right strategy depends on something only the leaves can see.
- Programmatic recursion buys flexibility — the model can write arbitrary code — at the cost of needing the model to be good at writing that code. Off-the-shelf models aren't.
- Agentic recursion buys adaptivity — every step responds to the last tool result — at the cost of being bounded by the tools you give. CC can grep but can't run a uniform-coverage scan the way λ-RLM did on the Pequod question.
The genuinely open research question is whether programmatic recursion can subsume the other two if you train hard enough. Prime Intellect's bet is yes. The λ-RLM paper's bet is no — give up the model's freedom to get guarantees. Claude Code's bet is that you don't need either, you just need good tools.
We don't yet know who's right. We do know that on Haiku 4.5, off-the-shelf, the Pequod question went 1-for-3.
When to actually use what
For long-context tasks on a current frontier model, here's the heuristic the experiment supports: if the question carries a reliable surface-form anchor — a name, an identifier, an error string — a tool-using agent like Claude Code is the fastest and cheapest path. If the answer's location has no keyword anchor in the question, you need uniform coverage, which today means a λ-RLM-style Filter+Map+Reduce pass and the cost that comes with it. And an off-the-shelf model inside an RLM scaffold buys you nothing yet — that bet is waiting on RL training.
Cost-wise, on the three samples I ran:
|  | calls | wall | est. cost |
|---|---|---|---|
| Claude Code | 64 | 144s | $0.21 |
| RLM | 29 | 104s | <$0.50 |
| λ-RLM | 605 | 1444s | ≈$15 (95% spent on the codeqa runaway) |
The λ-RLM cost is dominated by the codeqa pathology — fix the _Phi recursion bug and it'd drop ~10×. Even then, λ-RLM is the most expensive of the three for tasks it doesn't structurally fit.
The case for typing, the case for training
None of the three runtimes is obviously right, and where the recursion lives is genuinely contested.
The case for typing (λ-RLM): if your runtime provably terminates with bounded cost regardless of what the model does at the leaves, you have something deployable without surprises. The property is real, and on the Pequod sample it pays off. The cost is that the runtime can't react to the document — a clever LLM in the leaves can't tell the planner "this whole question is about chapter 52, skip the rest." That channel of information is thrown away by construction.
The case for training (RLM): if you can train a model to use a Python REPL for context management as fluently as Claude Code uses Read and Grep, you get a richer affordance than any tool palette. The model decides each step, and the steps include arbitrary code. The trouble is the affordance is wasted on untrained models. The bet is whether RL training on RLMEnv produces something that generalizes past Claude Code's strategy or just converges on it.
The case for tools (Claude Code): the model is already trained for tool use, the tools are debugged, and the Pareto frontier of cost-vs-correctness is currently best-in-class. The cost is being bounded by what the tools express. Read and Grep can't do uniform coverage. But Bash can run Python — and on codeqa, CC wrote four Python scripts via Bash to do structured parsing the model couldn't do via grep. The agent harness can simulate parts of the RLM scaffold opportunistically. Whether that closes the gap is itself an open question.
The thing I'd most like to see, and don't yet have the compute for, is what an RL-trained RLM model actually does on the Pequod question. Does it learn to grep first, then read targeted offsets — converging on Claude Code's strategy from the inside? Or does it learn something cleverer: recursive sub-LLM call patterns that stateless tools can't express? The first answer means Claude Code is an asymptote that programmatic recursion approaches but doesn't beat. The second means there's a region of strategy space only the RLM affordance can reach.
Prime Intellect's bet is the second answer. It's a real bet, and it's falsifiable. The interesting work over the next year is whether they can make it pay off.
Files
The full experiment, traces, and writeup live in ~/dev/experiments/lrlm-vs-cc/:
results/run-003-largest/
COMPARISON.md ← full writeup
SUMMARY.json ← machine-readable results
oolong_128k/ (and sniah_128k/, codeqa_128k/)
bench.log ← stdout from the RLM bench
cc_trace.jsonl ← Claude Code stream-json
summary.json ← per-trial outputs
traces/
rlm_idx12_128k.jsonl ← every leaf call from RLM
lambda_rlm_idx12_128k.jsonl ← every leaf call from λ-RLM
rlm_iterations/ ← per-iteration RLMLogger output
workspace/ ← the CC sandbox
gold.txt
Two scripts did the work: `extract_sample.py` pulls a sample from the lambda-rlm benchmark loaders and writes context+question+gold to disk; `run_with_traces.py` instruments `AnthropicClient` to log every leaf call and runs RLM and λ-RLM with `K=750000`.
The Claude Code invocation worth remembering for any benchmark on this stack:
claude --bare --print --output-format stream-json \
--model claude-haiku-4-5-20251001 \
--no-session-persistence --max-budget-usd 5 \
--add-dir=. --allowed-tools=Read,Grep,Glob,Bash \
"Read question.txt and answer it from context.txt"
`--bare` strips hooks, skills, MCP, and `CLAUDE.md` auto-discovery so the comparison isn't contaminated by the local environment, and forces `ANTHROPIC_API_KEY` auth.