Proto-Continual Learning

2026-04-01 · ai · continual-learning · investing

An exploration of how vertical AI companies build primitive learning loops around foundation models, why the ability to verify outputs cheaply and accurately determines which companies survive model improvement, and what that means for code, legal, healthcare, and customer support verticals.

Part I

Why This Works at All

Section 1

The Compositionality Miracle

Deep learning scaled for a single reason that is easy to state and hard to internalize: representations compose. A convolutional filter that detects edges combines with one that detects corners to recognize a face, and that face-recognizer combines with a pose-estimator to understand a scene. At each layer, the system recombines what it already knows into something it was never explicitly taught. This is not retrieval. It is synthesis.1

The same compositionality operates in language models, and its implications are stranger than most people realize. A model trained on a large corpus of code does not merely learn to complete Python functions. It exhibits improved reasoning on tasks that have nothing to do with programming — logic puzzles, mathematical proofs, causal inference.2 The code did not teach the model how to reason about philosophy. But internalizing the structure of code — conditionals, recursion, type constraints, invariant-preserving transformations — created neural representations that generalize to reasoning writ large.

This is the compositionality miracle, and it is the single fact that makes foundation models the center of the current wave. They are the only substrate where compositional learning happens automatically. No one designed a "transfer reasoning from code to philosophy" module. The architecture discovered it.

"Consider how neural pretrained knowledge of programming can assist with general, non-code reasoning, while retrieving quality code snippets would hardly exhibit the same effect."
— Ilija Lichkovski, "Defining Continual Learning" (April 2026)

Contrast this with retrieval. A RAG system that fetches the ten most relevant code snippets from a vector store and inserts them into the context window gives the model more information for that specific query. But the model's next query starts from the same baseline. No compounding. No transfer. The retrieved snippets do not make the model smarter at anything except the immediate task, and even that benefit degrades as the knowledge store grows and retrieval becomes noisier.

Parametric knowledge — knowledge encoded in the weights — is a fundamentally different asset class. It changes the intelligence per forward pass. A model that has internalized accounting standards does not need to retrieve ASC 606 every time a revenue recognition question appears; it reasons about revenue the way a trained accountant does, fluidly and automatically, recombining that knowledge with whatever else the query demands. The compounding potential is not incremental. It is exponential, for the same reason that a human who understands calculus learns physics faster than one who looks up derivatives from a table.

Figure 1 — The Learning Stack

The foundation model is the only layer where knowledge compounds. Everything outside it — harness, evals, domain outcomes — adds information per retrieval but not intelligence per forward pass.

This is the key insight that everything else in this essay builds on: parametric knowledge has compounding potential. Retrieved knowledge does not. The entire structure of the AI industry — who survives, who gets absorbed, where moats form, where they dissolve — follows from this single asymmetry.

Section 2

The Stability-Plasticity Tradeoff

If parametric knowledge is the compounding asset, then the question becomes: how do you add to it after deployment? How do you teach a model to handle medical billing codes without destroying its ability to write Python? This problem — continual learning — has haunted the field since McCloskey and Cohen demonstrated catastrophic interference in 1989:3 train on task A, then train on task B, and the model forgets A. Thirty-seven years later, it remains unsolved.

What would a genuine solution require? We can distill it to a tension between five properties that any continually learning system needs but no existing system achieves simultaneously:

Preservation: Learning new things must not break old ones. A legal AI that masters regulatory compliance must not lose its contract analysis capability in the process.
Sequential Ingestion: Data arrives as a stream, not a batch. You cannot pause the world to retrain. The system must learn from new information as it arrives.
Distribution Tolerance: New data will look different from old data, which is precisely why you need to learn it. A system that only handles near-identical distributions has not solved anything.
Efficiency: You cannot replay trillions of pretraining tokens every time you need to incorporate yesterday's conversations. Continual learning must learn more from less.
Compositionality: The deepest requirement. Skills learned at different times must combine. A model trained on A and B, then later on C and D, must generalize to A-versus-C, a pair never seen together. Without this, you have memorization. With it, you have cumulative intelligence.
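The compositionality requirement can be made concrete as an evaluation split. What follows is a minimal sketch, not a benchmark from the literature: the function name `split_pairs` and the session layout are illustrative assumptions. It separates task pairs that were trained together from pairs that only ever appeared in different training sessions, which are the pairs a continually learning system would have to generalize to.

```python
from itertools import combinations

def split_pairs(sessions):
    """Split task pairs into those seen together in one training session
    and those whose members only ever appeared in different sessions.

    `sessions` is a list of task lists, in training order (hypothetical layout).
    """
    seen = set()
    for tasks in sessions:
        # Every pair inside a single session was trained jointly.
        seen.update(frozenset(pair) for pair in combinations(tasks, 2))

    all_tasks = sorted({task for tasks in sessions for task in tasks})
    every_pair = {frozenset(pair) for pair in combinations(all_tasks, 2)}

    # Cross-session pairs are the held-out compositional test set.
    unseen = every_pair - seen
    return seen, unseen

# The example from the text: A and B first, then C and D later.
sessions = [["A", "B"], ["C", "D"]]
seen, unseen = split_pairs(sessions)

print("trained together:", sorted(tuple(sorted(p)) for p in seen))
print("never seen together:", sorted(tuple(sorted(p)) for p in unseen))
```

With two sessions of two tasks each, two pairs are trained jointly and four (including A-versus-C) are never seen together. The held-out share grows quickly as sessions accumulate, which is one way to see why this property is the hardest of the five.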

The fifth property is where the tension becomes irreconcilable, and why this matters for vertical AI. Foundation models achieve compositionality through pre-training — they see everything at once, and joint training creates representations that automatically recombine. But they cannot learn sequentially post-deployment without catastrophic forgetting. Harness-based memory (RAG, skill files, knowledge graphs) achieves sequential learning trivially — just append a document. But it cannot compose: the new document does not create neural representations that transfer to unrelated tasks.

Figure 2 — Foundation Models vs. Harness-Based Memory on the Five Desiderata

No system achieves all five desiderata. Foundation models excel at compositionality and preservation through joint pre-training but cannot learn sequentially post-deployment. Harness-based memory excels at sequential learning and efficiency but lacks compositional depth. This tension is not a temporary engineering gap. It is the entire game.

The formal CL literature4 distinguishes three difficulty levels: task-incremental (model knows which task), domain-incremental (same task, shifting distribution), and class-incremental (must distinguish classes learned at different times). Real-world verticals require the hardest setting — a legal AI that learned contracts in January and regulatory compliance in March must handle queries requiring both.

Three decades of research (elastic weight consolidation,5 progressive networks,6 gradient episodic memory7) have produced methods that each solve only a subset of these five properties. Two recent results matter most: sparse continual learning reduces forgetting from 89% to 11% by updating only a fraction of parameters per task,8 and Shenfeld et al. show that on-policy RL retains 93% accuracy on prior tasks while supervised fine-tuning degrades catastrophically.9 The training algorithm matters as much as the architecture. But the fundamental tension, stability versus plasticity, is not waiting for a scaling fix. It shapes every strategic decision vertical AI companies must make today.

Section 3

The Harness Ceiling

Since parametric continual learning remains unsolved, the industry has built the next best thing: harnesses. A harness is the entire non-model infrastructure that wraps a foundation model — the system prompt, the tool definitions, the RAG pipeline, the skill files, the memory store, the workflow orchestration. Claude Code's CLAUDE.md, Cursor's tab completion context, Harvey's legal workflow engine — all harnesses. They work. They work remarkably well for a surprisingly long time. But Lichkovski's argument is that they hit a ceiling, and the ceiling is hard.

There are two reasons, and they are distinct.

Reason 1: The Scaling Wall

Harness memory has a fixed-model ceiling. As your knowledge bank grows from ten skill files to a hundred to a thousand, three forces conspire against you. Context rot: the model's attention degrades over long contexts, and you cannot fit everything into the window. Retrieval difficulty: which of your thousand KV caches or markdown files is relevant to this specific query? The retriever becomes the bottleneck, and retriever quality scales sublinearly with knowledge bank size. Diminishing returns: more knowledge does not equal a smarter model. The underlying model is the same; you are just giving it more to sift through. This is the inverse of compositionality — instead of knowledge combining multiplicatively, it accumulates additively and eventually destructively.

"Parametric knowledge fundamentally changes the amount of intelligence per forward pass, which holds significantly greater compounding potential."
— Ilija Lichkovski

Reason 2: The Automaticity Gap

Neural memory recombines automatically. A model that has internalized the structure of legal precedent and the structure of financial regulation does not need a retriever to notice when a case involves both — the representations are entangled in the weights, and the connection surfaces on its own. This automatic recombination is a prerequisite for creativity. Novel ideas emerge from unexpected connections between domains, and those connections must be made without someone (or something) deciding in advance to look for them.

Harness memory requires retrieval, and retrieval requires intent. You have to know what to look for before you find it. As the knowledge bank grows, the retriever becomes the single point of failure for creative synthesis. A retrieval miss is a thought that never occurs.

This is not a theoretical concern. It is the practical experience of every team building agents at scale. Vercel's v0 went from 15 tools to 2. SWE-agent's 13 custom ACI commands compressed to roughly 100 lines of Python driving raw bash in mini-swe-agent, achieving 74%+ on SWE-bench Verified.11 With every model release, the Claude Code team deletes a chunk of their system prompt. The harness layer is being compressed, and the compression is being driven by the model absorbing what the harness used to provide.

The Key Nuance

The ceiling height varies by domain. The amount of cross-cutting, compositional reasoning a task requires determines how quickly harness-based approaches hit diminishing returns. This is not one ceiling — it is a landscape of ceilings.

Figure 3 — Harness Ceiling by Domain

Domain	Harness ceiling	Why
CX / Support	HIGH	Conversations are mostly independent
Accounting	MEDIUM	Cross-entity rules, period-end judgments
Legal	MEDIUM	Precedent chains across jurisdictions
Code (complex)	LOW	Cross-file, cross-module compositionality
Research	LOW	Insight requires cross-domain synthesis

Customer support has a high ceiling because conversations are largely independent — each ticket is a fresh context, and cross-conversation knowledge adds little. Coding has a low ceiling because meaningful software work requires understanding how changes in one file affect a dozen others. The ceiling is determined by how much cross-cutting compositionality the domain requires.

This variance in ceiling height has a direct consequence for vertical AI companies. Where the ceiling is high, harness-based approaches can sustain a viable business for years. Sierra's customer experience constellation, with 2 million+ conversations per month and outcome-based pricing, operates in a domain where each conversation is mostly self-contained. The harness ceiling is high, and they can compound their data advantage within it.

Where the ceiling is low, the model must do the heavy lifting. SWE-EVO sits at 21% not because the harness is bad but because 21-file, 874-test software evolution tasks require compositional reasoning that no amount of context stuffing can provide. The error compounding problem in long-horizon tasks (at 95% per-step accuracy, a 50-step task succeeds 7.7% of the time, and a 100-step task about 0.6% of the time) is a parametric limitation, not a retrieval one. But even here, the model may have more room than the harness suggests: Snell et al. showed that a small model with optimally allocated test-time compute can match a 14x larger model, implying that the ceiling is partly about how you spend inference, not just what's in the weights.10
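The error compounding figure is simple arithmetic, and it is worth seeing how fast it decays. A minimal sketch, assuming every step succeeds independently with the same probability (a simplification; real agent errors are often correlated, and the function name is illustrative):

```python
def task_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step of a task succeeds, assuming independent steps."""
    return per_step_accuracy ** steps

# 95% per-step accuracy across increasingly long tasks.
for steps in (10, 50, 100):
    p = task_success_probability(0.95, steps)
    print(f"{steps:>3} steps: {p:.1%}")
# → 10 steps: 59.9%, 50 steps: 7.7%, 100 steps: 0.6%
```

Under this assumption, doubling the task length squares the success probability, which is why small per-step gains in the model matter more than any amount of added context.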

This tension — the varying ceiling height across domains, the complementary strengths of parametric and harness-based approaches, the unsolved problem of continual learning — is the landscape that the next section maps. Not as a dichotomy, but as a topology.

· · ·

Part II

The Domain Topology

Section 4

The Absorption Wave

Here is the one-sentence version of everything you need to know about which capabilities foundation models absorb and which they leave alone: models absorb capabilities at the rate determined by verification fidelity.13

Verification fidelity is determined by three variables: how fast the environment can score the model's output, how cheap that scoring is, and how accurate it is. Where all three are maximized — instant, free, perfect — the model absorbs the capability and the scaffolding around it compresses. Where any one is constrained, the capability remains external to the model, and the scaffolding persists.

This is not a metaphor. It is the literal mechanism by which post-training (RL) works. The bottleneck in RL is not compute or data — it is the environment's ability to score the model's output. Foundation model labs spend in excess of $1 billion per year on RL-able data and environments. Anthropic alone planned to spend over $1 billion on RL environments in 2026. The third-party ecosystem — Scale AI ($29B), Surge AI, Mercor — exists to expand the surface area of what can be scored.

Figure 4 — The Verifiability Spectrum

	Strongly Verifiable	Weakly Verifiable	Inferable	Non-Verifiable
Signal	Machine checks. Perfect.	Learned judge. Noisy but scalable.	User behavior as signal.	Human taste. Can't scale.
Speed	Instant	Seconds	Hours	Days
Cost	Free	Cheap	Free (but slow)	Expensive
Accuracy	~100%	~80%	Noisy	Can't scale
Examples	Unit tests, Lean proofs, sim-to-real	LLM-as-judge, PRMs, COMET	Cursor accept/reject, ad CTR, app ratings	Video aesthetics, strategy, negotiation, brand voice
Lab strategy	RL runs aggressively here	Building verifiers to push left	Collecting production signal	No path without better verifiers

The labs' strategy is a single repeated action: move every domain leftward on this spectrum. Build a better verifier for a task that was previously unverifiable, expand the verification fidelity, run RL, absorb the capability. The entire $1B+ annual spend on RL environments is an investment in pushing the frontier of verifiability.

The spectrum has an embedded prediction: anything on the left gets absorbed into the model. Anything on the right stays external. And the boundary between the two shifts over time, as better verifiers pull more tasks to the left.

Consider the trajectory of code quality verification. In 2020, the verifier was a senior engineer reviewing pull requests — slow, expensive, accurate but subjective. By 2022, linters and static analysis tools automated the trivially verifiable subset. By 2024, SWE-bench created a standardized evaluation framework with automated test execution — pulling a large class of coding tasks into strongly verifiable territory. By 2025, process reward models (PRMs) could score intermediate reasoning steps, not just final outputs. By 2026, cross-agent consensus mechanisms and multi-model judge panels were scoring tasks that no single automated check could verify.

Figure 5 — How Code Quality Verification Became Scalable

Year	Verifier	What it does
~2020	Senior Engineer Review	Manual code review. Hours per PR. $150+/hr. Subjective quality judgment. Cannot scale beyond team size.
~2022	Linters + Static Analysis	Semgrep, ESLint, type checkers. Instant, free, but limited to rule-expressible patterns. Style and logic bugs still require humans.
~2024	SWE-bench + Test Execution	Standardized evaluation. Automated test suites as verifiers. 2,294 real GitHub issues with pass/fail signals. Pulled functional correctness into strong verifiability.
~2025	Process Reward Models (PRMs)12	Score intermediate reasoning steps, not just final output. Enable RL over multi-step coding trajectories. The verifier now sees the reasoning, not just the result.
~2026	Cross-Agent Consensus + Multi-Model Judges	Agent panels evaluate design decisions, architectural quality, security implications. Scalable, ~80-90% agreement with expert review. Previously non-verifiable judgment now weakly verifiable.

Each step in this progression expanded the verification fidelity for code quality. The consequence was direct and measurable: SWE-bench Verified pass rates went from 3.8% (GPT-4, April 2024) to over 74% (frontier agents, early 2026) — roughly a 20x improvement in under two years. The verifier unlocked the improvement.

The evidence trail is now extensive enough to treat this as a law rather than a hypothesis. Harvey's legal fine-tune — 20 billion tokens, 97% attorney preference over GPT-4, absorbed within 14 months by frontier reasoning models — is the canonical case. We will examine it in detail in Part III. The pattern is consistent: fine-tunes are features with shelf lives, not flywheels with compounding returns.

The scaffolding compression story is even faster. Chain-of-thought prompting — the technique that kickstarted the reasoning revolution in 2023 — is now actively harmful on reasoning models. Asking o3 or Claude Opus to "think step by step" before using tools degrades performance, because the model already reasons internally and the external prompt creates interference. SWE-agent's Agent-Computer Interface, which added 10.7 percentage points to SWE-bench in April 2024 via 13 custom commands, was proven unnecessary by the SWE-agent team's own mini-swe-agent: 100 lines of Python, raw bash, no custom commands, 74%+ on SWE-bench Verified. The cognitive scaffolding was absorbed into the model weights within 10 months.

The Labs' Strategy, in One Sentence

Build better verifiers. Expand verification fidelity. Train. Absorb the capability. The entire investment in RL infrastructure is an investment in moving the verifiability frontier leftward, domain by domain.

For vertical AI companies, this creates an existential clock. If your value comes from a capability that sits in the strongly verifiable zone, the foundation model will absorb it. The speed of absorption is determined by two things: how quickly the labs can build (or buy) a verifier for your domain, and whether they can access sufficient training data to run against it. The $29 billion Scale AI valuation and the $1B+ lab RL budgets tell you how seriously this absorption program is being executed.

Section 5

Every Domain Is a Topology

The common mistake — made by founders, investors, and analysts with equal frequency — is to treat a domain as a monolith. "Legal AI." "Healthcare AI." "Accounting AI." As if the entire domain sits at one point on the verifiability spectrum. It does not. A domain is a landscape with peaks and valleys of verifiability. The intelligent company does not plant its flag on the domain label. It maps the topology, decomposes the domain into its verifiability zones, and builds differently in each.

But how do you classify a task? We use a simple litmus test. For any task in any domain, ask these questions in order — the first "yes" determines the zone:

The Verification Litmus Test

Test	Zone
Can a deterministic program check correctness in under one second? (compiler, test suite, balance sheet, regex match, code lookup)	Strongly Verifiable
Can a domain-trained evaluator or structured rubric score it with consistent expert agreement? (judge model, checklist, style guide, peer review rubric)	Weakly Verifiable
Can correctness only be observed from downstream outcomes, with days-to-months of delay? (revenue impact, claim denial rate, user retention, patient readmission)	Inferable
Do qualified domain experts regularly disagree on what "good" looks like? (strategy, taste, judgment under genuine uncertainty)	Non-Verifiable

The zone is determined by the strongest available verification method. A task is strongly verifiable if any deterministic checker exists, even if human judgment could also evaluate it. The litmus test biases toward the most scalable signal.
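The litmus test is an ordered decision procedure, so it can be written down directly. The sketch below is illustrative only: the `Task` fields are hypothetical yes/no answers that a person would fill in for each task, and the function does nothing more than return the zone of the first "yes".

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Answers to the four litmus questions for one task (hypothetical fields)."""
    name: str
    deterministic_check_under_1s: bool = False   # compiler, test suite, code lookup
    consistent_rubric_or_judge: bool = False     # judge model, checklist, rubric
    only_downstream_outcomes: bool = False       # observable after days to months
    experts_disagree: bool = False               # no shared definition of "good"

def classify(task: Task) -> str:
    """Return the zone for the first question answered 'yes', in litmus order."""
    if task.deterministic_check_under_1s:
        return "Strongly Verifiable"
    if task.consistent_rubric_or_judge:
        return "Weakly Verifiable"
    if task.only_downstream_outcomes:
        return "Inferable"
    if task.experts_disagree:
        return "Non-Verifiable"
    return "Unclassified"  # none of the four questions applied

# A task with both a deterministic checker and a rubric lands in the strongest zone.
citation = Task("citation verification",
                deterministic_check_under_1s=True,
                consistent_rubric_or_judge=True)
drafting = Task("contract drafting", consistent_rubric_or_judge=True)
strategy = Task("litigation strategy", experts_disagree=True)

for task in (citation, drafting, strategy):
    print(f"{task.name}: {classify(task)}")
```

The ordering carries the bias described above: a task that could be scored by a rubric is still classified as strongly verifiable the moment any deterministic checker exists for it.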

Figure 6 — Domain Topology: Three Verticals Decomposed

Legal (Harvey, $195M ARR)

Task	Zone
Document extraction & clause identification	Strongly Verifiable
Citation verification & cross-reference checking	Strongly Verifiable
Contract drafting & redlining	Weakly Verifiable
Legal research & memo generation	Weakly Verifiable
Risk assessment & due diligence synthesis	Inferable
Litigation strategy & settlement negotiation	Non-Verifiable

Accounting (Campfire, $375M valuation)

Task	Zone
Transaction reconciliation & balance verification	Strongly Verifiable
Bank feed matching & duplicate detection	Strongly Verifiable
Transaction categorization & GL coding	Weakly Verifiable
Revenue recognition (ASC 606)	Weakly Verifiable
Company-specific judgment calls (CFO accept/reject)	Inferable
Financial strategy & tax optimization	Non-Verifiable

Healthcare (Abridge, 1.5M encounters)

Task	Zone
Medical transcription accuracy (WER)	Strongly Verifiable
ICD-10 / CPT code assignment	Strongly Verifiable
Clinical note generation & structuring	Weakly Verifiable
Differential diagnosis suggestion	Weakly Verifiable
Treatment plan adherence & outcome prediction	Inferable
Clinical judgment & patient communication style	Non-Verifiable

No domain is a monolith. Every vertical contains tasks spanning the entire verifiability spectrum. The smart company maps this topology and builds differently in each zone.

The pattern is consistent across verticals. Every domain decomposes into a gradient from strongly verifiable (machine-checkable, scalable) to non-verifiable (human judgment, no automated signal). The tasks at the top of each list — document extraction, transaction reconciliation, transcription accuracy — are already being absorbed by foundation models. The tasks at the bottom — litigation strategy, financial strategy, clinical judgment — will not be absorbed because no reward signal can be constructed for them at scale.

The middle is where the war is fought. Contract drafting, categorization, clinical note generation — these are weakly verifiable. A judge model or a set of rules can score them with 80% accuracy. Good enough for RL. Good enough for the labs to build training loops around. And the verifier technology is improving every quarter, which means this middle zone is expanding leftward, pulling more tasks into the absorption window.

Figure 7 — The Domain Topology Map: Harvey's Legal Domain

Harvey's legal domain as a topology. The y-axis represents how far a task sits above the foundation model's capability waterline. Document extraction and citation checking are submerged — already absorbed. Contract drafting sits at the waterline, contested. Litigation strategy towers above it, safe from absorption because no reward signal exists. The waterline is rising. Harvey survived its fine-tune moment by pivoting from model advantage (submerged) to workflow orchestration and evaluator ownership (above the waterline).

Smart companies decompose their domain along this topology and adopt three distinct strategies simultaneously:

Green Zone Strategy: Train Aggressively

For strongly verifiable tasks (document extraction, reconciliation, transcription), build production signal loops. Every transaction is a label. Every accept/reject is free training data. Campfire collects every accounting transaction at full fidelity; the result is 95%+ automation accuracy. Cursor gets 400 million labeled examples per day from code accept/reject decisions. The moat here is not the model — it is the rate of training signal generation. You cannot out-train someone with 400M daily labels.

Yellow Zone Strategy: Own the Evaluator

For weakly verifiable tasks (drafting, categorization, clinical notes), the durable asset is the evaluation infrastructure. Harvey's 18,000 legal workflows are not prompts; they are structured evaluators that define what "correct" looks like for each task type. Abridge's 1.5 million medical encounters produced a proprietary ASR with 16% lower word error rate than Whisper and 45% fewer medical errors. The verifier is upstream of the model. Whoever controls the reward signal controls the rate of model improvement.

Red Zone Strategy: Wrap Human Judgment

For non-verifiable tasks (strategy, negotiation, clinical judgment, creative direction), keep humans in the loop and build tools that amplify their judgment rather than replace it. This is not a temporary concession — it is a permanent architectural choice. No RL loop can train on "was this the right litigation strategy" because the ground truth takes years to materialize and cannot be attributed to a single decision. The value here is in the quality of the human-AI collaboration interface, not the AI's autonomous capability.

The topology map is the strategic document. It tells you, with uncomfortable precision, where your moat is and where it isn't.

Harvey understood this after its fine-tune was absorbed. The company pivoted from model advantage to workflow orchestration — Vault, multi-model routing, 18,000 structured workflows. Its $195M ARR and $8-11B valuation survived the absorption event because it had already built durably in the yellow and red zones. The model advantage evaporated. The system advantage persisted. The topology was the map that showed where to stand.

The Question Every Vertical AI Company Must Answer

Where does your domain sit on the topology map? Which of your capabilities are in the green zone (absorption imminent), the yellow zone (contested, invest in evaluators), or the red zone (safe, invest in human collaboration)? If you cannot answer this with specificity — not at the domain level but at the task level — you do not know whether your business compounds or compresses.

· · ·

Part III

Five Topologies

The topology map from Part II is a framework. Frameworks are only useful if they survive contact with real companies. This section applies the map to five companies — each occupying a different position on the verifiability landscape, each navigating it differently. The goal is not to rank them but to show how the same analytical instrument produces distinct strategic readouts depending on the domain's structure.

Section 6

Cursor — The Maximum Fidelity Position

Verification Heuristic — Code

Litmus: Does it compile? Do tests pass? Did the user accept the suggestion?

Primary verifier: Compiler + test suite + implicit user feedback (tab = accept, escape = reject)

Signal density: ~400M labels/day · 90-minute retrain cycles · zero marginal cost per label

Cursor is the clearest existence proof that verification fidelity determines everything. The company occupies the single highest-fidelity position in the entire AI application landscape: code generation and editing. Almost every interaction generates a training signal. Every tab completion is accepted or rejected. Every function is compiled or fails to compile. Every test passes or fails. The signal is instant, free, and nearly perfect.

The numbers are staggering. More than 400 million accept/reject decisions per day, each one a labeled training example generated at zero marginal cost. Retrain cycles running every 90 minutes. Over 2.5 million developers on the platform. $300 million in annual recurring revenue. And increasingly, Cursor is using this signal to train its own models — Composer 2 hit 61.7% on Terminal-Bench at one-tenth the inference cost of frontier models. The production signal loop is not a moat supplement. It is the moat.

The topology of the code domain explains why. Almost everything a developer does in a code editor is strongly verifiable. The code compiles or it does not. The tests pass or they do not. The user hits tab or presses escape. There is no ambiguity, no need for a learned judge, no latency between action and evaluation. The verification fidelity per interaction is maximized.

Figure 8 — Cursor's Domain Topology: Almost Entirely Submerged

Cursor's domain is almost entirely submerged beneath the foundation model capability line. The vast majority of coding tasks — tab completion, function generation, bug fixing, test writing — are strongly verifiable. Only a sliver of the domain (architecture decisions, code style) sits at the waterline, and an even thinner sliver (system design philosophy) rises above it. This is why coding AI is further ahead than any other vertical: maximum verification fidelity produces maximum rate of improvement.

The small yellow zone — code style preferences, architectural decisions about whether to use composition versus inheritance, which abstraction pattern to reach for — is weakly verifiable. A learned judge or senior engineer can score it, but the signal is noisy and slow. The tiny red zone — system design philosophy, whether a monolith or microservices is the right choice for this team at this stage — is genuinely non-verifiable. There is no ground truth. Reasonable engineers disagree.

But the ratio matters enormously. By surface area, more than 90% of what happens in Cursor's domain is in the green zone. This means the production signal loop operates at a scale no competitor can replicate without equivalent distribution. The math is instructive: at 400 million labeled examples per day, the equivalent data purchase from RL environment companies would cost labs somewhere between $80 billion and $800 billion per year. No one is paying that. The signal is an asset that only Cursor's installed base generates.
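The cost range can be unpacked into the per-label price it implies. This is a back-of-the-envelope sketch that uses only the figures quoted above; the per-label prices are derived here, not sourced, and the variable names are illustrative.

```python
LABELS_PER_DAY = 400_000_000   # accept/reject decisions per day, from the text
DAYS_PER_YEAR = 365

labels_per_year = LABELS_PER_DAY * DAYS_PER_YEAR

def implied_price_per_label(annual_cost_usd: float) -> float:
    """Per-label price implied by an annual purchase cost at this label volume."""
    return annual_cost_usd / labels_per_year

low = implied_price_per_label(80e9)    # $80 billion per year
high = implied_price_per_label(800e9)  # $800 billion per year

print(f"{labels_per_year:,} labels per year")
print(f"implied price: ${low:.2f} to ${high:.2f} per label")
# → 146,000,000,000 labels per year
# → implied price: $0.55 to $5.48 per label
```

Read this way, the $80 billion to $800 billion range corresponds to pricing each accept/reject decision at roughly fifty cents to five and a half dollars, so the estimate is only as good as that assumed unit price.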

The Insight

Cursor occupies the maximum fidelity position: almost all tasks are strongly verifiable, retrain cycles are 90 minutes, and every user interaction is a label. This is why coding is the furthest-ahead AI vertical. It is not because code is easy. It is because code is scoreable.

Section 7

Harvey — The Pivot Case Study

Verification Heuristic — Legal

Litmus: Does the citation exist in the source document? Does the clause match the contract language?

Primary verifier: Citation checker (green) + BigLaw Bench structured rubrics (yellow) + partner review (red)

Signal density: 18,000 structured workflows generating labeled attorney corrections

Harvey is the most instructive case in the landscape because it survived an absorption event. In 2024, Harvey fine-tuned a model on 20 billion tokens of legal text. Attorneys preferred it 97% of the time over GPT-4. It was a genuine technical moat — a model that understood legal reasoning in a way no general-purpose system could match. It lasted 14 months.

Frontier reasoning models — o3, GPT-5, GPT-5.1 — absorbed the cognitive legal capability that Harvey had painstakingly distilled. Seven models, including three from non-OpenAI labs, surpassed Harvey on its own benchmark, BigLaw Bench. The fine-tune was not a flywheel. It was a feature with a shelf life. The foundation model waterline rose, and everything Harvey had built below it got submerged.

What Harvey did next is the lesson. The company pivoted from model advantage to evaluator ownership. Vault, the multi-model routing layer. 18,000 structured legal workflows. BigLaw Bench itself, the evaluation framework that defines what "correct" looks like for legal tasks. The pivot was not from one product to another. It was from one zone on the topology map to another. Harvey moved uphill, from the green zone that was being absorbed to the yellow and red zones where the model still needs external structure to perform.

Figure 9 — Harvey's Domain Topology: The Pivot Uphill

Harvey's pivot mapped onto the topology. The dashed arrow traces the strategic migration: from the green zone (fine-tuned model advantage on document extraction and citation checking, absorbed by frontier models in 14 months) uphill to the yellow and blue zones (18,000 structured legal workflows, Vault multi-model routing, BigLaw Bench). The model advantage evaporated. The evaluator advantage persists. $195M ARR and $8-11B valuation survived the absorption event.

The topology explains both why the absorption happened and why the pivot worked. Harvey's original fine-tune lived in the green zone — the cognitive legal reasoning it distilled was, by definition, verifiable enough that labs could train on equivalent signal. The law libraries are public. The case outcomes are on record. The evaluation criteria (did the attorney prefer this output?) are learnable by any model with enough data. Green zone capabilities get absorbed. That is the rule, not the exception.

But Harvey's 18,000 workflows live in the yellow zone. Each workflow encodes a specific legal process — not just "draft a contract" but "draft a Delaware Series A preferred stock purchase agreement with the following carve-outs for the following jurisdiction, reviewed against these precedents." That specificity is the evaluator. It defines what "correct" means at a granularity that no general-purpose model can infer from public training data. BigLaw Bench is the meta-evaluator — an evaluation framework for legal AI that Harvey controls. The verifier is upstream of the model. The model needs Harvey's evaluator to improve at legal tasks. Harvey does not need any single model.

The Insight

Harvey survived absorption by pivoting uphill on the topology map. The durable position is not "we have a better model for legal." It is "we define what correct looks like for 18,000 legal workflows, and any model that wants to improve at legal tasks needs our evaluator." The verifier is upstream of the model. Own the verifier.

Section 8

Sierra — The Constellation Architecture

Verification Heuristic — Customer Experience

Litmus: Did the customer issue resolve? Was the action policy-compliant? Did CSAT improve?

Primary verifier: Resolution binary (green) + policy compliance checker (yellow) + outcome-based pricing ($1-8/resolution)

Signal density: 2M+ conversations/month · every resolution = verified positive, every escalation = verified negative

Sierra occupies a structurally unusual position on the verifiability landscape. Its domain — customer experience and support — has the highest harness ceiling of any vertical we have examined. The reason is architectural: conversations are mostly independent. Each customer interaction is a self-contained episode with a clear start, a resolution (or not), and a measurable outcome. There is minimal cross-conversation compositionality. The harness does not degrade at scale because the scale is horizontal, not vertical.

The numbers confirm the structural advantage. More than 2 million conversations per month across Sierra's customer base. Outcome-based pricing at $1-8 per resolution, aligning the business model directly with the signal loop — Sierra only gets paid when the conversation resolves, which means every successful resolution is a verified positive signal and every escalation is a verified negative signal. Estimated ARR exceeding $500 million.
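The labeling rule in this paragraph can be sketched in a few lines. This is a hypothetical illustration, not Sierra's implementation: the `Conversation` type, its field names, and the flat per-resolution fee are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Conversation:
    conversation_id: str
    resolved: bool   # hypothetical field: issue closed by the agent
    escalated: bool  # hypothetical field: handed off to a human


def label(conv: Conversation) -> Optional[int]:
    """+1 for a resolution, -1 for an escalation, None when there is no signal yet."""
    if conv.escalated:
        return -1
    if conv.resolved:
        return 1
    return None


def billable(convs: Iterable[Conversation], fee_per_resolution: float) -> float:
    """Under outcome-based pricing, only positively labeled conversations are billed."""
    return fee_per_resolution * sum(1 for c in convs if label(c) == 1)
```

The point of the sketch is that the billing function and the labeling function read the same field, so revenue and training signal come from a single event.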

The topology of customer experience decomposes cleanly into zones, but the interesting structural feature is the relationship between the zones and Sierra's architecture. Sierra does not deploy a single monolithic agent. It deploys a constellation — specialized agents per domain, each tuned to a different slice of the verifiability landscape.

Figure 10 — Sierra's Constellation Architecture Mapped to Verifiability Zones

Sierra's constellation architecture maps directly onto the verifiability topology. Green-zone agents (intent classification, FAQ routing, order status) are fully automated and strongly verifiable. Yellow-zone agents (multi-turn resolution, escalation decisions) use outcome-based scoring. Red-zone tasks (brand voice, empathy calibration) require per-client tuning that no foundation model can absorb. Each specialist agent handles one verifiability zone. The outcome-based pricing model ($1-8 per resolution) converts the business model itself into a verified signal loop.

The mapping is precise. Green-zone agents — intent classification, FAQ routing, order status lookup — are strongly verifiable. The customer either got the right information or did not. These agents run at near-total automation. Yellow-zone agents — multi-turn problem resolution, escalation decisions — are weakly verifiable. The resolution was satisfactory or it was not, but the judgment is noisy and the optimal path through a five-turn conversation is not deterministic. These agents learn from outcome data: was the conversation resolved, did the customer come back, was there a follow-up escalation? Blue-zone signals come from aggregate patterns — resolution rates, CSAT trends, repeat contact frequency — that take time to materialize but are real.

The red zone is where Sierra's per-client customization becomes a moat. Brand voice is non-verifiable. There is no machine-checkable criterion for whether a response "sounds like Sonos" versus "sounds like WeightWatchers." Empathy calibration — how aggressive to be on a refund, when to use humor, how to handle an angry customer — is taste, not science. Each client's red zone is different, and the tuning requires human collaboration that no foundation model can replicate from public training data.

The Insight

Sierra's domain has the highest harness ceiling because conversations are self-contained episodes. The constellation architecture exploits this: each specialist agent maps to a verifiability zone, and the outcome-based pricing model converts every conversation into a verified signal. The result is a flywheel that operates within the harness ceiling rather than fighting against it.

The vulnerability is the mirror image of the strength. If conversational AI becomes a commodity — if foundation models can handle multi-turn support conversations natively — Sierra's high harness ceiling becomes a flat floor. The defense is outcome-based pricing: as long as Sierra owns the measurement of resolution quality, the model behind it is interchangeable. The risk is that enterprise buyers decide the model is the product and the orchestration layer is overhead.

Section 9

Campfire — The Full-Stack Candidate

Verification Heuristic — Accounting

Litmus: Do debits equal credits? Does the GL code match GAAP? Did the accountant accept the categorization?

Primary verifier: Double-entry balance (green) + GAAP/IFRS rule system (green) + accountant reclassification (yellow)

Signal density: Every transaction = a label · every accountant correction = perfect training signal · 95%+ automation accuracy

Campfire is the closest any company in our sample comes to owning the entire verifiability stack — from the raw production signal at the bottom to the system of record at the top. In accounting, the green zone is massive, the labels are perfect, and every single transaction generates a training signal at full fidelity. This is the full-stack play: own the evaluator, collect the production signal, and build the system of record that makes switching prohibitively expensive.

The accounting domain has a structural advantage that most other verticals lack. The majority of bookkeeping tasks are strongly verifiable by construction. A bank reconciliation either balances or it does not. A duplicate transaction either matches or it does not. Debits equal credits, or they do not. These are not fuzzy heuristics — they are mathematical identities enforced by five centuries of double-entry bookkeeping. GAAP and IFRS are codified rule systems. The evaluator is not learned; it is legislated.
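To make "strongly verifiable by construction" concrete, here is a minimal sketch of a double-entry balance check. The tuple layout and the account names are assumptions chosen for illustration. A real ledger would also validate accounts, periods, and currencies.

```python
from decimal import Decimal


def is_balanced(lines):
    """lines: list of (account, debit, credit) tuples with Decimal amounts.

    A journal entry passes only if total debits equal total credits exactly.
    """
    total_debits = sum((debit for _, debit, _ in lines), Decimal("0"))
    total_credits = sum((credit for _, _, credit in lines), Decimal("0"))
    return total_debits == total_credits


# Hypothetical journal entry: pay 2,500.00 in rent from cash.
entry = [
    ("Rent Expense", Decimal("2500.00"), Decimal("0")),
    ("Cash", Decimal("0"), Decimal("2500.00")),
]
print(is_balanced(entry))
```

`Decimal` is used instead of floats so that equality is exact. The check is instant, free, and binary, which is what puts it in the green zone.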

Campfire exploits this ruthlessly. 95%+ automation accuracy on bookkeeping tasks. Every transaction collected at full fidelity — amount, counterparty, GL code, client correction if any. The client correction is the gold: when an accountant reclassifies a transaction, that reclassification is a perfect label. "The model said rent expense; the human said marketing expense." Free, instant, unambiguous training signal. At scale, this is a production signal loop equivalent in structure (though not in volume) to Cursor's accept/reject stream.
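The way a correction turns into a label can be sketched as a small record type. This is an illustration under stated assumptions, not Campfire's schema: the field names are invented, and treating an uncorrected prediction as implicitly accepted is an assumption the text does not make explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    amount: str
    counterparty: str
    predicted_gl: str                   # what the model proposed
    corrected_gl: Optional[str] = None  # set only when an accountant reclassifies


def to_training_example(tx: Transaction):
    """Return (features, target_label, was_corrected).

    Assumption: if no correction was made, the prediction is treated as accepted.
    """
    target = tx.corrected_gl or tx.predicted_gl
    features = {"amount": tx.amount, "counterparty": tx.counterparty}
    return features, target, tx.corrected_gl is not None


# The example from the text: model said rent expense, human said marketing expense.
tx = Transaction("1200.00", "Acme Ads", "Rent Expense", "Marketing Expense")
features, target, was_corrected = to_training_example(tx)
```

Corrected examples are the scarce, high-value ones. A pipeline built on this idea would likely weight them more heavily than implicit acceptances.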

Figure 11 — Campfire's Domain Topology: The Dominant Green Zone

Campfire's domain topology shows a dominant green zone. Reconciliation, bank matching, and duplicate detection are strongly verifiable — every task generates a perfect label. The yellow zone (categorization, revenue recognition) is narrower. The red zone (tax strategy, financial planning) is small but permanent. The structural advantage: accounting's evaluator is not learned — it is legislated by GAAP. Every green-zone transaction that generates a perfect label feeds the production signal loop.

The vulnerability is equally visible on the topology. Campfire's large green zone means a large surface area that foundation models can and will absorb. Balance sheet balancing is strongly verifiable — labs will train on it. Transaction categorization is weakly verifiable with rules-based + LLM-judge scoring — frontier models already handle standard cases. Anthropic has already launched a finance Cowork plugin. The non-authoritative layer is being commoditized.

The durability comes from three non-AI positions. First, the system of record. Once Campfire is the general ledger, switching cost is massive — the NetSuite dynamic, where lock-in persists for 25 years despite mediocre product. This is SaaS gravity, not AI moat, but it compounds. Second, SOC certification and regulatory compliance. Labs will not get SOC certified for every vertical. Third, and most critically, the production signal loop at full fidelity. Every transaction, every correction, every GL reclassification is a labeled training example that only Campfire's installed base generates. The question is whether that signal volume is sufficient to sustain a model advantage or whether frontier models will absorb standard accounting patterns from public data faster than Campfire can specialize on its proprietary corrections.

The Insight

Campfire's domain has the largest green zone of any company in this analysis, which is simultaneously its greatest strength (perfect labels at full fidelity, legislated evaluator) and its greatest vulnerability (maximum surface area for absorption). Survival depends on whether the full-stack play — system of record + SOC compliance + production signal loop — compounds faster than foundation models absorb standard accounting patterns.

Section 10

Abridge — The Proprietary Verifier

Verification Heuristic — Healthcare Documentation

Litmus: Does the transcript match the audio? Does the billing code match the encounter? Did the clinician sign off?

Primary verifier: Proprietary ASR engine (green, 16% better WER than Whisper) + ICD-10/CPT lookup (green) + clinician approval (yellow)

Signal density: 1.5M encounters · each clinician sign-off = verified note · claim denial rate = downstream outcome signal

Abridge is the purest example of the evaluator-becomes-model path. The company started by solving a verification problem — accurate medical transcription — and used the verifier it built to generate a dataset that no lab can replicate. 1.5 million medical encounters. A proprietary ASR engine that outperforms OpenAI's Whisper by 16% on word error rate. 45% fewer medical errors in clinical documentation. These numbers are not incremental improvements. They represent a structural data advantage with regulatory and distributional barriers that make replication prohibitively difficult.

The topology of healthcare documentation decomposes along a characteristic gradient. At the bottom — the strongly verifiable green zone — sit transcription accuracy and medical coding. Word error rate is a number. ICD-10 and CPT codes are either correct or incorrect against a known standard. These tasks have perfect verifiability: instant, free, and unambiguous. Abridge's 16% WER advantage over Whisper means it built a better verifier for the most fundamental healthcare documentation task, and that verifier advantage propagates upward through the stack.
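"Word error rate is a number" can be made literal. The sketch below uses the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a generic implementation, not Abridge's evaluation code. Reading the 16% figure as a relative reduction is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the improved system lowers WER relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer
```

For example, `word_error_rate("patient denies chest pain", "patient denies chess pain")` is one substitution over four reference words, or 0.25.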

The propagation mechanism matters. A more accurate transcript feeds a better clinical note generator. A better clinical note generator produces more accurate ICD-10 codes. More accurate codes generate fewer claim denials, which produces a measurable outcome signal (revenue cycle impact) that reinforces the entire loop. Each layer's verifier is an input to the layer above it. The green zone does not just generate training signal for the green zone — it generates the foundation that makes the yellow zone tractable.

Figure 12 — Abridge's Domain Topology: The Proprietary Verifier Layer

Abridge's proprietary verifier layer (dashed box) sits beneath the entire stack. The 1.5M medical encounters trained an ASR engine that outperforms Whisper by 16% on word error rate. This is not a model fine-tune on public data but a proprietary verifier built on clinical data that no lab can access without HIPAA-compliant distribution into 55+ health systems. The verifier's signal propagates upward: better transcription feeds better clinical notes, which feed more accurate coding, which feeds measurable revenue cycle outcomes. The EHR integrations (Epic, Oracle Health) are the distribution footprint that makes the data flywheel turn.

The key to understanding Abridge's durability is the distinction between the model and the verifier. Abridge did not merely fine-tune a model on medical data, which was the Harvey path and ended in absorption. It built a verifier, a proprietary ASR engine, that generates the ground truth against which models are scored. The 16% WER improvement over Whisper is not a model advantage. It is a verification advantage. It means the transcripts behind Abridge's training labels contain 16% fewer word errors than what a lab could produce using off-the-shelf speech recognition. Every model trained on Abridge's labels starts from a higher floor.

The distribution moat reinforces the data moat. Epic and Oracle Health integrations cover over 55 health systems. Each integration took months of compliance work: HIPAA business associate agreements, SOC 2 certification, clinical workflow integration testing. A lab that wanted to replicate Abridge's dataset would need to replicate its distribution footprint first. This is not a six-month project. It is a multi-year campaign requiring sales, compliance, and clinical partnership infrastructure that no foundation model lab has built or has incentive to build for a single vertical.

The result is the evaluator-becomes-model path in its cleanest form. Abridge's proprietary verifier generates a dataset. The dataset trains a model. The model produces better clinical documentation. Better documentation generates more accurate labels. The verifier improves. The cycle compounds. And at every turn, the data is physically inaccessible to labs — protected by HIPAA, locked behind EHR integrations, generated only in clinical encounters that Abridge's distribution makes possible.

The Insight

Abridge built a proprietary verifier, not just a proprietary model. The 1.5M medical encounters and 16% WER advantage over Whisper represent a verification moat — the ground truth against which all medical transcription models are scored. The EHR integrations create a distribution footprint that makes the dataset physically inaccessible to labs. This is the evaluator-becomes-model path: the verifier generates the data, the data trains the model, the model generates better data. The cycle compounds behind a distribution wall.

· · ·

Part IV

The Survival Positions

Section 11

The Full-Stack Play

The three survival positions from Part III — own the checker, own the domain reward surface, own the production signal loop — are individually valuable. But the emerging thesis is stronger: what happens when a single company owns all three?

The full-stack play is the logical endpoint of everything this essay has argued. If verification fidelity is the mechanism by which models absorb capabilities, and if the verifiability frontier determines what gets absorbed and when, then the company that controls the entire learning stack — the evaluator that generates the reward signal, the production environment that generates labeled data, and the model that is trained on that signal — has achieved something the foundation model labs cannot replicate from the outside: a closed-loop system that improves itself on its own domain, at its own pace, on its own terms.

Three components make this work.

1. Own the Evaluator

Not "we have evals" in the sense of a benchmark suite that runs on Saturdays. Own the verification infrastructure that defines what "correct" means in your domain. Harvey built BigLaw Bench alongside 18,000 structured legal workflows that encode the judgment of top-tier attorneys about what constitutes a good contract clause, a complete legal memo, an accurate citation. Abridge built proprietary ASR evaluation against 1.5 million medical encounters with physician corrections. Campfire encodes GAAP rules as machine-checkable constraints against every transaction. These are not test suites. They are domain-specific reward surfaces that no one else can construct because the data to build them comes from being embedded in the domain.

2. Open-Source Foundation Model

Use Llama, Mistral, DeepSeek, or Qwen as the base. This is not a quality concession — Llama 4 Maverick and DeepSeek-R1 are competitive with frontier proprietary models on most benchmarks. The purpose is model independence. If your training loop runs on Claude, Anthropic can observe your signal, restrict your access, or launch a competing product. If it runs on Llama, Meta has no visibility into your domain-specific post-training and no mechanism to absorb what you have learned. The foundation model becomes a commodity input — interchangeable, upgradable, non-threatening — rather than a dependency that gives your upstream supplier a map of your moat.

3. Production Signal Loop

Collect production data at full fidelity. Every user interaction, every accept/reject decision, every correction, every outcome. This is not logging — it is the systematic generation of training signal at a density that labs cannot replicate without being embedded in the domain themselves. Cursor's 400 million daily accept/reject signals. Abridge's physician corrections on every clinical note. Campfire's CFO approvals on every categorization. The signal is free, labeled by usage, and domain-specific in a way that no synthetic data pipeline can match.

The Full-Stack Thesis

If you own all three — evaluator, open-source base model, production signal loop — you can post-train your own model on your own signal through your own evaluator. You become model-independent. The foundation model becomes a commodity input, not a dependency. The labs spend $1B+/year on RL. You do not need to match that spend in general capability. You only need to exceed it in your domain's yellow and red zones.

Figure 13 — The Full-Stack Architecture

The full-stack architecture is a closed loop. The domain evaluator defines correctness. Production signal provides labeled data. Post-training (RL on your own signal, scored by your own evaluator) updates the open-source base model. The resulting domain model produces better outputs, which generate higher-quality production signal, which feeds the next training cycle. Each revolution tightens the loop.

The evidence is accumulating that this architecture works. Harvey routes between GPT-4, Claude, and internal models — it is already model-independent in practice, treating each foundation model as a modular capability source. Cursor trained Composer 2 on production accept/reject data, achieving 61.7% on Terminal-Bench at one-tenth the cost of frontier models. They did not need to match GPT-4 on everything. They needed to beat it on code completion, the domain where they had the densest production signal.

Campfire collects every accounting transaction at full fidelity. They own the GAAP evaluator. They have the production signal loop. The missing piece is post-training on an open-source base — and when they close that loop, they will have a domain model for accounting that improves at a rate determined by their transaction volume, not by Anthropic's RL budget.

The counter-argument is obvious: the labs are spending $1B+ per year on RL environments and infrastructure. Can any startup match that intensity? The answer depends entirely on domain specificity. You do not need to match the labs' investment in general capability. You need to exceed their investment in your domain's yellow and red zones. The labs are spreading their $1B across every domain. You are concentrating your $10M on one. If your domain is narrow enough and your signal is dense enough, the math works.
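The concentration argument reduces to one division. The sketch below assumes, as a deliberate simplification, that a lab splits its RL budget evenly across domains. The text does not claim this, and real allocation is certainly uneven. The dollar figures are the ones quoted above.

```python
LAB_RL_BUDGET = 1_000_000_000  # "$1B+ per year", from the text
STARTUP_BUDGET = 10_000_000    # "$10M", from the text


def lab_spend_per_domain(n_domains: int) -> float:
    """Assumption: the lab spreads its budget evenly across n_domains."""
    return LAB_RL_BUDGET / n_domains


def startup_outspends_lab(n_domains: int) -> bool:
    """True when the startup's concentrated budget exceeds the lab's per-domain share."""
    return STARTUP_BUDGET > lab_spend_per_domain(n_domains)


# Number of domains at which the lab's per-domain share equals the startup's budget.
break_even_domains = LAB_RL_BUDGET // STARTUP_BUDGET
```

Under the even-split assumption the break-even is 100 domains. If the lab's effort is spread more thinly than that, the startup's $10M is the larger investment in its own domain.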

Recent work on feature-constrained LoRA makes the math work even better.14 The core insight: you do not need to update all of a model's weights to adapt it to a domain. By projecting updates onto interpretable feature directions within the model — the specific internal representations that correspond to meaningful behavioral changes — you can achieve domain adaptation at a fraction of the compute cost. Rank-32 LoRA on Qwen3-8B matched full supervised fine-tuning at 9x lower cost. Cursor already ships a new Composer checkpoint every five hours via continuous RL on production data. The full-stack play does not require lab-scale compute. It requires lab-quality signal — which is exactly what evaluator ownership and production data density provide.

This reframes the accessibility of the full-stack architecture. The bottleneck was never compute — it was the reward signal. Feature-constrained adaptation means that a startup with dense, domain-specific production signal and a strong evaluator can post-train an open-source model on a few GPUs. The barrier to model independence is not hardware. It is the quality and density of your training loop.
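To see why low-rank adaptation is cheap, it helps to count trainable parameters. The sketch below covers plain LoRA only, not the feature-constrained variant cited above, and the 4096 x 4096 weight shape is an illustrative assumption rather than a claim about any specific model's layers. A rank-r adapter on a d_out x d_in matrix trains r * (d_out + d_in) parameters instead of d_out * d_in.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA adapter (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative shape only (assumption): one square 4096 x 4096 projection.
D = 4096
full = full_update_params(D, D)
adapter = lora_params(D, D, rank=32)
fraction = adapter / full
print(f"full: {full:,}  rank-32 adapter: {adapter:,}  fraction: {fraction:.4%}")
```

For this shape the rank-32 adapter trains about 1.6% of the matrix's parameters. Trainable-parameter count is not the same thing as the 9x cost figure quoted in the text, which depends on the full training setup.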

The Strategic Implication

The full-stack play is not for every vertical AI company. It requires evaluator depth, production signal density, and the engineering team to run post-training pipelines. But the compute barrier is lower than it appears — feature-constrained LoRA and efficient on-policy learning mean the prize is reachable without lab-scale infrastructure. For the companies that can execute it, the result is genuine model independence: the ability to improve at a rate the labs cannot dictate or compete with, because the signal, the evaluator, and the training loop are all proprietary.

Section 12

The Fundamental Questions

Everything in this essay rests on claims about what models can and cannot learn, how fast they absorb capabilities, and where the ceilings are. Honest analysis requires stating what we do not know. Five open questions determine whether the survival positions hold — and on what timescale.

1. RL vs. SFT for Domain Adaptation

Shenfeld et al. (2025) demonstrated something that should change how every vertical AI company approaches model customization: on-policy RL retains 93% accuracy on prior tasks while supervised fine-tuning degrades catastrophically. The mechanism is intuitive once you see it. SFT pulls the model toward an external target distribution, directly overwriting the weight configuration that encoded prior capabilities. RL, by contrast, learns through exploration: the model tries different approaches, receives reward signal, and adjusts its policy. Because on-policy RL learns from the model's own samples, each update stays close to the current policy, and the structure of existing capabilities tends to be preserved.

The implication for vertical AI is direct: domain post-training should use RL, not SFT. But RL requires a reward signal. And a reward signal requires an evaluator. This closes the loop back to the central thesis: companies need evaluators (to produce reward signals for RL) more than they need datasets (which only enable SFT). The company that has 10,000 labeled examples but no evaluator can only do SFT and will suffer catastrophic forgetting. The company that has an evaluator can generate reward signal dynamically and train with RL, preserving the base model's general capabilities while adding domain-specific ones.

2. Scaffold Absorption Half-Lives

Every layer of the AI stack has an empirically observable half-life — the time before a foundation model update absorbs its functionality.

Observed Half-Lives

Wrappers (prompt templates, thin UI): 4–6 months. Jasper's collapse from $120M to $35M ARR after ChatGPT launched is the canonical case.
Cognitive scaffolding (chain-of-thought prompts, tool orchestration): 6–12 months. SWE-agent's 13 custom commands were absorbed in 10 months.
Fine-tuned models: ~14 months. Harvey's legal fine-tune was surpassed by seven frontier models on its own BigLaw Bench within 14 months of deployment.
Evaluation infrastructure: Unknown, but estimated at years. No evidence of labs absorbing domain-specific evaluation capabilities.

The half-life determines the investment horizon. If you are building at the wrapper layer, your product must generate enough revenue and lock-in within 4–6 months to survive the next model release. If you are building evaluation infrastructure, the half-life is long enough to compound data advantage across multiple model generations. The question is whether these estimates hold as labs accelerate their RL investment. The half-life of cognitive scaffolding may already be compressing from 12 months toward 6.

3. The Compositionality Ceiling

Section 1 established that parametric compositionality — the ability of neural representations to recombine automatically across domains — is what makes foundation models powerful. Section 3 established that harness-based memory hits a ceiling on this compositionality. The open question is: how hard is the ceiling?

Jessy Lin et al. (2025) showed that sparse continual learning — updating only a small fraction of parameters per task — reduces forgetting from 89% to 11%. Promising. But is the residual 11% forgetting tolerable? In many domains, yes. An 11% degradation in general reasoning may be an acceptable price for deep domain specialization. In others — particularly safety-critical domains like avionics (DO-178C) or medical devices (FDA Class III) — 11% forgetting could be disqualifying. The answer is domain-specific, which means the compositionality ceiling is not one number but a landscape, varying by the error tolerance of the domain.

4. Knowledge Overwriting vs. Accumulation

When you fine-tune a model on domain data, what happens to its general capabilities? This is the stability-plasticity tradeoff, and three decades of research have not resolved it.

Elastic Weight Consolidation (Kirkpatrick 2017) protects important weights by adding a regularization term that penalizes changes to parameters deemed critical for prior tasks. It works for small task sequences. It does not scale to hundreds of sequential tasks. Progressive neural networks (Rusu 2016) sidestep overwriting entirely by adding new capacity for each task and using lateral connections to transfer knowledge. This avoids forgetting by construction but has a linear cost: model size grows with every new domain.
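The EWC regularizer has a compact form: the new-task loss plus (lambda / 2) times the sum over parameters of F_i * (theta_i - theta*_i)^2, where F_i is the diagonal Fisher information estimated on the prior task and theta* is the parameter vector after that task. Below is a minimal pure-Python sketch of the penalty term, with flat lists standing in for tensors. The numbers in the example are made up.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2

    params:        current parameter values
    anchor_params: values after training on the prior task (theta_star)
    fisher:        per-parameter importance (diagonal Fisher estimate)
    lam:           strength of the constraint
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


# Made-up example: the first weight is important (F=10), the second is not (F=0.1).
anchor = [1.0, 1.0]
fisher = [10.0, 0.1]
cheap_move = ewc_penalty([1.0, 2.0], anchor, fisher, lam=2.0)   # moves the unimportant weight
costly_move = ewc_penalty([2.0, 1.0], anchor, fisher, lam=2.0)  # moves the important weight
```

Moving the high-Fisher weight by the same amount costs 100 times more penalty here, which is how EWC steers updates away from parameters the prior task relied on.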

Neither approach is production-viable at the scale that vertical AI companies need. The practical solution today is the one Harvey converged on: multi-model routing. Instead of post-training a single model to handle all tasks, route each query to the model best suited for it. This avoids overwriting entirely but introduces latency and coordination complexity. Whether parametric solutions will eventually replace routing is an open question — and the answer determines the long-term architecture of every full-stack play.
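Evaluator-driven routing can be sketched as a lookup over per-task scores. This is a hypothetical illustration, not Harvey's routing layer: the model names, task types, and scores are invented, and a production router would also weigh latency and cost.

```python
def route(task_type: str, scores: dict) -> str:
    """Pick the model with the highest evaluator score for this task type.

    scores: {model_name: {task_type: evaluator_score}}
    """
    candidates = {
        model: by_task[task_type]
        for model, by_task in scores.items()
        if task_type in by_task
    }
    if not candidates:
        raise KeyError(f"no model has been evaluated on {task_type!r}")
    return max(candidates, key=candidates.get)


# Invented scores, as if produced by a domain evaluator on held-out tasks.
evaluator_scores = {
    "model_a": {"citation_check": 0.97, "contract_draft": 0.81},
    "model_b": {"citation_check": 0.92, "contract_draft": 0.88},
}
```

The router holds no model-specific logic. Swapping a model in or out only requires re-running the evaluator, which is why the evaluator, not any single model, is the fixed point of the design.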

5. Data Efficiency

How much domain data do you actually need? The reflexive assumption is "more is better," and production signal density is a moat. But the evidence is more nuanced. Microsoft's Phi-1 demonstrated that a small model trained on high-quality "textbook" data can outperform models trained on orders of magnitude more internet data. The principle — "textbooks are all you need" — implies that curation matters more than volume.

For vertical AI, this has a counterintuitive implication: curated evaluation data may matter more than raw production volume. A company with 1,000 expert-validated examples and a precise evaluator may produce better post-training results than one with 100 million noisy production signals. This does not invalidate the production signal flywheel — but it suggests that the evaluator's quality, not the signal's volume, is the binding constraint. Which, again, closes the loop back to owning the evaluator as the primary survival position.

Figure 14 — Five Fundamental Questions: Current Evidence and Implications

Question	Current Evidence	Implication	Confidence
RL vs. SFT for domain adaptation	RL retains 93% on prior tasks; SFT degrades catastrophically (Shenfeld 2025)	Evaluators > datasets. Companies need reward signals, not just labeled examples.	High. Replicated across multiple domains.
Scaffold absorption half-lives	Wrappers: 4–6mo. Scaffolds: 6–12mo. Fine-tunes: ~14mo. Evals: years.	Investment horizon determined by the layer you build on. Eval layer is the most durable.	Medium. Based on 3–4 data points per layer. Could compress.
Compositionality ceiling	Sparse CL reduces forgetting from 89% to 11% (Lin 2025)	Residual forgetting is domain-dependent. Tolerable in most verticals, disqualifying in safety-critical.	Medium. Promising direction but unproven at production scale.
Knowledge overwriting	EWC and progressive nets offer partial solutions. Multi-model routing is the current practical answer.	Full-stack plays may require routing, not a single model. Parametric CL would change this.	Low. Stability-plasticity tradeoff fundamentally unsolved at scale.
Data efficiency	Phi-1: quality > quantity. "Textbooks are all you need."	Evaluator quality, not signal volume, may be the binding constraint on post-training.	Medium. Demonstrated for pre-training; less clear for domain post-training.

Four of five questions point to the same conclusion: the evaluator is the critical asset. RL requires evaluators. Data efficiency favors curated evaluation data over raw volume. The overwriting problem is unsolved, but evaluator-driven routing offers a practical workaround. The only question that does not directly implicate evaluators — scaffold half-lives — instead tells you how long each layer survives before the evaluator layer is all that remains.

The pattern is consistent. Directly or indirectly, each question arrives at the same place: the evaluator is the most durable component in the stack. RL requires it. Data efficiency depends on it. Routing is governed by it. Scaffold absorption spares it longest. This is not coincidence. It is a consequence of the underlying mechanism. The evaluator is the reward surface. Everything else (the model, the harness, the scaffolding) is optimized against that surface. The surface persists. The things optimized against it do not.

Section 13

The Investor Playbook

The preceding sections build a framework. This section distills it into five questions that determine whether a vertical AI company compounds or compresses. Every question maps directly to a structural mechanism described earlier. None are cosmetic. Skip any one, and you will miss the variable that determines whether the business survives the next model release.

Question 1: What Does the Domain Topology Look Like?

Map the domain into verifiability zones at the task level, not the domain level. How much is green (strongly verifiable, will be absorbed by foundation models), how much is yellow (weakly verifiable, contested territory where evaluators determine who wins), and how much is red (non-verifiable, safe from absorption because no reward signal exists)?

A company whose value concentrates in the green zone is building on a melting iceberg. A company that is mostly red has a durable position but a small addressable market. The best companies have a distribution weighted toward yellow and red, with enough green to generate production signal that feeds the learning loop.

Question 2: Do They Own the Evaluator?

Not "do they have evals" but do they own proprietary verification infrastructure that defines what "correct" means in their domain? Harvey's BigLaw Bench. Abridge's ASR trained on 1.5 million physician-corrected encounters. Campfire's GAAP rules encoded as machine-checkable constraints. Semgrep's static analysis rules as an independent verification layer for code. The evaluator is upstream of the model. Whoever controls the reward signal controls the rate of improvement.

A company that uses off-the-shelf evals — LLM-as-judge, general benchmarks, manual spot checks — has no moat in the evaluation layer. The labs can replicate those evals trivially. A company that has built domain-specific verification infrastructure from production data owns something the labs cannot build without being embedded in the domain themselves.

Question 3: What Is Their Production Signal Density?

How many labeled examples per day? What is the quality? Is every user interaction generating training signal, or is labeling a separate, expensive process?

Cursor generates 400 million labeled examples per day from accept/reject decisions. Each one is free, high-quality, and domain-specific. Replicating this signal through synthetic data or paid labelers would cost the labs $80B–$800B. That is the moat. A company that generates 100 labeled examples per week from manual review has no signal moat. The density must be high enough, and the collection must be automatic enough, that the flywheel spins without human intervention at the labeling step.
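The replacement-cost range can be sanity-checked with simple arithmetic. The sketch below backs out the per-label price the range implies. It assumes the $80B–$800B figure is an annual cost, which the text does not state, and the function name is illustrative.

```python
# Back out the per-label price implied by the replacement-cost range.
# ASSUMPTION: the $80B-$800B range is an annual figure; the text does not say.
DAILY_LABELS = 400_000_000  # accept/reject decisions per day (from the text)
DAYS_PER_YEAR = 365


def implied_cost_per_label(total_cost_usd: float, days: int = DAYS_PER_YEAR) -> float:
    """Total replacement cost divided by the number of labels in the period."""
    return total_cost_usd / (DAILY_LABELS * days)


low = implied_cost_per_label(80e9)
high = implied_cost_per_label(800e9)
print(f"labels per year: {DAILY_LABELS * DAYS_PER_YEAR:,}")  # 146,000,000,000
print(f"implied cost per label: ${low:.2f} to ${high:.2f}")  # $0.55 to $5.48
```

Under that assumption the range corresponds to roughly fifty cents to a few dollars per label, which is the order of magnitude one would expect for paid expert annotation.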

Question 4: Where Are They on the Learning Stack?

There is a progression, and each level has a different half-life and a different ceiling on compounding.

Figure 15 — The Learning Stack: Six Levels of Defensibility

Most vertical AI companies today operate at Level 1 (harness) or Level 2 (eval-driven). The companies with the strongest positions — Cursor, Harvey, Abridge — have reached Level 3 or Level 4. Level 5 remains theoretical. The level determines the half-life: higher is more durable. The dashed border on Level 5 indicates it is not yet achieved by any production system.

Most vertical AI companies today sit at Level 1 or Level 2. The ones at Level 3 or above are the ones whose unit economics improve with each model release rather than degrade. The question is not just where they are now, but what their trajectory looks like. A company at Level 2 with a credible path to Level 4 is more investable than one already at Level 3 with no plan to go further.

Question 5: Can They Go Full-Stack?

Do they have, or can they build, all three components: evaluator + open-source base model + production signal loop? The full-stack play is the endgame, and most companies will not reach it. But the capacity to reach it — the evaluator depth, the signal density, the engineering talent for post-training — is the leading indicator of long-term durability.

Figure 16 — The Five-Question Checklist

The five questions function as a sequential filter. A company that passes all five has a structurally durable position — one that gets stronger, not weaker, as foundation models improve. A company that fails at any stage has a known vulnerability and a known timescale for that vulnerability to become existential. The rightward branches indicate the specific failure mode at each stage.

These five questions are not independent. They form a causal chain. The domain topology (Q1) determines where evaluators are needed (Q2). The evaluator enables reward signal generation (Q3). Signal density determines the learning stack level (Q4). And the evaluator, the signal, and the stack level together determine whether the full-stack play is feasible (Q5). A company that scores well on Q1 but poorly on Q2 has identified the right domain but has not built the right infrastructure. A company that scores well on Q3 but poorly on Q2 has signal it cannot effectively use.
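Read as a sequential filter, the checklist can be sketched as a function that returns the first question a company fails. Reducing each question to a yes/no judgment is a simplification made here for illustration, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """One yes/no judgment per question (a simplification for illustration)."""
    topology_weighted_yellow_red: bool  # Q1
    owns_evaluator: bool                # Q2
    dense_automatic_signal: bool        # Q3
    durable_stack_level: bool           # Q4
    can_go_full_stack: bool             # Q5


def first_failure(a: Assessment) -> Optional[str]:
    """Return the first question the company fails, or None if it passes all five."""
    checks = [
        ("Q1: domain topology", a.topology_weighted_yellow_red),
        ("Q2: evaluator ownership", a.owns_evaluator),
        ("Q3: production signal density", a.dense_automatic_signal),
        ("Q4: learning stack level", a.durable_stack_level),
        ("Q5: full-stack capacity", a.can_go_full_stack),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None


# A company with signal but no proprietary evaluator stops at Q2.
print(first_failure(Assessment(True, False, True, True, True)))
```

Checking in order matters because of the causal chain: a failure at Q2 makes the answers to Q3 through Q5 less meaningful, so the function stops at the first one.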

The deepest version of the thesis is simple: back companies that own scarce reward, scarce verification, or scarce production signal. The five questions are a structured way to determine whether a company owns any of these, how durable that ownership is, and whether it compounds.

The compositionality miracle is the foundation. Neural representations compose — this is the single fact that makes foundation models the center of gravity. Parametric knowledge compounds. Retrieved knowledge does not. This asymmetry determines everything.

But parametric learning has a fatal limitation: catastrophic forgetting. You cannot teach a deployed model new things without destroying old ones. The stability-plasticity tradeoff is unsolved. Every vertical AI company operates in the space defined by this tension.

The resolution is structural, not clever. Decompose your domain into its verifiability topology. Build differently in each zone. Own the evaluator in the contested yellow zone. Collect production signal in the green zone. Wrap human judgment in the red zone. And if you can close the loop — evaluator, open-source base, production signal — you achieve model independence. The foundation model becomes a commodity input. Your domain expertise becomes the compounding asset.

Verifiability determines who compounds. Everything else is scaffolding.

Part V

The Builder’s Playbook

Section 14

Seven Moves

Parts I through IV build a framework. This section turns it into a sequence of decisions. If you are building a vertical AI company, these are the moves that determine whether you compound through model releases or get absorbed by them. The order matters — each step depends on the one before it.

Move 1: Decompose your domain at the task level

Do not describe your domain as a single category. "Legal AI" is not a domain topology. Break it into the actual tasks your product performs. Contract review. Clause extraction. Citation verification. Due diligence synthesis. Deposition preparation. Each task has a different verification profile.

Run the verification litmus test from Section 7 on every task:

Can a non-expert check the output?
Can the check be automated?
Does the check take less time than producing the output?
Is the check deterministic?

Classify each task into a zone. Green (strongly verifiable) means foundation models will absorb it — do not build your moat here, but do use it for signal generation. Yellow (weakly verifiable) is the contested zone where your evaluator determines who wins. Red (non-verifiable) is safe from absorption but hard to improve through automated learning.
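A minimal sketch of the classification step follows. The rule that maps the four answers to a zone (all yes is green, all no is red, anything else is yellow) is an assumption made for illustration; Section 7 defines the actual test. The answers given for each legal task are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LitmusAnswers:
    """Answers to the four litmus questions for one task."""
    non_expert_can_check: bool
    check_can_be_automated: bool
    check_faster_than_producing: bool
    check_is_deterministic: bool


def classify_zone(a: LitmusAnswers) -> str:
    """ASSUMED rule of thumb: all four yes -> green, none -> red, otherwise yellow."""
    yes_count = sum([
        a.non_expert_can_check,
        a.check_can_be_automated,
        a.check_faster_than_producing,
        a.check_is_deterministic,
    ])
    if yes_count == 4:
        return "green"
    if yes_count == 0:
        return "red"
    return "yellow"


# Hypothetical answers for three of the legal tasks named above.
tasks = {
    "citation verification": LitmusAnswers(True, True, True, True),
    "clause extraction": LitmusAnswers(False, True, True, False),
    "deposition preparation": LitmusAnswers(False, False, False, False),
}
for name, answers in tasks.items():
    print(f"{name}: {classify_zone(answers)}")
```

The point of running this per task rather than per domain is that a single product usually spans all three zones.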

The goal is not to avoid green zones. It is to understand which zones generate signal and which zones hold value. Cursor's green zone (code compilation, test passing) generates 400 million training signals per day. Their yellow zone (code quality, architectural decisions) is where the moat actually sits. The green zone feeds the yellow zone.

Move 2: Build the evaluator before the product

This is the single most counterintuitive move, and the most important one. Most startups build the product first and add evals later. Reverse that.

The evaluator is not a test suite. It is domain-specific verification infrastructure that encodes what "correct" means in your domain at a level of granularity that no foundation model lab will replicate. Harvey built BigLaw Bench — 18,000 structured legal workflows — before they had a production model worth testing against it. Abridge trained proprietary ASR evaluation against 1.5 million medical encounters. Campfire encoded GAAP rules as machine-checkable constraints.

Why the evaluator comes first: it is the reward surface that everything else optimizes against. Without it, you cannot generate reward signal for RL. You cannot measure whether model updates improved your domain performance. You cannot distinguish between a model regression and a product bug. Every dollar spent on model development without a strong evaluator is a dollar spent with no feedback loop.

Start with the yellow-zone tasks from Move 1. For each one, ask: what would a domain expert check, and can any part of that check be automated? Even partial automation — checking citation accuracy while leaving argument quality to humans — creates a reward surface you can train against.
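Partial automation can be sketched as an evaluator that scores the checkable part and leaves the rest empty for a human. The citation identifiers, the lookup set, and the choice to give no credit when nothing is cited are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class EvalResult:
    citation_accuracy: float            # automated, in [0, 1]
    argument_quality: Optional[float]   # left as None for a human reviewer


def evaluate(cited: List[str], known_valid: Set[str]) -> EvalResult:
    """Score only the part of the check that can be automated."""
    if not cited:
        # Design choice (an assumption): no citations earns no credit,
        # so the reward cannot be gamed by citing nothing.
        return EvalResult(citation_accuracy=0.0, argument_quality=None)
    valid = sum(1 for c in cited if c in known_valid)
    return EvalResult(citation_accuracy=valid / len(cited), argument_quality=None)


# Hypothetical identifiers standing in for real citations.
result = evaluate(["CITE-001", "CITE-002", "CITE-999"], {"CITE-001", "CITE-002"})
print(result)
```

Even this narrow score is usable as a reward term, while the `None` field makes explicit which part of "correct" still depends on expert judgment.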

Move 3: Design every interaction to generate training signal

The product is not just a product. It is a data collection instrument. Every user interaction should generate a labeled training example without additional cost or effort.

The hierarchy of signal quality:

Implicit accept/reject — the user accepts or discards the output. Free, high-volume, but noisy. Cursor gets 400 million of these per day.
Corrections — the user edits the output. Medium-volume, high-quality. Abridge gets physician corrections on every clinical note.
Outcome verification — a downstream system confirms correctness. Low-volume, highest quality. Campfire gets CFO approval on every categorization.

Design for all three. The product's UX should make acceptance visible (a "use this" button, not silent copy-paste), corrections structured (tracked diffs, not unmonitored edits), and outcomes captured (what happened after the user acted on the AI's output).
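One way to capture all three tiers is a single event record with a field for each kind of signal. The schema and field names below are hypothetical, not taken from any of the companies mentioned.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SignalKind(Enum):
    ACCEPT_REJECT = "accept_reject"  # implicit, high-volume, noisy
    CORRECTION = "correction"        # user edited the output
    OUTCOME = "outcome"              # downstream system confirmed the result


@dataclass
class SignalEvent:
    kind: SignalKind
    model_output: str
    accepted: Optional[bool] = None
    corrected_output: Optional[str] = None
    outcome_confirmed: Optional[bool] = None


def to_training_example(e: SignalEvent) -> dict:
    """Turn one event into a labeled example; corrections carry the preferred text."""
    if e.kind is SignalKind.CORRECTION:
        return {"rejected": e.model_output, "chosen": e.corrected_output}
    label = e.accepted if e.kind is SignalKind.ACCEPT_REJECT else e.outcome_confirmed
    return {"output": e.model_output, "label": label}


event = SignalEvent(SignalKind.CORRECTION, "draft note", corrected_output="fixed note")
print(to_training_example(event))
```

Corrections map naturally to preference pairs, while accepts and outcomes map to scalar labels, which is why the sketch emits two shapes.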

If your users interact with your product and you do not capture training signal from every interaction, you are leaving your flywheel on the table.

Move 4: Use RL, not SFT

This follows directly from Moves 2 and 3. You now have an evaluator (reward surface) and production signal (training data). The question is how to use them.

Shenfeld et al. (2025) showed that on-policy RL retains 93% accuracy on prior tasks while supervised fine-tuning degrades catastrophically.9 The mechanism: SFT pulls the model toward an external target distribution that can sit far from its current behavior. On-policy RL updates on the model's own samples, so the solutions it finds stay close to the base policy, preserving existing capabilities while adding new ones.

The implication is direct: if you only have datasets, you can only do SFT, and you will suffer catastrophic forgetting. If you have an evaluator, you can do RL, and you preserve general capabilities while specializing. This is why Move 2 (evaluator first) gates Move 4. No evaluator means no reward signal means no RL means forgetting.

Feature-constrained LoRA makes this accessible.14 Rank-32 LoRA on Qwen3-8B matched full supervised fine-tuning at 9× lower cost. You do not need lab-scale compute. You need lab-quality signal.
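To see why low-rank adapters are cheap, it helps to count trainable parameters for one weight matrix. This is standard LoRA arithmetic, not the feature-constrained variant, and it is not the source of the 9× cost figure, which comes from the cited study. The 4096 layer width is an assumed size for illustration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning trains every entry of the d_out x d_in matrix."""
    return d_in * d_out


D = 4096   # ASSUMED layer width, for illustration only
RANK = 32  # rank from the text

lora = lora_trainable_params(D, D, RANK)
full = full_params(D, D)
print(lora, full)                                # 262144 16777216
print(f"trainable fraction: {lora / full:.4f}")  # 0.0156
```

For a square matrix the fraction is 2r/d, so the saving grows with layer width.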

Move 5: Build on open-source foundations

Use Llama, Mistral, DeepSeek, or Qwen as your base model. This is not a quality concession. It is a strategic necessity.

If your training loop runs on a proprietary model, the provider can observe your signal, restrict your access, or launch a competing product. If it runs on an open-source model, no one has visibility into your domain-specific post-training. The foundation model becomes a commodity input — interchangeable, upgradable, non-threatening.

This does not mean you cannot use proprietary models in production. Harvey routes between GPT-4, Claude, and internal models. The point is that your training infrastructure — the loop that generates your compounding advantage — should not depend on a model provider who has both the capability and the incentive to absorb what you have learned.

Move 6: Close the loop

If you have executed Moves 1 through 5, you now have the components of the full-stack architecture from Section 11:

Domain evaluator (Move 2) defines what "correct" means
Production signal loop (Move 3) generates labeled data at scale
Open-source base model (Move 5) provides model independence
RL post-training (Move 4) updates the model on your signal through your evaluator

The result is a closed flywheel. The domain model produces outputs. Users interact with those outputs, generating production signal. The signal is scored by your evaluator. The scored signal trains the next model version via RL. The improved model produces better outputs. Each cycle tightens the loop.
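The cycle described above can be sketched as a single function that takes each stage as a callable. Everything here is a toy: the stubs stand in for real generation, signal capture, evaluation, and RL training, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Model = Dict[str, int]  # toy stand-in for model weights


def run_cycle(
    model: Model,
    prompts: List[str],
    generate: Callable[[Model, str], str],
    collect_signal: Callable[[str], bool],
    score: Callable[[str, bool], float],
    train: Callable[[Model, List[Tuple[str, float]]], Model],
) -> Model:
    """One turn of the loop: generate, collect signal, score, train."""
    outputs = [generate(model, p) for p in prompts]
    signals = [collect_signal(o) for o in outputs]
    scored = [(o, score(o, s)) for o, s in zip(outputs, signals)]
    return train(model, scored)


# Toy stubs so the sketch runs; none of these is a real training step.
def toy_generate(model: Model, prompt: str) -> str:
    return f"v{model['version']}:{prompt}"

def toy_collect(output: str) -> bool:
    return len(output) % 2 == 0  # pretend users accept even-length outputs

def toy_score(output: str, accepted: bool) -> float:
    return 1.0 if accepted else 0.0

def toy_train(model: Model, scored: List[Tuple[str, float]]) -> Model:
    return {"version": model["version"] + 1}


model: Model = {"version": 0}
for _ in range(3):
    model = run_cycle(model, ["a", "bb"], toy_generate, toy_collect, toy_score, toy_train)
print(model)  # {'version': 3}
```

Passing the stages in as arguments mirrors the claim in the text: the evaluator (`score`) and the signal source (`collect_signal`) are the parts you own, while the base model behind `generate` is swappable.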

This is the endgame. A company with a closed loop improves at a rate determined by its production volume and evaluator quality, not by the foundation model lab's RL budget. The lab is spreading $1B across every domain. You are concentrating $10M on one. If your domain is narrow enough and your signal dense enough, the math works.
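The budget comparison reduces to one division. The sketch assumes the lab splits its budget evenly across domains, which is a simplification; the two dollar figures are the ones in the text.

```python
LAB_BUDGET_USD = 1_000_000_000  # from the text
YOUR_BUDGET_USD = 10_000_000    # from the text


def lab_spend_per_domain(n_domains: int) -> float:
    """ASSUMPTION: the lab splits its budget evenly across domains."""
    return LAB_BUDGET_USD / n_domains


# Number of domains at which the lab's per-domain spend equals yours.
break_even_domains = LAB_BUDGET_USD // YOUR_BUDGET_USD
print(break_even_domains)  # 100
```

If the lab is covering more than 100 domains, your concentrated spend exceeds its per-domain spend under this even-split assumption.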

Move 7: Know your half-lives

Every layer in the stack has a shelf life. Know where you are building and how long it survives before the next model release absorbs it.

Layer Half-Lives

Wrappers (prompt templates, thin UI): 4–6 months. If this is your product, you have one or two model releases before you are redundant.
Cognitive scaffolding (chain-of-thought, tool orchestration): 6–12 months. SWE-agent's 13 custom commands were absorbed in 10 months.
Fine-tuned models: ~14 months. Harvey's legal fine-tune was surpassed by seven frontier models on its own benchmark within 14 months.
Evaluation infrastructure: Years. No evidence of labs absorbing domain-specific evaluation capabilities.

The half-life tells you how fast you need to climb. If you are at Level 1 (harness), you have 6–12 months to reach Level 2 (eval-driven) before the next model release absorbs your scaffolding. If you are at Level 2, the timeline is more forgiving — but the goal is still Level 4 (domain post-training), where you are improving faster than the labs can absorb.
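If "half-life" is read literally as exponential decay, the remaining advantage of a layer after a given number of months follows from one formula. That reading is an assumption, as is the use of range midpoints for the wrapper and scaffolding layers. Evaluation infrastructure is omitted because the text gives no number for it.

```python
def remaining_fraction(months: float, half_life_months: float) -> float:
    """Fraction of a layer's advantage left after `months`, under exponential decay."""
    return 0.5 ** (months / half_life_months)


# Half-lives in months. Midpoints of the stated ranges are a simplification.
HALF_LIVES = {
    "wrappers": 5.0,              # stated range: 4-6 months
    "cognitive scaffolding": 9.0, # stated range: 6-12 months
    "fine-tuned models": 14.0,    # stated: ~14 months
}

for layer, half_life in HALF_LIVES.items():
    print(f"{layer}: {remaining_fraction(12, half_life):.2f} left after 12 months")
```

Tracking the real numbers against each model release, as the text recommends, would replace these assumed half-lives with measured ones.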

The race is not against competitors. It is against the foundation model's absorption rate. Every quarter, the model gets better at tasks in your green zone. The question is whether your yellow-zone advantage compounds faster than the green zone expands.

Section 15

The Sequence

The seven moves above have a natural ordering that maps to company stage.

Pre-product (months 0–3): Decompose the domain (Move 1). Build the evaluator (Move 2). Most founders will resist this — the instinct is to ship product. Resist the instinct. The evaluator is the foundation everything else compounds on. Three months spent on evaluation infrastructure saves three years of building on a melting iceberg.

First product (months 3–9): Ship the product, but design every interaction for signal capture (Move 3). The product is both a revenue source and a data collection instrument. Measure signal density from day one: labeled examples per day, collection cost per example, signal quality. If the numbers are not growing, the flywheel is not spinning.
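A rough way to check whether the numbers are growing is to compare the most recent week of labeled examples against the week before. The seven-day window and the function names are assumptions made for illustration.

```python
from statistics import mean
from typing import List


def is_flywheel_spinning(daily_labeled: List[int], window: int = 7) -> bool:
    """True if the latest window's average daily volume beats the prior window's."""
    if len(daily_labeled) < 2 * window:
        raise ValueError("need at least two full windows of data")
    recent = mean(daily_labeled[-window:])
    prior = mean(daily_labeled[-2 * window:-window])
    return recent > prior


def cost_per_example(total_collection_cost_usd: float, labeled_examples: int) -> float:
    """Collection cost divided by the number of labeled examples it produced."""
    return total_collection_cost_usd / labeled_examples


# Hypothetical daily counts: volume doubles in the second week.
counts = [10] * 7 + [20] * 7
print(is_flywheel_spinning(counts))  # True
print(cost_per_example(70.0, sum(counts)))
```

Signal quality is harder to reduce to one number and is left out of this sketch.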

Post-training (months 9–18): Begin RL on production signal through your evaluator (Move 4). Move to an open-source base (Move 5). This is where most vertical AI companies stall — they have signal and evals but lack the engineering to run post-training pipelines. Hire for this capability early. It is the difference between Level 2 and Level 4.

Closed loop (months 18+): Close the loop (Move 6). Ship model updates trained on your own production data, scored by your own evaluator, on your own open-source base. At this point, every model release from the labs helps you — a better base model means your post-training starts from a higher floor. You are no longer competing with the labs. You are compounding on top of them.

Throughout: know your half-lives (Move 7). If the absorption rate is faster than expected, accelerate the sequence. If it is slower, you have more time to build depth at each level. The half-lives are empirical, not theoretical — track them against each model release and adjust.

Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." arXiv. arXiv:2209.10652. Provides evidence that neural networks learn compositional, superposed representations — features sharing dimensions in ways that enable recombination far beyond what was explicitly trained.
Gunasekar, S., Zhang, Y., Aneja, J., et al. (2023). "Textbooks Are All You Need." arXiv. arXiv:2306.11644. Phi-1 demonstrated that code pre-training on high-quality textbook data transfers to general reasoning, providing direct evidence for the code-to-reasoning compositionality claim.
McCloskey, M. & Cohen, N. J. (1989). "Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem." Psychology of Learning and Motivation, 24, 109–165. The foundational demonstration that neural networks trained sequentially on different tasks suffer catastrophic forgetting of earlier tasks.
van de Ven, G. M. & Tolias, A. S. (2019). "Three Scenarios for Continual Learning." arXiv. arXiv:1904.07734. Distinguishes task-incremental learning (model knows which task to perform), domain-incremental learning (same task, shifting distribution), and class-incremental learning (model must distinguish between all classes seen so far). Class-incremental is the most demanding and maps directly to Lichkovski's compositionality desideratum.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." Proceedings of the National Academy of Sciences. arXiv:1612.00796. Introduces Elastic Weight Consolidation (EWC), which selectively slows learning on weights important to previous tasks using Fisher information.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., et al. (2016). "Progressive Neural Networks." arXiv. arXiv:1606.04671. Avoids catastrophic forgetting by freezing prior columns and adding new capacity per task, enabling lateral transfer without interference.
Lopez-Paz, D. & Ranzato, M. (2017). "Gradient Episodic Memory for Continual Learning." NeurIPS. arXiv:1706.08840. Stores a subset of examples from each task and constrains gradient updates to avoid increasing loss on stored examples.
Lin, J., et al. (2025). "Continual Learning via Sparse Memory Finetuning." arXiv. arXiv:2510.15103. Shows that updating only a small fraction of parameters per task reduces forgetting from 89% to 11%, offering a practical path toward satisfying the stability-plasticity tradeoff.
Shenfeld, I., Pari, J., & Agrawal, P. (2025). "RL's Razor: Why Online Reinforcement Learning Forgets Less." arXiv. arXiv:2509.04259. Demonstrates that on-policy RL retains 93% accuracy on prior tasks while supervised fine-tuning degrades catastrophically. Major evidence that RL's compositionality is not incidental but structural.
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv. arXiv:2408.03314. Shows that a small model with optimally allocated test-time compute can match a 14x larger model, reframing the harness ceiling as partly an inference allocation problem.
Jimenez, C. E., Yang, J., Wettig, A., et al. (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv. arXiv:2310.06770. Introduced the standardized benchmark of 2,294 real GitHub issues with automated test verification, pulling functional code correctness into the strongly verifiable zone.
Lightman, H., Kosaraju, V., Burda, Y., et al. (2023). "Let's Verify Step by Step." arXiv. arXiv:2305.20050. Demonstrates that process reward models (scoring each reasoning step) substantially outperform outcome reward models (scoring only the final answer), expanding the surface area of what can be verified and trained on.
The "verification fidelity" framework synthesized here draws on several public sources. Wei, J. (2025). "Asymmetry of Verification and Verifier's Rule." jasonwei.net. Formalizes this as: "The ease of training AI to solve a task is proportional to how verifiable the task is," decomposing verifiability into objective truth, speed, scalability, low noise, and continuous reward. Brown, N. (2024). "Generator-Verifier Gap." Sequoia Capital podcast. Frames the mechanism as the gap between the difficulty of generating a correct solution and verifying one — domains with large gaps (math, code) improve first under RL. DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. Provides empirical evidence: rule-based (cheap, deterministic) rewards were chosen over neural reward models because the latter suffer reward hacking at scale — confirming that reward signal fidelity gates training velocity. See also footnotes 9, 11, and 12 above.
ngeo (2026). "Applied Continual Learning." ngeo.dev/studies/applied/continual_learning. Demonstrates feature-constrained LoRA: projecting weight updates onto interpretable feature directions achieves domain adaptation at a fraction of full fine-tuning cost (rank-32 LoRA on Qwen3-8B matched full SFT at 9x lower cost). Synthesizes results from Titans memory layers (arXiv:2501.00663), Thinking Machines on-policy distillation, and Cursor's continuous RL pipeline shipping new Composer checkpoints every five hours.
