Breaking the Context Ceiling: Implementing Recursive Language Models with LangGraph and TypeScript

This post is adapted from my talk at Node Congress 2026.


TL;DR

  • Context windows are marketing numbers: LLMs struggle with information buried in the middle of long prompts, even when it technically "fits".
  • Recursive Language Models (RLM) treat prompts as programmatic environments where an LLM writes code to explore documents via a sandboxed VM. The document never enters the LLM's context; it's accessed through code as a symbolic handle.
  • Our implementation uses LangGraph to orchestrate code generation, sandboxed execution, and synthesis.
  • We ran it on a 904K-character document (213K tokens): a naive single-pass call fails with a 400 error, while RLM generated exploration code that made 8 recursive sub-LM calls and found 7 critical breaking changes.
  • LangGraph gives you the state management, iteration, and conditional routing needed to implement RLM cleanly in ~400 lines of TypeScript.


The 200K Token Lie

You've heard the pitch. Claude offers up to 200K tokens (with 1M in beta). GPT-4o handles 128K tokens (with newer models pushing past 1M). Gemini supports up to 2 million tokens. So you stuff your entire API migration guide into the prompt, ask "what are all the breaking changes?" and get back... a partial list that confidently misses the webhook schema change buried on page 34.

What happened? Your document was well under the token limit. The API didn't complain. You paid for all those tokens.

Here's the uncomfortable truth: context windows are theoretical maximums, not practical working limits. Research by Liu et al. (2024) at Stanford demonstrated that LLMs experience significant performance degradation when relevant information is placed in the middle of long contexts. They're decent at using information near the beginning (primacy) and near the end (recency), but everything in between becomes progressively fuzzier. The phenomenon is so consistent there's a name for it: the "lost in the middle" problem.

The answer isn't to wait for bigger context windows. Even if we had 10 million token windows tomorrow, the attention mechanism that powers these models would still struggle with needle-in-haystack retrieval across massive contexts. We need a different approach.

The Problem with Long Context

Let's get specific about why context windows fail in practice.

The "needle in a haystack" benchmark is the sanitized version of this problem. Drop a specific fact into a sea of irrelevant text and ask the model to retrieve it. Leading frontier LLMs (Claude Opus, GPT-5, Gemini Pro) now score well on this synthetic test, achieving >95% accuracy. But real documents aren't haystacks with one needle — they're complex webs of interrelated information where understanding requires synthesis across multiple sections.

RAG (Retrieval Augmented Generation) helps by chunking documents, embedding them, and retrieving only relevant chunks. But traditional RAG has blind spots. It's optimized for finding specific facts, not for understanding how different sections of a document relate to each other. If you're analyzing a monorepo and need to understand how the authentication middleware in auth/middleware.ts interacts with the rate limiter in services/limiter.ts and the retry logic in lib/resilience.ts, basic RAG's independent chunk retrieval often misses these connections. Advanced techniques like GraphRAG and agentic RAG have begun to address these limitations, but they add significant complexity and are still maturing.

Map-reduce gets closer. Split the document, process each chunk, aggregate the results. But classic map-reduce is too rigid. It processes every chunk even when most are irrelevant. Single-pass map-reduce runs one iteration and stops. While iterative variants like refine chains exist in LangChain, they add complexity and still lack the dynamic chunk selection that makes RLM effective. The "map" phase is embarrassingly parallel but dumb — no guidance about what to focus on.
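For contrast, here is roughly what single-pass map-reduce looks like. This is a sketch rather than code from any library: callLLM is a hypothetical helper that sends a prompt and returns text, and the chunk size is arbitrary.

// Naive single-pass map-reduce (sketch). `callLLM` is a hypothetical
// helper that sends a prompt to an LLM and returns the text response.
declare function callLLM(prompt: string): Promise<string>;

async function mapReduceAnswer(document: string, query: string): Promise<string> {
  // Split blindly into fixed-size chunks: no notion of which parts matter
  const chunkSize = 8_000;
  const chunks: string[] = [];
  for (let i = 0; i < document.length; i += chunkSize) {
    chunks.push(document.slice(i, i + chunkSize));
  }

  // Map: process every chunk, relevant or not
  const partials = await Promise.all(
    chunks.map((chunk) => callLLM(`${query}\n\n${chunk}`))
  );

  // Reduce: one aggregation pass, then stop. No iteration, no refinement.
  return callLLM(`Combine these partial answers:\n\n${partials.join("\n---\n")}`);
}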

What we actually need is something that combines the efficiency of selective processing with the thoroughness of multi-pass analysis. We need an approach that can decide which parts of a document deserve deep analysis, process those parts in parallel, accumulate structured findings, and iterate if the initial pass missed something important.

Enter the Recursive Language Model

The Recursive Language Model concept, introduced by Zhang, Kraska, and Khattab at MIT (2025), offers a principled solution to the long-context problem. Instead of fighting context limits by stuffing more tokens into a single prompt, RLM works deliberately within those limits through iteration and programmatic orchestration.

The MIT Paper's Architecture

The original RLM paper describes a single root language model that writes Python code in a persistent REPL environment. The root LM can invoke sub-LMs recursively through programmatic function calls, with each sub-LM call happening sequentially (blocking) and returning structured results that the root LM processes through code execution. This creates a feedback loop where the model's code output guides its own next steps — hence "recursive." The paper's implementation treats prompts as programmatic environments, not as a multi-agent system with distinct roles.

Our Practical Adaptation for TypeScript

For our practical TypeScript implementation with LangGraph, we stay close to the MIT paper's core architecture but adapt it for production use in Node.js. Our implementation has three nodes:

The Orchestrator is where the LLM writes code. It never sees the full document — only metadata like length and results from previous turns. Its job is to generate TypeScript code that will explore the document programmatically. The LLM decides the exploration strategy: which sections to slice, what patterns to search for, how many recursive analysis calls to make. This is the "symbolic programming" layer from the paper.

The Code Executor runs the generated code in a sandboxed Node.js VM context. The document lives here as a global variable P (the "symbolic handle" from the paper). The sandbox provides utility functions: search() and searchRegex() for finding patterns, slice() and extractSection() for grabbing text, and crucially, analyze() and analyzeMultiple() for making recursive sub-LM calls. When the code calls analyze(query, text), the sandbox triggers an LLM invocation — this is the "symbolic recursion" mechanism. The VM context persists across turns, so variables defined in one iteration remain available in the next.

The Synthesizer takes the execution results — not raw document text, but the structured output from code execution — and produces the final answer. It's working with metadata and findings extracted by the code, making synthesis tractable even for very long documents.

The key insight from RLM is zero-token prompting. The document never enters the LLM's context. The LLM writes code to explore it programmatically. This breaks the context ceiling by keeping the LLM's context lean while giving it access to arbitrarily large documents through code.

Why LangGraph?

LangGraph is a graph execution framework from the LangChain team, designed specifically for building stateful, multi-step LLM workflows. It's not another prompt wrapper library. It's a runtime for applications where LLM calls are nodes in a graph, and the graph structure determines how information flows between them.

The framework gives you three critical capabilities for implementing RLM:

State management with reducers. Each node in the graph can read and modify a shared state object. You define reducers that specify how state updates merge — do new findings append to the list, or replace it? This is essential for RLM's accumulation pattern.

Conditional edges. After a node executes, you can route to different next nodes based on the current state. The orchestrator can route to code executor or synthesis based on turn count. The code executor can route back to the orchestrator for another turn or proceed to synthesis based on whether setFinal() was called.

Native support for cycles. Unlike traditional DAGs, LangGraph graphs can have loops. You can send execution back to a previous node based on conditions. This enables the multi-pass iteration that makes RLM effective.

The framework has a well-maintained TypeScript SDK with proper type inference and async/await throughout. While the Python SDK often receives experimental features first, the TypeScript SDK has reached production-ready maturity. No callback hell, no stringly-typed state keys. For developers coming from the Node.js ecosystem, the async/await patterns and middleware-like node composition will feel familiar.

LangGraph is the perfect fit for RLM because the pattern maps directly to its primitives: stateful nodes (orchestrator, code executor, synthesizer), conditional routing (based on execution results and turn count), and iteration (orchestrator ↔ code executor cycle).

Implementation Walkthrough

Let's build a working RLM system. I'll walk through the key pieces with actual code.

Defining the State

The state object is the spine of the application. Every node reads from it and writes to it. LangGraph uses an Annotation.Root structure to define both the shape of the state and how updates merge.

import { Annotation } from "@langchain/langgraph";

const RLMState = Annotation.Root({
  // The user's question
  query: Annotation<string>,

  // The full source document (injected as P in sandbox)
  document: Annotation<string>,

  // LLM-generated TypeScript code to execute in sandbox
  generatedCode: Annotation<string>({
    reducer: (_, b) => b, // Replace on each turn
    default: () => "",
  }),

  // Results from sandbox code execution
  executionOutput: Annotation<string>({
    reducer: (_, b) => b, // Replace on each turn
    default: () => "",
  }),

  // Metadata accumulated across all turns (what the LLM "learned")
  executionMetadata: Annotation<string[]>({
    reducer: (a, b) => [...a, ...b], // Accumulate
    default: () => [],
  }),

  // Current turn counter
  turn: Annotation<number>({
    reducer: (_, b) => b,
    default: () => 0,
  }),

  // Maximum turns before forcing synthesis
  maxTurns: Annotation<number>({
    reducer: (_, b) => b,
    default: () => 5,
  }),

  // Synthesized final answer
  finalAnswer: Annotation<string>({
    reducer: (_, b) => b,
    default: () => "",
  }),
});

type RLMStateType = typeof RLMState.State;

Pay attention to the executionMetadata field. Its reducer is (a, b) => [...a, ...b] — this appends new metadata to the existing array rather than replacing it. This accumulation pattern is central to RLM. Each turn adds metadata about what the code discovered, building up the LLM's understanding across multiple passes.

The turn counter tracks how many code generation/execution cycles we've completed. The maxTurns cap prevents infinite loops if the orchestrator keeps generating more exploration code.

The Orchestrator

The orchestrator's job is to generate TypeScript code that will explore the document. It never sees the full document — only metadata.

async function orchestrator(
  state: RLMStateType
): Promise<Partial<RLMStateType>> {
  const turnNum = state.turn + 1;

  // Build prompt with metadata about P, not P itself
  const metadataInfo =
    state.executionMetadata.length > 0
      ? `\n\nPrevious exploration results:\n${state.executionMetadata.join("\n")}`
      : "";

  const prompt = `You have access to a document stored in variable P.

Document metadata:
- Length: ${state.document.length.toLocaleString()} characters
- Available via: P (global variable)
${metadataInfo}

User query: ${state.query}

Write TypeScript code using the utility functions to find the answer. Use analyze() or analyzeMultiple() for sub-analyses. Call setFinal() when you have the complete answer.`;

  const response = await llm.invoke([
    { role: "system", content: RLM_SYSTEM_PROMPT },
    { role: "user", content: prompt },
  ]);

  // Extract code from markdown code blocks (response.content is typed as MessageContent)
  const content = response.content as string;
  const codeMatch = content.match(/```typescript\s*([\s\S]*?)\s*```/);
  const generatedCode = codeMatch ? codeMatch[1].trim() : content.trim();

  return {
    generatedCode,
    turn: turnNum,
  };
}

The orchestrator prompts the LLM with a system message explaining the available utility functions: search(), searchRegex(), slice(), extractSection(), analyze(), analyzeMultiple(), setFinal(), and log(). The LLM writes code that uses these functions to explore the document P.

Critically, the LLM sees only the document's length and previous turn metadata — not the document itself. This keeps the context lean. The document lives in the sandbox as variable P, accessed only through code execution.
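To make this concrete, here is the kind of code the orchestrator might emit on a first turn. This is an illustrative sketch rather than captured output; the search term and slice sizes are hypothetical, but every function it calls is one of the sandbox utilities listed above, and the "Final result" prefix is what routeAfterExecution (shown later in Wiring the Graph) looks for.

// Hypothetical first-turn exploration code emitted by the orchestrator.
// Deterministic filtering first, then recursive sub-LM calls on the survivors.
const hits = search("Breaking Change");
log(`Found ${hits.length} occurrences of "Breaking Change"`);

// Grab a window of text around each hit (no LLM tokens spent yet)
const sections = hits.map((h) => slice(h.index, h.index + 5_000));

// Recursive sub-LM calls, fanned out in parallel
const findings = await analyzeMultiple(
  "Summarize the breaking change described in this section.",
  sections
);

findings.forEach((f, i) => log(`Section ${i + 1}: ${f}`));

// Signal completion; the "Final result" prefix is what the router checks for
setFinal(`Final result. ${findings.join("\n\n")}`);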

The Code Executor

The code executor runs the generated code in a sandboxed Node.js VM. This is where the document lives as variable P and where recursive sub-LM calls happen.

async function codeExecutor(
  state: RLMStateType
): Promise<Partial<RLMStateType>> {
  try {
    const sandbox = getOrCreateSandbox(state.document, handleSubRLM);
    const result = await sandbox.execute(state.generatedCode);

    return {
      executionOutput: result.output,
      executionMetadata: [result.metadata],
    };
  } catch (error) {
    const errorMsg = error instanceof Error ? error.message : String(error);
    return {
      executionOutput: `Error: ${errorMsg}`,
      executionMetadata: [`Error during execution: ${errorMsg}`],
    };
  }
}
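Two helpers appear here that the post doesn't show elsewhere: getOrCreateSandbox, which caches the Sandbox instance so the VM context survives across turns, and handleSubRLM, the function the sandbox invokes for each recursive analyze() call. Here is a minimal sketch of both, assuming the same ChatAnthropic client the orchestrator uses; the model name and the 40K-character truncation are placeholders, not values from the repo.

import { ChatAnthropic } from "@langchain/anthropic";

// Shared client (placeholder model name)
const llm = new ChatAnthropic({ model: "claude-sonnet-4-5" });

// Cache the sandbox so the same VM context (and its variables) persists across turns
let sandbox: Sandbox | null = null;

function getOrCreateSandbox(
  document: string,
  llmFunction: (query: string, text: string) => Promise<string>
): Sandbox {
  if (!sandbox) {
    sandbox = new Sandbox(document, llmFunction);
  }
  return sandbox;
}

// The recursive sub-LM call: a plain LLM invocation over a slice of the document.
// Truncating to 40K characters is an assumption to keep each sub-call well inside the context limit.
async function handleSubRLM(query: string, text: string): Promise<string> {
  const response = await llm.invoke([
    { role: "system", content: "Answer the query using only the provided text." },
    { role: "user", content: `${query}\n\n---\n\n${text.slice(0, 40_000)}` },
  ]);
  return response.content as string;
}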

The Sandbox class wraps Node's vm.createContext() API to create an isolated execution environment. Here's the key part of the sandbox implementation:

import * as vm from "node:vm";

interface SandboxResult {
  output: string;
  metadata: string;
  isFinal: boolean;
}

export class Sandbox {
  private context: vm.Context;
  private logs: string[] = [];
  private finalResult = "";
  private finalCalled = false;

  constructor(
    document: string,
    llmFunction: (query: string, text: string) => Promise<string>
  ) {
    const P = document; // closed over by the utility functions below

    this.context = vm.createContext({
      // The symbolic handle: the document lives here
      P,

      // Search for a literal pattern
      search: (pattern: string): Array<{ index: number; context: string }> => {
        const results: Array<{ index: number; context: string }> = [];
        let pos = 0;
        while (pos < P.length) {
          const index = P.indexOf(pattern, pos);
          if (index === -1) break;
          results.push({
            index,
            context: P.slice(Math.max(0, index - 100), index + 200),
          });
          pos = index + 1;
        }
        return results;
      },

      // Regex search
      searchRegex: (pattern: string, flags?: string): Array<{ index: number; match: string; context: string }> => {
        // ... similar to search(), but with RegExp
      },

      // Slice document
      slice: (start: number, end: number): string => P.slice(start, end),

      // Extract text between markers
      extractSection: (startMarker: string, endMarker: string): string[] => {
        // ... finds all text between marker pairs
      },

      // Recursive sub_RLM call: invokes the LLM on a sub-query
      analyze: async (query: string, text: string): Promise<string> => {
        return await llmFunction(query, text);
      },

      // Parallel sub_RLM on multiple texts
      analyzeMultiple: async (query: string, texts: string[]): Promise<string[]> => {
        const promises = texts.map((text) => llmFunction(query, text));
        return Promise.all(promises);
      },

      // Signal final result
      setFinal: (result: string): void => {
        this.finalResult = result;
        this.finalCalled = true;
      },

      // Log metadata for the orchestrator
      log: (message: string): void => {
        this.logs.push(message);
      },
    });
  }

  async execute(code: string): Promise<SandboxResult> {
    // Fresh per-turn logs and final flag; the VM context itself persists
    this.logs = [];
    this.finalResult = "";
    this.finalCalled = false;

    // Wrap in async IIFE so top-level await works
    const wrapped = `(async () => {\n${code}\n})()`;
    const script = new vm.Script(wrapped);
    await script.runInContext(this.context);

    return {
      output: this.finalCalled ? this.finalResult : this.logs.join("\n"),
      metadata: this.logs.join("; "),
      isFinal: this.finalCalled,
    };
  }
}

When the LLM-generated code calls analyze(query, text), the sandbox invokes the LLM on that text — this is the recursive sub-LM mechanism. The key insight: these sub-calls happen inside the sandbox, initiated by code, not by the LLM verbalizing "I will now analyze section X." The LLM writes a loop that programmatically launches many sub-calls.

The VM context persists across turns. Variables defined in turn 1 remain available in turn 2. This enables iterative refinement.
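In practice, because execute() wraps each turn in an async IIFE, it's plain assignments (rather than const declarations) that land on the sandbox's global object and carry over to later turns. An illustrative two-turn sketch, with a hypothetical search term:

// Turn 1: stash candidate hits on the sandbox global (assignment, not `const`,
// so the binding outlives this turn's async IIFE)
candidateHits = search("## Breaking");
log(`Turn 1: found ${candidateHits.length} candidate sections`);

// Turn 2 (a separate execution in the same context): candidateHits is still defined
details = await analyzeMultiple(
  "List every breaking change in this section.",
  candidateHits.map((h) => slice(h.index, h.index + 4_000))
);
log(`Turn 2: analyzed ${details.length} sections`);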

The Synthesizer

The synthesizer receives the execution results and produces the final answer.

async function synthesizer(
  state: RLMStateType
): Promise<Partial<RLMStateType>> {
  const prompt = `Query: ${state.query}

Execution results from code analysis:
${state.executionOutput}

Previous exploration metadata:
${state.executionMetadata.join("\n")}

Synthesize these results into a clear, well-structured answer to the user's query.`;

  const response = await llm.invoke([
    {
      role: "system",
      content: "Synthesize execution results into a complete answer. Be clear and concise.",
    },
    {
      role: "user",
      content: prompt,
    },
  ]);

  return {
    finalAnswer: response.content as string,
  };
}

The synthesizer works with the output from code execution, not raw document text. The code already did the exploration work: searching, slicing, and recursively analyzing sections. The synthesizer just needs to package those results into a coherent answer.

This is fundamentally different from traditional chunking approaches. The synthesizer isn't working with pre-cut document chunks or manually extracted findings. It's working with the output of a programmatic exploration strategy that the LLM itself designed.

Why This Architecture Works

The key insight is zero-token document prompting. The document never enters the LLM's context window. Instead:

  1. Symbolic handle: The document lives in the sandbox as variable P. The LLM writes code to access it.
  2. Code + reasoning: The LLM uses deterministic operations (search(), slice()) to filter before spending tokens on analyze() calls.
  3. Programmatic recursion: The LLM writes a loop that calls analyzeMultiple() on 8 sections — not 8 separate verbal delegation steps.
  4. Iterative refinement: If the first turn's code doesn't find enough, the LLM sees the metadata and writes better code for turn 2.

This breaks the context ceiling by keeping the LLM's working memory lean while giving it access to arbitrarily large documents through code execution.

Wiring the Graph

Now we assemble the nodes into a graph with conditional routing.

import { StateGraph, START, END } from "@langchain/langgraph";

function routeAfterOrchestrator(state: RLMStateType): string {
  if (state.turn >= state.maxTurns) {
    return "synthesizer";
  }
  return "codeExecutor";
}

function routeAfterExecution(state: RLMStateType): string {
  // Check if code called setFinal (indicated in metadata or output)
  const hasFinalResult =
    state.executionOutput.includes("Final result") ||
    state.executionMetadata.some((m) => m.includes("Final result"));

  if (hasFinalResult || state.turn >= state.maxTurns) {
    return "synthesizer";
  }

  return "orchestrator"; // Loop back for another turn
}

const graph = new StateGraph(RLMState)
  .addNode("orchestrator", orchestrator)
  .addNode("codeExecutor", codeExecutor)
  .addNode("synthesizer", synthesizer)
  .addEdge(START, "orchestrator")
  .addConditionalEdges("orchestrator", routeAfterOrchestrator)
  .addConditionalEdges("codeExecutor", routeAfterExecution)
  .addEdge("synthesizer", END)
  .compile();

The routeAfterOrchestrator function checks: if we've hit max turns, synthesize. Otherwise, execute the generated code.

The routeAfterExecution function checks: if the code called setFinal() (signaling it has the complete answer), or if we've hit max turns, route to synthesis. Otherwise, loop back to the orchestrator for another turn.

This creates a cycle: orchestrator → code executor → orchestrator → code executor → ... → synthesizer. The number of turns is determined dynamically based on whether the code signals completion via setFinal(), up to maxTurns.
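Invoking the compiled graph is a single call with the initial state. A minimal sketch (the file path matches the demo below; error handling omitted):

import { readFileSync } from "node:fs";

// Load the document and kick off the graph with the initial state
const document = readFileSync("./data/api-v2-migration-guide.md", "utf-8");

const result = await graph.invoke({
  query: "What are all the breaking changes in the v2 API migration?",
  document,
  maxTurns: 5,
});

console.log(result.finalAnswer);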

Running the Demo

Clone the repo, install dependencies, and run against any document:

cd example
npm install
echo "ANTHROPIC_API_KEY=your-key-here" > .env
npx tsx src/index.ts ./data/api-v2-migration-guide.md \
  "What are all the breaking changes in the v2 API migration?"

Here's what the real output looks like on a 904K-character API migration guide:

🚀 Recursive Language Model Demo

[Config] File: ./data/api-v2-migration-guide.md
[Config] Query: What are all the breaking changes in the v2 API migration?

✓ Loaded document (903,920 characters)

Starting graph execution...

[Orchestrator] Turn 1 → LLM generating code...
✓ Generated 31 lines of code

[Code Executor] Executing generated code...
[Sandbox] sub_RLM(1): What authentication changes...
[Sandbox] sub_RLM(2): What pagination changes...
[Sandbox] sub_RLM(3): What rate limiting changes...
[Sandbox] sub_RLM(4): What error format changes...
[Sandbox] sub_RLM(5): What endpoint restructuring...
[Sandbox] sub_RLM(6): What webhook changes...
[Sandbox] sub_RLM(7): What SDK requirements...
[Sandbox] sub_RLM(8): What deprecation timeline...
✓ Execution complete: Final result. Found 7 critical breaking changes...

[Synthesizer] Generating final answer...
✓ Answer synthesized

FINAL ANSWER
========================================

# Breaking Changes in v2 API Migration

1. **Authentication System Overhaul**:
   X-API-Key header removed, replaced by Authorization: Bearer <jwt>.
   OAuth 2.1 with PKCE support added.

2. **Pagination → Cursor-Based**:
   page/per_page params removed. Use cursor and limit instead.

3. **Rate Limiting → IETF**:
   X-RateLimit-* → RateLimit-*. Retry-After returns seconds, not date.

4. **Error Format → RFC 7807**:
   Simple error messages replaced by Problem Details structure.

5. **Endpoint Restructuring**:
   Base path /v1/ → /v2/. RESTful naming standardized.

6. **Webhook Changes**:
   New event/data envelope. Signature header renamed.

7. **SDK Version Requirement**:
   Requires SDK v3.0+. v1 sunset: 2026-09-30.

✓ Graph execution completed!

Before & After: Why RLM Matters

To understand the difference, compare what a naive single-pass LLM call returns for the same query on the same 904K-character document:

Naive approach (entire document in one prompt):

ERROR: 400
Type: invalid_request_error
Message: prompt is too long: 212958 tokens > 200000 maximum

The document doesn't even fit in the context window. At 212,958 tokens, it exceeds Claude's 200K limit. You can't even start the analysis. And even if a model had a large enough window, the "lost in the middle" degradation described earlier would erode results for a document this size.

RLM approach (code-generated exploration): Found all 7 critical breaking change areas with detailed analysis. The LLM wrote code that searched for section markers, extracted relevant passages, and made 8 parallel recursive sub-LM calls to analyze each area. No pre-chunking, no fixed chunk selection — the exploration strategy was designed by the LLM itself based on the document structure.

This is the core value proposition: thorough analysis that doesn't miss information buried in the middle of long documents, and that works on documents far too large for any single LLM call.

Production Considerations

Before you ship this to production, let's talk about the practical challenges.

Cost and latency. Let's break down a typical run on our 904K-character document. The orchestrator makes one LLM call to generate exploration code (~2K tokens). The code executor runs that code, which triggers 8 parallel recursive sub-LM calls via analyzeMultiple() (8 LLM calls, each analyzing a section of roughly 5K tokens, for ~40K input tokens and ~8K output tokens). The synthesizer makes one final call (~5K tokens input). Total: roughly 47K input tokens, 12K output tokens.

At Claude Sonnet pricing (as of March 2026: $3 per million input tokens, $15 per million output tokens), that's about $0.32 per run. Not bad for one-off analysis of a 904K document that a naive approach can't even start on. But if you're processing hundreds of documents daily, costs add up fast.
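To sanity-check that figure, the arithmetic is just token counts times the per-million rates quoted above:

// Back-of-the-envelope cost for one run at $3 / $15 per million tokens
const inputTokens = 47_000;
const outputTokens = 12_000;
const costUsd = (inputTokens / 1_000_000) * 3 + (outputTokens / 1_000_000) * 15;
console.log(costUsd.toFixed(2)); // "0.32"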

Latency compounds too. You're making multiple round trips: ~3 seconds for the orchestrator, ~5 seconds for 8 parallel sub-LM calls, ~3 seconds for synthesis. That's 11 seconds minimum, and you might iterate twice. Compare this to a single 6-second RAG query. Mitigations: prompt the orchestrator to target 5-8 sub-calls (not dozens), cache document structure metadata if you reprocess similar documents, and stream the final answer while synthesis is still running.

Failure handling is non-negotiable. Any individual LLM call can fail or timeout. The code executor wraps execution in try-catch blocks. Recursive sub-LM calls can fail — implement exponential backoff. The maxTurns cap prevents runaway loops if the orchestrator keeps generating more exploration code, but you should also add a wall-clock timeout. Sandbox the VM execution properly — limit memory, timeout long-running code.
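A minimal retry wrapper for the sub-LM call might look like this; the attempt count and base delay are assumptions, not values from the repo.

// Exponential backoff around any flaky async call, e.g. handleSubRLM
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1_000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Wait 1s, 2s, 4s... before the next attempt
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage: const answer = await withRetry(() => handleSubRLM(query, text));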

You need observability. When a run produces bad results, you need to answer specific questions: did the LLM generate incorrect code? Did a sub-LM call hallucinate? Did the code call the wrong utility functions? LangSmith integrates directly with LangGraph and shows you the full execution trace — every node, every state transition, every LLM call with prompt and response. You can see the exact generated code, the sandbox execution output, and every recursive sub-LM call.

Beyond tracing, log turn counts, generated code structure (how many sub-calls, which utility functions used), and cost per run. If you notice the orchestrator always generates code with dozens of sub-calls, your system prompt might need adjustment. If synthesis quality drops, the code might be slicing sections too broadly.

When NOT to use RLM. If your use case is well-served by RAG, stick with RAG. It's faster, cheaper, and simpler. RLM shines when you need cross-document analysis, nuanced synthesis, or when simple chunk retrieval misses important connections. Don't overcomplicate if you don't need to.

RLM vs RAG: A Decision Framework

Here's a quick decision tree to help you choose:

  • Need a specific fact from a known location? → Just use the LLM directly with the relevant section
  • Need facts scattered across an unknown location in a large document? → RAG
  • Need deep analysis that synthesizes across the entire document? → RLM
  • Processing multiple related documents that need cross-referencing? → RLM
  • Extreme cost or latency sensitivity with acceptable quality trade-offs? → RAG
  • Document comfortably fits in context window (< 20 pages)? → Just use the LLM directly

RAG is your workhorse for most retrieval tasks. RLM is your specialist for complex analysis.

One more consideration: RLM's code-based exploration gives you audit trails. You can see exactly what code was generated, which sections were analyzed, and what each recursive sub-call returned. For legal, compliance, or research applications where provenance matters, this transparency is valuable even if RAG would technically work.

Conclusion

Context windows will keep growing. We've already seen 10 million token windows (Llama 4 Scout launched with 10M context in April 2025), and this trend will continue. But the "lost in the middle" problem persists in current transformer architectures. While research into sparse attention, ALiBi, RoPE, and alternative positional encodings may eventually mitigate it through architectural improvements, it remains a practical concern for production systems today. Bigger context windows just move the problem further out; they don't solve it.

Recursive Language Models give you a principled way to work with these limits rather than against them. Instead of hoping the LLM can find the needle in the haystack, you use the LLM itself to decide where to look, what to analyze, and how to synthesize.

LangGraph makes the implementation surprisingly clean. The entire system we walked through is ~400 lines of TypeScript. No black magic: just stateful graphs, conditional routing, and iteration. And while the implementation uses LangGraph APIs, those underlying patterns are conceptually portable to other frameworks.

If you're building anything that processes documents longer than 20 pages — legal contracts, research papers, technical specifications, policy documents — this pattern is worth having in your toolkit. You're not fighting the context ceiling anymore. You're working within it deliberately.

The code is modular enough that you can extend the sandbox utility functions, adjust the orchestrator's system prompt for domain-specific exploration strategies, or add validation layers between steps. Start with the example repo, break it, extend it, make it yours.

Context windows are a constraint. But constraints breed creativity. RLM is proof of that.


Further Reading

Research & Theory

  • Recursive Language Models (MIT CSAIL) — Zhang, Kraska & Khattab (2025). Original RLM paper.
  • Lost in the Middle: How Language Models Use Long Contexts — Liu et al. (2024). Transactions of the Association for Computational Linguistics, 12, 157–173. Research on context window degradation.
  • LangGraph Documentation — Official LangGraph TypeScript documentation.

Code & Examples

  • RLM LangGraph Demo Repository — Complete working example from this post.
  • LangSmith — Observability platform for LLM applications.

Alternative Approaches

  • RAG from Scratch — Understanding when RAG is the right choice.
  • Anthropic Prompt Engineering Guide — Maximizing single-pass LLM performance.
  • OpenAI Developer Cookbook — Techniques for working with long contexts.