Everything you need to know about recursive language models
In this article, you will learn what recursive language models are, why they are important for long-input reasoning, and how they differ from standard prompting, retrieval, and long-context agent systems.
Topics we will cover include:
- Why a long context is not enough to resolve reasoning on very large inputs
- How recursive language models use an external runtime and recursive subcalls to process information
- The main trade-offs, limitations and practical use cases of this approach
Let’s get straight to the point.
Introduction
If you’re here, you’ve probably heard about recent work on recursive language models. The idea was trending on LinkedIn and X, which led me to study the topic further and share what I learned with you. I think we can all agree that large language models (LLMs) have improved rapidly in recent years, particularly in their ability to handle large inputs. These advances have led many people to assume that long context is largely a solved problem, but this is not the case. If you’ve tried giving models very long inputs, close to or at the limit of their context window, you may have noticed that they become less reliable. They often miss details in the information provided, contradict previous statements, or produce superficial answers instead of reasoning carefully. This phenomenon is often called “context rot”, which is quite an evocative name.
Recursive language models (RLMs) are an answer to this problem. Instead of pushing more and more text into a single forward pass of a language model, RLMs change how the model interacts with long input. In this article, we’ll look at what they are, how they work, and the types of problems they are designed to solve.
Why a long context is not enough
You can skip this section if you already understand the motivation from the introduction. But if you’re curious, or if the idea didn’t completely click the first time, let me break it down further.
The standard way of using an LLM is quite simple. Everything we want the model to consider is provided as a single prompt, and based on this information the model generates the output token by token. This works well when the prompt is short. However, when it becomes very long, performance starts to degrade. This is not necessarily due to hard context-length limits: even though the model can see the full prompt, it often fails to use it effectively. Here are some reasons that may contribute to this behavior:
- These LLMs are mainly transformer-based models built on an attention mechanism. As the prompt lengthens, attention becomes more diffuse, and the model struggles to focus on what matters when it has to deal with tens or hundreds of thousands of tokens.
- Another reason is the presence of heterogeneous information mixed together, such as logs, documents, code, chat history, and intermediate outputs.
- Finally, many tasks are not just about retrieving a relevant snippet from a huge corpus. They often involve aggregating information across the entire input.
Due to the issues discussed above, people have proposed ideas such as summarization and retrieval. These approaches are useful in some cases, but they are not universal solutions. Summaries are lossy by design, and retrieval assumes that relevance can be reliably identified before reasoning begins. Many real-world tasks violate these assumptions. This is why RLMs take a different approach: instead of forcing the model to absorb the entire prompt at once, they allow it to actively explore and process the prompt. Now that we have the basic context, let’s take a closer look at how this works.
How a recursive language model works in practice
In an RLM configuration, the prompt is treated as part of the external environment. This means the model does not directly read the entire input. Instead, the input lives outside the model, often as a variable, and the model only receives metadata about the prompt along with instructions on how to access it. When the model needs information, it issues commands to examine specific parts of the prompt. This simple design keeps the model’s internal context small and focused, even when the underlying input is extremely large. To understand RLMs more concretely, let’s walk through a typical execution step by step.
Step 1: Initializing a persistent REPL environment
At the start of an RLM execution, the system initializes an execution environment, typically a Python REPL. This environment contains:
- A variable containing the full user prompt, which can be arbitrarily large
- A function (for example, llm_query(…) or sub_RLM(…)) that allows the system to invoke additional language model calls on selected pieces of text
From the user’s perspective the interface remains simple, with textual input and output, but internally the REPL acts as a scaffold that enables iterative reasoning.
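As a concrete illustration, here is a minimal sketch of what such an environment could look like. The names (`prompt`, `llm_query`) and the stubbed sub-call are assumptions for illustration, not the actual implementation:

```python
# Minimal sketch of a persistent RLM environment. The function and
# variable names here are illustrative, not the real system's API.

def llm_query(text: str, question: str) -> str:
    # Stand-in for a real sub-call: a production system would invoke an
    # actual language model on this slice of text.
    return f"answer({question}) over {len(text)} chars"

# The full user prompt is stored as an ordinary variable in the REPL,
# outside any model's context window.
prompt = "\n".join(f"Section {i}: ..." for i in range(10_000))

env = {"prompt": prompt, "llm_query": llm_query}
```

The important point is that `prompt` can be arbitrarily large without affecting any single model call, because no call ever reads it in full.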
Step 2: Calling the root model with prompt-only metadata
The root language model is then called, but it does not receive the full prompt. Instead it is given:
- Constant-size metadata about the prompt, such as its length or a short prefix
- Instructions describing the task
- Access instructions for interacting with the prompt through the REPL environment
By withholding the full prompt, the system forces the model to interact with the input deliberately, rather than passively absorbing it into its context window. From this point on, the model interacts with the prompt only indirectly.
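A sketch of the constant-size metadata handed to the root model might look like the following; the exact fields (length, line count, short prefix) are assumptions based on the description above:

```python
# Build constant-size metadata about the prompt for the root model.
# Field names are illustrative; a real system may expose different ones.

def prompt_metadata(prompt: str, prefix_chars: int = 120) -> dict:
    return {
        "total_chars": len(prompt),
        "total_lines": prompt.count("\n") + 1,
        "prefix": prompt[:prefix_chars],  # a short peek, not the whole input
    }

prompt = "line one\nline two\n" + "x" * 5000
meta = prompt_metadata(prompt)
# The root call receives `meta` plus task and access instructions,
# never `prompt` itself.
```

Whatever the prompt's size, `meta` stays roughly the same size, which is what keeps the root call cheap.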
Step 3: Inspecting and decomposing the prompt via code execution
The model can start by inspecting the structure of the input. For example, it can print the first lines, search for headings, or split the text into chunks based on delimiters. These operations are performed by generating code, which is then executed in the environment. The outputs of these operations are truncated before being shown to the model, ensuring that its context window is not overwhelmed.
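The kind of code the model might emit at this step can be sketched as follows. The `truncate` helper and the blank-line delimiter are illustrative assumptions:

```python
# Sketch of model-generated inspection code: peek at the first lines,
# split on a delimiter, and clip any output before it is shown to the
# model. Helper names and the delimiter are illustrative.

def truncate(output: str, limit: int = 200) -> str:
    # Mirrors how REPL output is clipped to protect the context window.
    return output if len(output) <= limit else output[:limit] + " ...[truncated]"

prompt = "\n\n".join(f"## Chapter {i}\nbody text" for i in range(50))

head = "\n".join(prompt.splitlines()[:5])  # print the first lines
chunks = prompt.split("\n\n")              # split into chunks on blank lines
shown = truncate(str(chunks))              # what the model actually sees
```

Only `head` and `shown` (both small) enter the model's context; the full `chunks` list stays in the environment.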
Step 4: Issuing recursive subcalls on selected slices
Once the model understands the structure of the prompt, it can decide how to proceed. If the task requires semantic understanding of certain sections, the model can issue subcalls. Each subcall is a separate language model call on a smaller part of the prompt. This is where the “recursive” part comes in: the model can decompose the problem repeatedly, process parts of the input, and store intermediate results. These results live in the environment, not in the model’s context.
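The fan-out over slices can be sketched as a simple loop; `llm_query` is stubbed here for determinism, since the real sub-call would hit an actual model:

```python
# Sketch of recursive subcalls over selected slices. `llm_query` is a
# deterministic stand-in for a real sub language model call.

def llm_query(text: str, question: str) -> str:
    # A real RLM would spawn a fresh model call on this slice.
    return f"summary of {len(text)} chars"

chunks = ["alpha " * 100, "beta " * 200, "gamma " * 50]
question = "What is this section about?"

# Intermediate results accumulate as a plain Python list in the REPL,
# not inside any model's context window.
partials = [llm_query(chunk, question) for chunk in chunks]
```

Because the loop itself is generated code, the number of subcalls scales with the input's structure rather than being fixed in advance.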
Step 5: Assembling and returning the final response
Finally, once enough information has been collected and processed, the model constructs the final answer. If the output is long:
- The model builds it up incrementally in a REPL variable, such as Final
- Once Final is defined, the RLM loop ends
- The value of Final is returned as the response
This mechanism allows the RLM to produce outputs that exceed the token limits of a single language model call. Throughout this process, no language model call needs to see the full prompt.
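The termination convention described above can be sketched as a loop that executes model-written code strings and stops when the `Final` variable appears. The step strings and namespace handling are illustrative assumptions:

```python
# Sketch of the termination protocol: execute model-written code until
# the `Final` variable is defined, then return it. The step strings
# stand in for code a model would actually generate.

namespace = {}  # stands in for the persistent REPL namespace

steps = [
    "parts = []",
    "parts.append('part one. ')",
    "parts.append('part two.')",
    "Final = ''.join(parts)",
]

for code in steps:
    exec(code, namespace)        # run one model-written snippet
    if "Final" in namespace:     # loop ends once Final is defined
        break

answer = namespace["Final"]
```

Because `Final` is assembled across several snippets, the answer can grow beyond what any single model call could emit in one go.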
What differentiates RLMs from retrieval systems and agents
If you spend any time in the LLM space, you might confuse this approach with agentic frameworks or retrieval-augmented generation (RAG). However, they are different ideas, even if the distinctions may seem subtle.
In many agent systems, the entire conversation history or working memory is repeatedly injected into the model’s context. When the context becomes too long, older information is summarized or dropped. RLMs avoid this pattern altogether by keeping the prompt external from the start. Retrieval systems, in contrast, rely on identifying a small set of relevant fragments before reasoning begins. This works well when relevance is sparse. RLMs are designed for cases where relevance is dense and distributed, and where aggregation across many parts of the input is required. Another key difference is recursion. In RLMs, recursion is not metaphorical: the model literally calls language models inside loops it generates as code, allowing the work to scale with input size in a controlled manner.
Costs, trade-offs and limitations
It is also worth noting some disadvantages of this method. RLMs do not eliminate computational cost; they shift it. Instead of paying for a single very large model call, you pay for many smaller calls, plus the overhead of running and orchestrating code. In many cases the total cost is comparable to a standard long-context call, but the variance can be higher. There are also practical challenges. The model must be able to write reliable code. Poorly constrained models may generate too many subcalls or fail to terminate correctly. Output protocols must be carefully designed to distinguish intermediate steps from final responses. These are engineering problems rather than design flaws, but they matter nonetheless.
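The claim that the cost is shifted rather than eliminated can be made concrete with a back-of-the-envelope model. All numbers here, including the flat per-token price, are made up for illustration:

```python
# Toy cost model: one long-context call vs. many chunked subcalls plus a
# small root call. The flat per-token price is a made-up illustration.

PRICE_PER_TOKEN = 1e-6  # hypothetical price per input token

def single_call_cost(prompt_tokens: int) -> float:
    return prompt_tokens * PRICE_PER_TOKEN

def rlm_cost(prompt_tokens: int, chunk_tokens: int, root_tokens: int) -> float:
    n_subcalls = -(-prompt_tokens // chunk_tokens)  # ceiling division
    sub_total = n_subcalls * chunk_tokens * PRICE_PER_TOKEN
    return sub_total + root_tokens * PRICE_PER_TOKEN

single = single_call_cost(1_000_000)
rlm = rlm_cost(1_000_000, chunk_tokens=10_000, root_tokens=5_000)
# Every prompt token is still read once by some call, so the totals are
# comparable; the gap is orchestration overhead (the root call here).
```

Under this toy model a 1M-token prompt costs about the same either way, with the RLM paying a small premium for the root call; in practice the variance comes from how many subcalls the model actually decides to issue.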
Conclusion
A useful rule of thumb: if your task becomes harder simply because the input gets longer, and summarization or retrieval risks losing important information, an RLM is probably worth considering. If the input is short and the task simple, a standard language model call will generally be faster and cheaper.

