In this article, you will learn how to build efficient long-context retrieval augmented generation (RAG) systems using modern techniques that address attention limitations and cost concerns.
Topics we will cover include:
- How reranking alleviates the “lost in the middle” problem.
- How contextual caching reduces latency and computational costs.
- How hybrid retrieval, metadata filtering, and query expansion improve relevance.
Introduction
Retrieval augmented generation (RAG) is experiencing a major change. For years, RAG’s mantra was simple: “Chunk your documents into smaller pieces, embed them, and retrieve the most relevant ones.” This was necessary because large language models (LLMs) had expensive and limited context windows, typically ranging from 4,000 to 32,000 tokens.
Now, models like Gemini Pro and Claude Opus have exceeded these limits, offering context windows of 1 million tokens or more. In theory, you can now paste an entire collection of novels into a prompt. In practice, however, this capability introduces two major challenges:
- The “lost in the middle” problem: Research has shown that models often ignore information placed in the middle of a massive prompt, favoring the beginning and end.
- The cost problem: Processing a million tokens for each query is expensive and computationally slow. It’s like re-reading an entire encyclopedia every time someone asks a simple question.
This tutorial explores five practical techniques for creating effective long-context RAG systems. We go beyond simple chunking and examine strategies to mitigate attention loss and enable context reuse, from a developer’s perspective.
1. Implement a Reranking Architecture to Fight “Lost in the Middle”
The “Lost in the Middle” problem, identified in a 2023 study by Stanford and UC Berkeley, reveals a critical limitation of LLM attention mechanisms. When presented with a long context, model performance peaks when relevant information appears at the beginning or end. Information buried in the middle is much more likely to be ignored or misinterpreted.
Instead of inserting the retrieved documents directly into the prompt in their original order, introduce a reranking step.
Here is the developer workflow:
- Retrieval: Use a standard vector database (such as Pinecone or Weaviate) to retrieve a larger set of candidates (e.g., top 20 instead of top 5).
- Reranking: Run these candidates through a specialized cross-encoder reranker (such as the Cohere Rerank API or a Sentence-Transformers cross-encoder model) that scores each document against the query.
- Selection: Keep only the 5 most relevant documents.
- Contextual placement: Place the most relevant document at the beginning and the second most relevant at the end of the prompt. Place the other three in the middle.
This strategic placement ensures that the most important information receives maximum attention.
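To make the workflow concrete, here is a minimal sketch using a Sentence-Transformers cross-encoder. The `retrieve_candidates()` helper is a hypothetical stand-in for your vector database query, and the model name is just one common choice; treat this as an illustration of the rerank-and-place idea rather than a definitive implementation.

```python
# Sketch: rerank candidates with a cross-encoder, then place the strongest
# documents at the positions the model attends to most (start and end).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_place(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score every candidate against the query with the cross-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = [doc for doc, _ in sorted(zip(candidates, scores),
                                       key=lambda pair: pair[1], reverse=True)]
    top = ranked[:keep]
    if len(top) <= 2:
        return top
    # Most relevant first, second most relevant last, the rest in the middle.
    return [top[0]] + top[2:] + [top[1]]

# Usage: candidates would normally come from a top-20 vector search.
query = "How do I rotate API keys?"
candidates = retrieve_candidates(query, k=20)  # hypothetical vector DB helper
ordered_context = rerank_and_place(query, candidates)
prompt = "\n\n".join(ordered_context) + f"\n\nQuestion: {query}"
```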
2. Leverage Contextual Caching for Repeated Queries
Long contexts introduce latency and additional costs. Repeatedly processing hundreds of thousands of tokens is inefficient. Contextual caching solves this problem.
Think of this as initializing a persistent context for your model.
- Create the cache: Upload a large document (e.g., a 500,000-token manual) once via an API and set a time-to-live (TTL).
- Reference the cache: For subsequent queries, send only the user’s question and a reference ID to the cached context.
- Cost savings: You reduce input token costs and latency since the document does not need to be reprocessed each time.
This approach is particularly useful for chatbots built on static knowledge bases.
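As an illustration, the sketch below follows the create-then-reference flow using the google-generativeai Python SDK. The module paths, model identifier, and TTL value are assumptions that may differ between SDK versions, so check the Gemini context caching documentation before relying on them.

```python
# Illustrative sketch of contextual caching with the google-generativeai SDK.
# Model name and TTL are assumptions; minimum cached-token requirements apply.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# 1. Create the cache once, uploading the large manual and setting a TTL.
manual_text = open("product_manual.txt").read()  # e.g., a very large manual
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="product-manual-cache",
    system_instruction="Answer questions strictly from the attached manual.",
    contents=[manual_text],
    ttl=datetime.timedelta(hours=2),
)

# 2. For each user query, reference the cache instead of resending the manual.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I reset the device to factory settings?")
print(response.text)
```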
3. Use Dynamic Contextual Chunking with Metadata Filters
Even with large context windows, relevance is still key. Simply increasing the context size does not eliminate noise.
This approach improves on traditional chunking with structured metadata.
- Smart chunking: Divide documents into chunks (e.g., 500-1,000 tokens) and attach metadata such as source, section title, page number, and summaries.
- Hybrid filtering: Use a two-step retrieval process:
- Metadata filtering: Refine the search space based on structured attributes (e.g., date ranges or document sections).
- Semantic search: Perform a similarity search only on the filtered candidates.
This reduces irrelevant context and improves accuracy.
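Below is a minimal in-memory sketch of the filter-then-search idea using Sentence-Transformers. In production, the metadata filter and the similarity search would both run inside your vector database (Pinecone, Weaviate, etc.); the chunk fields and sample texts here are illustrative assumptions.

```python
# Sketch: narrow the search space with metadata, then search only the survivors.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    {"text": "To rotate keys, open Settings > Security ...", "section": "security", "year": 2024},
    {"text": "The 2019 release introduced the legacy key API ...", "section": "changelog", "year": 2019},
    {"text": "Key rotation is recommended every 90 days ...", "section": "security", "year": 2024},
]

def filtered_search(query: str, section: str, min_year: int, top_k: int = 2) -> list[str]:
    # Step 1: metadata filtering on structured attributes.
    candidates = [c for c in chunks if c["section"] == section and c["year"] >= min_year]
    if not candidates:
        return []
    # Step 2: semantic similarity search only on the filtered candidates.
    query_emb = embedder.encode(query, convert_to_tensor=True)
    doc_embs = embedder.encode([c["text"] for c in candidates], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0].tolist()
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c["text"] for c, _ in ranked[:top_k]]

print(filtered_search("How often should I rotate API keys?", section="security", min_year=2023))
```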
4. Combine Keyword and Semantic Search with Hybrid Retrieval
Vector search captures meaning but can miss exact keyword matches, which are essential for technical queries.
Hybrid search combines semantic and keyword-based retrieval.
- Dual retrieval:
- Vector database for semantic similarity.
- Keyword index (e.g., Elasticsearch) for exact matches.
- Fusion: Use Reciprocal Rank Fusion (RRF) to combine the rankings, prioritizing results that score highly in both systems.
- Context population: Insert the merged results into the prompt using the placement principles from the reranking technique above.
This ensures both semantic relevance and lexical correctness.
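The fusion step itself is only a few lines. Here is a self-contained sketch of Reciprocal Rank Fusion; the two input lists stand in for rankings returned by the vector database and the keyword index, and k = 60 is the constant proposed in the original RRF work.

```python
# Sketch: fuse two ranked lists of document IDs with Reciprocal Rank Fusion.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores 1 / (k + rank) per list; higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_7", "doc_2", "doc_9", "doc_4"]   # from the vector database
keyword_hits  = ["doc_2", "doc_4", "doc_7", "doc_11"]  # from the keyword index (e.g., Elasticsearch)
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# Documents ranked highly by both systems (doc_2, doc_7) float to the top.
```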
5. Apply Query Expansion with Summarize-Then-Retrieve
User queries often differ from the way information is expressed in documents. Query expansion helps fill this gap.
Use a lightweight LLM to rewrite the user’s question into several alternative search queries, then retrieve with each variant and merge the results.
This improves performance on inferential and vaguely worded queries.
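As a rough sketch, the snippet below expands a question with a lightweight model via the OpenAI SDK. The model name, prompt wording, and the `retrieve()` helper are assumptions used for illustration, not part of a prescribed workflow; any chat-capable model and any retriever will do.

```python
# Sketch: generate alternative phrasings of a query, then retrieve with each.
from openai import OpenAI

client = OpenAI()

def expand_query(question: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question as {n} alternative search queries, "
        f"one per line, using vocabulary a technical document might use:\n{question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    alternatives = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]
    return [question] + alternatives

# Retrieve with every variant and fuse the results (e.g., with RRF from technique 4).
queries = expand_query("Why does my deploy keep timing out?")
all_hits = [retrieve(q, k=10) for q in queries]  # hypothetical retriever
```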
Conclusion
The emergence of million-token context windows doesn’t eliminate the need for retrieval augmented generation; it reshapes it. Although long contexts reduce the need for aggressive chunking, they introduce challenges related to attention allocation and costs.
By applying reordering, contextual caching, metadata filtering, hybrid retrieval, and query expansion, you can create systems that are both scalable and accurate. The goal is not simply to provide more context, but to ensure that the model consistently focuses on the most relevant information.
References
- How language models use long contexts
- Gemini API: Context Caching
- Rerank – The Power of Semantic Search (Cohere)
- The probabilistic relevance framework
About Shittu Olumide
Shittu Olumide is a software engineer and technical writer with a passion for leveraging cutting-edge technologies to create compelling stories, a keen eye for detail, and a talent for simplifying complex concepts. You can also find Shittu on Twitter.

