Experiments and Results
Retrieving accurate, contextually relevant information from large document collections remains a central challenge for AI systems. A recent evaluation of agentic Retrieval-Augmented Generation (RAG) on the FramesQA dataset shows how an agentic approach handles it. FramesQA is based on the FRAMES article and is designed to test whether systems can answer complex, multi-step questions.
Understanding the Challenge
One FramesQA question that shows how complex these queries can be is: “Of the two most-watched TV season finales (as of June 2024), which lasted the longest and by how much?” To answer it correctly, a RAG system must chain several steps. First, it must identify the two most-watched finales, those of M*A*S*H and Cheers. Then it must find each finale's runtime and compute the difference between them.
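The last step of that chain is plain arithmetic once the two runtimes are known. The sketch below is only an illustration of that final step: the function name `compare_runtimes` is hypothetical, and the runtimes are the ones quoted in the system's answer later in this article (150 minutes and approximately 98 minutes).

```python
from typing import Dict, Tuple


def compare_runtimes(runtimes_min: Dict[str, int]) -> Tuple[str, str, int]:
    """Return (longer show, shorter show, difference in minutes) for two shows."""
    if len(runtimes_min) != 2:
        raise ValueError("expected exactly two shows")
    (show_a, minutes_a), (show_b, minutes_b) = runtimes_min.items()
    if minutes_a >= minutes_b:
        return show_a, show_b, minutes_a - minutes_b
    return show_b, show_a, minutes_b - minutes_a


# Steps 1 and 2 (retrieval) are assumed to have produced these values.
runtimes = {"M*A*S*H": 150, "Cheers": 98}
longer, shorter, diff = compare_runtimes(runtimes)
print(f"{longer} ran {diff} minutes longer than {shorter}.")
```

The hard part of the question is not this subtraction but retrieving both runtimes in the first place, which is where retrieval quality matters.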
Limitations of Traditional RAG Systems
Conventional RAG setups, whether vanilla RAG or agentic RAG without adequate context, can struggle with such questions. A typical response is: “Despite multiple analyses, I found no explicit runtime for M*A*S*H or Cheers. The documents provide audience data, but not duration in minutes or hours.” Answers like this show what happens when the retrieved context does not contain the facts the question depends on.
Advancements with Agentic RAG
The agentic RAG system overcomes these limitations with a more deliberate approach. It first identifies the relevant TV shows, then uses a Query Rewriter together with a sufficient context agent to run a targeted search. This lets the system retrieve the runtimes of both the M*A*S*H and Cheers finales. For example, it answers: “The M*A*S*H finale lasted 150 minutes, making it the longest of the first two. It lasted 52 minutes longer than the Cheers finale, which lasted approximately 98 minutes.”
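The retrieve, check, and rewrite loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: `agentic_retrieve` and the three `toy_*` helpers are hypothetical names, the helpers are keyword-based stand-ins for what would normally be model calls and a real search index, and the tiny corpus only restates facts quoted in this article.

```python
from typing import Callable, List, Tuple


def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[List[str], bool]:
    """Retrieve, check sufficiency, and rewrite the query until context suffices."""
    query = question
    context: List[str] = []
    for _ in range(max_rounds):
        for passage in retrieve(query):
            if passage not in context:
                context.append(passage)
        if is_sufficient(question, context):
            return context, True
        # Context is still missing something: ask for a more targeted query.
        query = rewrite(question, context)
    return context, False


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "The M*A*S*H finale is one of the two most-watched TV season finales.",
    "The Cheers finale is one of the two most-watched TV season finales.",
    "M*A*S*H finale runtime: 150 minutes.",
    "Cheers finale runtime: approximately 98 minutes.",
]


def toy_retrieve(query: str) -> List[str]:
    # Crude keyword match: runtime passages are returned only for runtime queries.
    wants_runtime = "runtime" in query.lower()
    return [p for p in CORPUS if ("runtime" in p.lower()) == wants_runtime]


def toy_is_sufficient(question: str, context: List[str]) -> bool:
    # Sufficient once a duration is known for both finales.
    return sum("minutes" in p for p in context) >= 2


def toy_rewrite(question: str, context: List[str]) -> str:
    # A real rewriter would derive this from the question and the context.
    return "finale runtime of M*A*S*H and Cheers"


QUESTION = (
    "Of the two most-watched TV season finales (as of June 2024), "
    "which lasted the longest and by how much?"
)
context, ok = agentic_retrieve(QUESTION, toy_retrieve, toy_is_sufficient, toy_rewrite)
print(ok, len(context))
```

In this toy run the first round finds only the audience passages, the sufficiency check fails, and the rewritten query brings in the two runtime passages. Without a useful rewrite the loop stops at `max_rounds` and reports that the context is insufficient, which mirrors the "no explicit runtime" failure quoted earlier.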
Empirical Evaluation
To assess this capability, an experiment was run on the FramesQA dataset, which includes 824 queries and a corpus of 2,676 PDF documents. The test compared a “Vanilla” RAG setting, which uses Google’s advanced RAG engine, with the agentic RAG in two settings. In the single-corpus setting, retrieval ran over the FramesQA documents; in the cross-corpus setting, three additional challenging datasets were added. The multi-corpus scenario simulates real-world situations in which an organization's data is managed by separate teams.
Results and Implications
The results were promising. In the multi-corpus setting, the agentic RAG nearly matched its single-corpus accuracy, correctly answering 90.1% of the questions even when it had to select from four candidate corpora. Latency was also comparable: the single- and multi-corpus versions stayed within 3% of each other on average. These findings suggest that the system can reason across diverse, unrelated data sources, which could make retrieval setups more flexible.
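For readers who want to run a similar comparison on their own data, the two quantities reported above (accuracy and the relative latency gap) can be computed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the sample values are placeholders chosen for illustration, not measurements from the study.

```python
from typing import Sequence


def accuracy(graded: Sequence[bool]) -> float:
    """Fraction of questions graded as correct."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def relative_latency_diff(single: float, multi: float) -> float:
    """Absolute latency difference as a fraction of the single-corpus latency."""
    if single <= 0:
        raise ValueError("single-corpus latency must be positive")
    return abs(multi - single) / single


# Placeholder inputs (not results from the study).
graded_answers = [True, True, False, True]
single_latency, multi_latency = 100.0, 102.0

print(f"accuracy: {accuracy(graded_answers):.1%}")
print("within 3%:", relative_latency_diff(single_latency, multi_latency) <= 0.03)
```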
This research highlights the potential of agentic RAG systems to transform data retrieval processes, offering more reliable and contextually accurate responses. Such advancements could significantly benefit various sectors, including research, education, and corporate environments.
For further details, see the complete study.