Revolutionizing User Experience: How Advanced AI Agents Anticipate User Needs
As artificial intelligence technologies continue to evolve, the potential for creating truly useful agents has never been more promising. The key to enhancing user experiences, particularly on mobile devices, lies in the ability of underlying models to understand user actions and intentions. By grasping what a user is doing—or attempting to do—when interacting with their device, these models can provide more relevant and anticipatory suggestions. For instance, if a user has previously searched for music festivals across Europe and is now looking for a flight to London, an intelligent agent could proactively suggest festivals happening in London during the specified dates.
The Role of Large Multimodal LLMs
Large multimodal LLMs have already shown significant proficiency in interpreting user intent from user interface (UI) trajectories. However, using these large models for this purpose usually means sending information to a remote server, which can be slow and costly and may expose sensitive information, raising privacy concerns.
Introducing Small Multimodal LLMs: A Breakthrough in Intent Extraction
In our recent paper, “Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition,” presented at EMNLP 2025, we explore how small multimodal LLMs (MLLMs) can comprehend user interaction sequences across the web and mobile devices while processing all data locally on the device. We decompose intent understanding into two distinct steps: first summarizing each screen individually, then deriving an overall intent from the sequence of summaries. This simplifies the task for smaller models and allows them to perform comparably to their larger counterparts, demonstrating significant potential for on-device applications.
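To make the two-step decomposition concrete, here is a minimal Python sketch of the control flow only. It is not the paper's implementation: the `Interaction` type, the `extract_intent` function, and the two stub functions are hypothetical names introduced for illustration, and the stubs stand in for calls to a small on-device MLLM, with a text description taking the place of a screenshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Interaction:
    """One step of a UI trajectory (hypothetical type for this sketch)."""
    screen_description: str  # assumption: text stand-in for a screenshot
    action: str              # what the user did on that screen


# Assumed model interfaces: each would wrap a small on-device MLLM call.
ScreenSummarizer = Callable[[Interaction], str]
IntentInferrer = Callable[[Sequence[str]], str]


def extract_intent(
    trajectory: Sequence[Interaction],
    summarize: ScreenSummarizer,
    infer: IntentInferrer,
) -> str:
    """Two-step intent extraction: per-screen summaries, then one overall intent."""
    # Step 1: summarize each screen individually, with no cross-screen context.
    summaries: List[str] = [summarize(step) for step in trajectory]
    # Step 2: derive a single intent from the ordered sequence of summaries.
    return infer(summaries)


def stub_summarize(step: Interaction) -> str:
    """Placeholder for the screen-summary model call."""
    return f"On '{step.screen_description}', the user chose to {step.action}."


def stub_infer(summaries: Sequence[str]) -> str:
    """Placeholder for the intent model call; it only concatenates summaries."""
    return f"Intent derived from {len(summaries)} screen summaries: " + " ".join(summaries)


if __name__ == "__main__":
    trajectory = [
        Interaction("flight search", "enter London as destination"),
        Interaction("results list", "select a morning flight"),
    ]
    print(extract_intent(trajectory, stub_summarize, stub_infer))
```

Passing the two stages in as callables keeps the sketch independent of any particular model runtime. The point it illustrates is that the second stage sees only short text summaries rather than the full sequence of screens, which is what makes the task easier for a small model.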
Evaluating Model Performance
To assess the effectiveness of our approach, we formalized metrics for evaluating model performance. Our findings indicate that the decomposed method achieves results that rival those of much larger models. This highlights both the capability of small models and the feasibility of deploying them in real-world, on-device scenarios. The study builds upon our team’s previous research on understanding user intent.
For a deeper dive into our methods and findings, see the full paper.