Identifying User Goals from UI Trajectories

Read original: arXiv:2406.14314 - Published 7/2/2024 by Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan

Overview

This paper presents a novel approach to identifying user goals from their interactions with user interfaces (UIs) over time.
The researchers propose a method that can infer a user's intent or objective based on their trajectory through the UI, rather than relying solely on the final actions taken.
Potential applications include the design of intent-based user interfaces, multimodal user interaction, and the automation of GUI interactions from instructional videos.

Plain English Explanation

When we use software or interact with digital interfaces, we often have a specific goal or objective in mind, even if it's not always clear to the system. This paper proposes a way for the computer to better understand what the user is trying to achieve based on how they navigate through the interface, rather than just looking at their final actions.

Imagine you're trying to book a flight online. You might click around, explore different options, and even end up on a page that doesn't directly lead to booking a ticket. But the path you take through the website can reveal your underlying intent - in this case, to find and book a flight. The researchers developed a technique that can analyze this "trajectory" of your interactions to infer your goal, rather than just looking at the last thing you did.

This could be really useful for designing interfaces that are more responsive to user intent, or for automating the completion of tasks by understanding the user's objective. It could also help with multimodal user interaction, where the system needs to infer intent from a combination of different inputs like speech, gestures, and UI interactions.

Technical Explanation

The key idea behind this work is to model user goals as latent variables that can be inferred from the trajectory of their interactions with a user interface over time. Rather than just looking at the final actions taken, the researchers propose a method that can capture the evolving intent of the user as they navigate through the UI.

The approach involves building a neural network-based model that takes in a sequence of user interactions (e.g., clicks, scrolls, hovers) and outputs a probability distribution over possible user goals. The model is trained on labeled data, where the ground truth user goals are known, allowing it to learn the patterns and features that distinguish different intents.
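To make the input/output shape of such a model concrete, here is a minimal sketch in Python. It is not the authors' model: it swaps the neural network for hand-set linear scores followed by a softmax, and the goal names, interaction tokens, and weights are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical goals and per-goal weights for each interaction token.
# These numbers are invented for illustration, not learned from data.
WEIGHTS = {
    "book_flight": {"click:search_flights": 2.0, "click:select_date": 1.5, "scroll": 0.1},
    "check_status": {"click:flight_status": 2.5, "click:search_flights": 0.3, "scroll": 0.1},
}

def goal_distribution(trajectory):
    """Map a sequence of interaction tokens to a probability distribution over goals."""
    counts = Counter(trajectory)
    # Linear score per goal: sum of token weights times token counts.
    scores = {
        goal: sum(weights.get(token, 0.0) * n for token, n in counts.items())
        for goal, weights in WEIGHTS.items()
    }
    # Softmax, subtracting the maximum score for numerical stability.
    top = max(scores.values())
    exps = {goal: math.exp(score - top) for goal, score in scores.items()}
    total = sum(exps.values())
    return {goal: value / total for goal, value in exps.items()}

probs = goal_distribution(["click:search_flights", "scroll", "click:select_date"])
print(probs)
```

In a trained system the weights would come from the labeled trajectories described above, and the bag-of-tokens features would typically be replaced by a sequence encoder; the sketch only shows how a trajectory becomes a distribution over candidate goals.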

Importantly, the model operates in an online fashion, updating its understanding of the user's goal at each step of the interaction. This allows it to adapt to changes in intent over the course of a task, rather than just making a single prediction at the end.
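One simple way to picture step-by-step updating is a recursive Bayesian filter over goals. The sketch below is an assumption-laden illustration rather than the paper's procedure: it treats actions as conditionally independent given the goal (a naive Bayes simplification), and the goals, actions, and likelihood values are hypothetical.

```python
# Hypothetical likelihoods P(action | goal); invented for illustration.
LIKELIHOOD = {
    "book_flight": {"search": 0.5, "pick_date": 0.4, "view_status": 0.1},
    "check_status": {"search": 0.3, "pick_date": 0.1, "view_status": 0.6},
}

def update(belief, action):
    """One Bayesian step: multiply the current belief by P(action | goal), then renormalise."""
    unnormalised = {goal: p * LIKELIHOOD[goal][action] for goal, p in belief.items()}
    total = sum(unnormalised.values())
    return {goal: value / total for goal, value in unnormalised.items()}

def track(actions):
    """Return the belief over goals after each action, starting from a uniform prior."""
    belief = {goal: 1.0 / len(LIKELIHOOD) for goal in LIKELIHOOD}
    history = []
    for action in actions:
        belief = update(belief, action)
        history.append(belief)
    return history

history = track(["search", "pick_date", "view_status"])
for step, belief in enumerate(history, start=1):
    print(step, belief)
```

With these made-up numbers, the belief in `book_flight` rises after `pick_date` and falls again after `view_status`, which is the kind of mid-task revision the online formulation is meant to allow.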

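The researchers run their experiments on the Android-In-The-Wild and Mind2Web datasets, exploiting the inverse relation between goal identification and UI automation. Predictions are scored with a new metric that checks whether two task descriptions are paraphrases within a specific UI environment. Under this setup, Gemini-1.5 Pro performs better than GPT-4, but both still underperform humans.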

Critical Analysis

The proposed approach represents an important step forward in understanding user intent from interaction data, but it also has some limitations and caveats that merit further discussion.

One key challenge is the reliance on labeled training data, where the ground truth user goals are known. In many real-world scenarios, this type of labeled data may be difficult or expensive to obtain, which could limit the practical applicability of the method. The authors acknowledge this and suggest exploring unsupervised or few-shot learning approaches as a potential solution.

Additionally, the evaluation is primarily conducted on simulated or curated datasets, which may not fully capture the complexities and noise inherent in real-world user interactions. Further testing on diverse, in-the-wild datasets would be valuable to assess the method's robustness and generalizability.

Another area for potential improvement is the model's ability to handle changes in user intent over the course of an interaction. While the online nature of the approach is a step in the right direction, there may be opportunities to further enhance the model's adaptability and flexibility in the face of evolving user goals.

Despite these limitations, the core idea of leveraging interaction trajectories to infer user intent is a promising direction for the field of intent-based user interfaces and multimodal interaction. The authors' work serves as a valuable foundation for future research in this area.

Conclusion

This paper presents a novel approach to identifying user goals from UI interaction trajectories, which could have important implications for the design of more responsive and intelligent user interfaces. By modeling user intent as a latent variable that can be inferred from interaction data, the researchers have developed a technique that can adapt to evolving user goals over time.

While the method has some limitations, such as the reliance on labeled training data and the need for further evaluation on real-world scenarios, the core idea represents a significant advancement in the field of intent-based user interfaces and multimodal interaction. As researchers continue to build on this work, we may see more intelligent and personalized user experiences that better understand and anticipate the user's goals and objectives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Identifying User Goals from UI Trajectories

Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan

Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. By effectively comprehending user intentions through their actions and interactions with GUIs, agents will be better positioned to achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. We propose a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. By leveraging the inverse relation with the UI automation task, we utilized the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conducted several experiments comparing the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini performs better than GPT but still underperforms compared to humans, indicating significant room for improvement.

7/2/2024

Towards Intent-based User Interfaces: Charting the Design Space of Intent-AI Interactions Across Task Types

Zijian Ding

Technological advances continue to redefine the dynamics of human-machine interactions, particularly in task execution. This proposal responds to the advancements in Generative AI by outlining a research plan that probes intent-AI interaction across a diverse set of tasks: fixed-scope content curation task, atomic creative tasks, and complex and interdependent tasks. This exploration aims to inform and contribute to the development of Intent-based User Interface (IUI). The study is structured in three phases: examining fixed-scope tasks through news headline generation, exploring atomic creative tasks via analogy generation, and delving into complex tasks through exploratory visual data analysis. Future work will focus on improving IUIs to better provide suggestions to encourage experienced users to express broad and exploratory intents, and detailed and structured guidance for novice users to iterate on analysis intents for high quality outputs.

5/3/2024

You Only Look at Screens: Multimodal Chain-of-Action Agents

Zhuosheng Zhang, Aston Zhang

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI.

6/10/2024

UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, Etai Littwin

Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, Intent in the Wild (IIW) and Intent in the Tame (IIT), designed for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve user intent predictions that match the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by intent similarity scores, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across two datasets. Notably, UI-JEPA accomplishes the performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency in the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding.

9/17/2024