OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Read original: arXiv:2407.19056 - Published 7/30/2024 by Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

Overview

OfficeBench is a new benchmark for evaluating language agents across multiple office-related applications.
The benchmark covers a diverse set of tasks, including document editing, calendar management, email handling, and more.
The goal is to measure the capabilities of language agents in realistic office automation scenarios.

Plain English Explanation

The OfficeBench paper introduces a new benchmark for assessing the performance of language-based AI agents in office-related tasks. Unlike previous benchmarks that focused on narrow domains, OfficeBench aims to provide a more comprehensive evaluation by covering a wide range of common office automation activities, such as document editing, calendar management, and email handling.

The researchers developed this benchmark to better understand the capabilities of modern language models when applied to real-world office scenarios. By evaluating agents across multiple applications, the goal is to identify their strengths, weaknesses, and areas for improvement in order to advance the state of the art in office automation.

Technical Explanation

The OfficeBench benchmark consists of a diverse set of tasks that cover common office-related activities. These include:

Document editing: Tasks like summarizing, formatting, and revising text documents
Calendar management: Scheduling meetings, updating event details, and handling conflicts
Email handling: Composing, replying to, and organizing email communications
Other office-centric tasks: data analysis, presentation creation, and task planning

The benchmark provides a diverse set of test cases, input data, and evaluation metrics to measure the performance of language agents across these domains. By testing agents in a holistic office environment, the researchers aim to gain insights into their real-world applicability and limitations.
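
To make the idea of per-task evaluation concrete, here is a minimal sketch of how a pass rate over a set of tasks could be computed. It is an illustration only, not the paper's evaluation code: the `Task` class, the `pass_rate` function, the file names, and the two example tasks are all hypothetical and invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """A hypothetical benchmark task: an instruction plus a checker for the result."""
    task_id: str
    instruction: str
    # The checker inspects the files the agent left behind (name -> content).
    check: Callable[[Dict[str, str]], bool]


def pass_rate(tasks: List[Task], outputs: Dict[str, Dict[str, str]]) -> float:
    """Return the percentage of tasks whose checker accepts the agent's output."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(outputs.get(t.task_id, {})))
    return 100.0 * passed / len(tasks)


# Illustrative tasks and agent outputs (assumptions, not benchmark data).
tasks = [
    Task("t1", "Add a meeting with Bob to the calendar",
         lambda files: "Bob" in files.get("calendar.ics", "")),
    Task("t2", "Email the report summary to Alice",
         lambda files: "alice@example.com" in files.get("outbox.eml", "")),
]
outputs = {
    "t1": {"calendar.ics": "SUMMARY:Meeting with Bob"},
    "t2": {"outbox.eml": "To: carol@example.com"},  # wrong recipient, so this fails
}

rate = pass_rate(tasks, outputs)
print(f"pass rate: {rate:.2f}%")
```

Giving each task its own checker lets a calendar task and an email task be judged by different criteria while still contributing to a single pass-rate figure.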

Critical Analysis

The OfficeBench paper presents a comprehensive and well-designed benchmark for evaluating language agents in office automation scenarios. However, the authors acknowledge certain limitations of their approach, such as the challenge of simulating the full complexity of real-world office dynamics and the potential bias introduced by the specific dataset and task selection.

Additionally, the paper's own results show how much room for improvement remains. According to the abstract, GPT-4 Omni achieves the highest pass rate at 47.00%, which is still far below human performance and the accuracy that real-world office workflows require. Most failures involve redundant operations, hallucinations, and difficulty switching between applications, which points to where further research and comparative studies should focus.

Conclusion

The OfficeBench benchmark represents a significant step forward in evaluating the capabilities of language-based AI agents in office automation scenarios. By covering a diverse range of tasks, the benchmark aims to provide a more realistic assessment of an agent's suitability for real-world office environments.

The development of this benchmark is an important contribution to the field, as it can help drive the advancement of language models and their applications in office productivity and automation. As the research in this area continues to evolve, the insights gained from OfficeBench can inform the design of more effective and versatile language agents for office-related tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

Office automation significantly enhances human productivity by automatically finishing routine tasks in the workflow. Beyond the basic information extraction studied in much of the prior document AI literature, the office automation research should be extended to more realistic office tasks which require to integrate various information sources in the office system and produce outputs through a series of decision-making processes. We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic office workflows. OfficeBench requires LLM agents to perform feasible long-horizon planning, proficiently switch between applications in a timely manner, and accurately ground their actions within a large combined action space, based on the contextual demands of the workflow. Applying our customized evaluation methods on each task, we find that GPT-4 Omni achieves the highest pass rate of 47.00%, demonstrating a decent performance in handling office tasks. However, this is still far below the human performance and accuracy standards required by real-world office workflows. We further observe that most issues are related to operation redundancy and hallucinations, as well as limitations in switching between multiple applications, which may provide valuable insights for developing effective agent frameworks for office automation.

7/30/2024

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as Play the next song, as well as longer horizon tasks such as Send an email to John Doe mentioning the time and place to meet. Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.

7/23/2024

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses annotated 9,641 examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.

8/22/2024

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.

7/23/2024