Octopus v2: On-device language model for super agent

2404.01744

Published 4/17/2024 by Wei Chen, Zhiyuan Li

Abstract

Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method enhances latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.

Overview

Presents a new on-device language model called Octopus v2 for enhancing the capabilities of software agents
Leverages large language models to enable more natural and flexible interactions with software agents
Focuses on improving the agent's ability to understand and respond to natural language commands

Plain English Explanation

Octopus v2 is a new language model designed to be deployed on devices to enhance the capabilities of software agents, such as virtual assistants or chatbots. The key idea is to use large language models, which are powerful machine learning models trained on vast amounts of text data, to enable more natural and flexible interactions between users and software agents.

Traditionally, software agents have relied on predefined commands or templates to understand and respond to user input. However, this can be limiting, as users may want to interact with the agent in more natural, conversational ways. By incorporating a large language model like Octopus v2, the agent can better understand the context and intent behind a user's request, allowing for more nuanced and helpful responses.

For example, instead of having to say "Call John Smith" to initiate a phone call, a user could say "I need to talk to John about the project" and the agent would recognize the intent to make a call. This can make the interaction feel more natural and intuitive for the user.
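One way to picture this is that the model turns the sentence into a structured function call, which the agent then validates and executes. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name make_call, its contact parameter, and the hard-coded model output are hypothetical placeholders chosen for illustration.

```python
import ast

# Hypothetical device API; the name and parameter are illustrative only.
def make_call(contact: str) -> str:
    return f"Calling {contact}..."

# Registry of functions the agent is allowed to invoke.
ALLOWED_FUNCTIONS = {"make_call": make_call}

def dispatch(model_output: str) -> str:
    """Parse a call string such as make_call(contact='John') and run it."""
    tree = ast.parse(model_output.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError("model output is not a simple function call")
    name = call.func.id
    if name not in ALLOWED_FUNCTIONS:
        raise ValueError(f"unknown function: {name}")
    if call.args or any(kw.arg is None for kw in call.keywords):
        raise ValueError("only plain keyword arguments are supported")
    # literal_eval accepts only literals, so no arbitrary code is executed.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return ALLOWED_FUNCTIONS[name](**kwargs)

# Assumed stand-in for what a model might emit for the request
# "I need to talk to John about the project".
model_output = "make_call(contact='John Smith')"
result = dispatch(model_output)
print(result)
```

Keeping an explicit registry of allowed functions means the agent only runs calls it knows about, whatever text the model produces.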

The paper also discusses how Octopus v2 and similar language model-based approaches can be used to enhance the general capabilities of software agents, enabling them to assist with a wider range of tasks beyond just command execution.

Technical Explanation

The Octopus v2 paper presents a new on-device language model designed to improve the natural language understanding and generation capabilities of software agents. The key innovation is the use of a compact, efficient language model (2 billion parameters, according to the abstract) that can be deployed directly on the device, rather than relying on a remote server.

The authors leverage insights from recent work on large language models for spoken language understanding to develop a model that can understand and respond to natural language commands and queries. The model is trained on a diverse dataset of user interactions, allowing it to learn patterns and associations that enable more flexible and contextual interpretation of user input.

The paper also discusses techniques for enhancing the general capabilities of software agents using low-parameter language models, which can be particularly useful for deploying on resource-constrained devices.

Critical Analysis

The Octopus v2 paper presents a promising approach for improving the natural language understanding and generation capabilities of software agents. By leveraging large language models, the authors demonstrate how agents can engage in more natural, conversational interactions with users.

One potential limitation discussed in the paper is the need to carefully manage the trade-offs between model size, performance, and deployment constraints, especially for on-device implementations. The authors acknowledge that further research may be needed to find the right balance for different application scenarios.
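To make the size side of this trade-off concrete, the sketch below estimates how much storage the weights of a 2-billion-parameter model (the size given in the abstract) would need at a few numeric precisions. The precisions listed are common choices assumed here for illustration; the paper summary does not say which one is used. The estimate covers weights only and ignores activations, caches, and runtime overhead.

```python
# Parameter count taken from the paper's abstract (2 billion).
PARAMS = 2_000_000_000

# Bytes per weight for some common precisions (assumed, not from the paper).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 1e9

footprints = {
    name: weight_memory_gb(PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.1f} GB")
```

Under these assumptions the weights alone range from roughly 8 GB at 32-bit precision down to about 1 GB at 4-bit precision, which is the kind of gap that decides whether a model fits on a given edge device.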

Additionally, the paper does not address some of the broader challenges around large language model-based autonomous agents, such as safety, fairness, and transparency. These areas will likely require further exploration as work on Transformer-Lite and other efficient language model deployments continues to evolve.

Conclusion

The Octopus v2 paper presents an innovative approach to enhancing the natural language capabilities of software agents through the use of a compact, on-device language model. By leveraging large language models, the authors demonstrate how agents can engage in more flexible and contextual interactions, moving beyond the limitations of traditional command-based systems.

This work has the potential to significantly improve the user experience and overall capabilities of a wide range of software agents, from virtual assistants to chatbots. As the field of large language model-based autonomous agents continues to evolve, the insights and techniques presented in the Octopus v2 paper will likely be valuable for researchers and developers seeking to push the boundaries of agent-user interaction.

Related Papers

Octopus: On-device language model for function calling of software APIs

Wei Chen, Zhiyuan Li, Mingyuan Ma

In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs. We meticulously compile a dataset derived from software API documentation and apply fine-tuning to LLMs with capacities of 2B, 3B and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models' grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose conditional masking techniques to ensure outputs in the desired formats and reduce error rates while maintaining inference speeds. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is proved to have better performance than GPT-4 for the software APIs calling. This research aims to advance automated software development and API integration, representing substantial progress in aligning LLM capabilities with the demands of practical software engineering applications.

4/3/2024

cs.CL cs.SE

Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

Wei Chen, Zhiyuan Li

A multimodal AI agent is characterized by its ability to process and learn from various types of data, including natural language, visual, and audio inputs, to inform its actions. Despite advancements in large language models that incorporate visual data, such as GPT-4V, effectively translating image-based data into actionable outcomes for AI agents continues to be challenging. In this paper, we introduce a multimodal model that incorporates the concept of functional token specifically designed for AI agent applications. To ensure compatibility with edge devices, our model is optimized to a compact size of less than 1B parameters. Like GPT-4, our model can process both English and Chinese. We demonstrate that this model is capable of operating efficiently on a wide range of edge devices, including devices as constrained as a Raspberry Pi.

4/19/2024

cs.CL cs.CV

Octopus v4: Graph of language models

Wei Chen, Zhiyuan Li

Language models have been effective in a wide range of applications, yet the most sophisticated models are often proprietary. For example, GPT-4 by OpenAI and various models by Anthropic are expensive and consume substantial energy. In contrast, the open-source community has produced competitive models, like Llama3. Furthermore, niche-specific smaller language models, such as those tailored for legal, medical or financial tasks, have outperformed their proprietary counterparts. This paper introduces a novel approach that employs functional tokens to integrate multiple open-source models, each optimized for particular tasks. Our newly developed Octopus v4 model leverages functional tokens to intelligently direct user queries to the most appropriate vertical model and reformat the query to achieve the best performance. Octopus v4, an evolution of the Octopus v1, v2, and v3 models, excels in selection and parameter understanding and reformatting. Additionally, we explore the use of graph as a versatile data structure that effectively coordinates multiple open-source models by harnessing the capabilities of the Octopus model and functional tokens. Use our open-sourced GitHub (https://www.nexa4ai.com/) to try Octopus v4 models (https://huggingface.co/NexaAIDev/Octopus-v4), and contribute to a larger graph of language models. By activating models less than 10B parameters, we achieved SOTA MMLU score of 74.8 among the same level models.

5/1/2024

cs.CL

Training a Vision Language Model as Smartphone Assistant

Nicolai Dorka, Janusz Marecki, Ammar Anwar

Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.

4/16/2024

cs.LG cs.AI cs.CV cs.HC