Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

2404.11459

Published 4/19/2024 by Wei Chen, Zhiyuan Li

Abstract

A multimodal AI agent is characterized by its ability to process and learn from various types of data, including natural language, visual, and audio inputs, to inform its actions. Despite advancements in large language models that incorporate visual data, such as GPT-4V, effectively translating image-based data into actionable outcomes for AI agents continues to be challenging. In this paper, we introduce a multimodal model that incorporates the concept of functional tokens, specifically designed for AI agent applications. To ensure compatibility with edge devices, our model is optimized to a compact size of less than 1B parameters. Like GPT-4, our model can process both English and Chinese. We demonstrate that this model is capable of operating efficiently on a wide range of edge devices, including devices as constrained as a Raspberry Pi.
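
The abstract does not spell out how a functional token is consumed once the model has produced it. A minimal sketch follows, assuming that a functional token is a special vocabulary token standing for one callable action and that the decoded output carries the token followed by its arguments. The token names (`<fn_0>`, `<fn_1>`, `<fn_end>`), the output format, and the two example functions are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical actions; the paper's real function set is not given in this summary.
def send_email(recipient: str, subject: str) -> str:
    return f"email to {recipient}: {subject}"

def take_photo(camera: str) -> str:
    return f"photo taken with {camera} camera"

# Hypothetical registry: one special token per callable action.
FUNCTIONAL_TOKENS = {
    "<fn_0>": send_email,
    "<fn_1>": take_photo,
}

# Assumed output shape: <fn_K>('arg1', 'arg2')<fn_end>
CALL_PATTERN = re.compile(r"(<fn_\d+>)\((.*?)\)<fn_end>")

def dispatch(model_output: str) -> str:
    """Map a decoded functional token and its arguments to a function call."""
    match = CALL_PATTERN.search(model_output)
    if match is None:
        raise ValueError("no functional token found in model output")
    token, raw_args = match.groups()
    if token not in FUNCTIONAL_TOKENS:
        raise KeyError(f"unknown functional token: {token}")
    # Arguments are assumed to be single-quoted strings separated by commas.
    args = re.findall(r"'([^']*)'", raw_args)
    return FUNCTIONAL_TOKENS[token](*args)

print(dispatch("<fn_1>('front')<fn_end>"))
```

Under these assumptions the model only has to emit one short token to select an action, and the calling code handles the lookup. The function descriptions then do not need to be repeated in every prompt.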

Overview

This technical report introduces Octopus v3, a sub-billion parameter multimodal AI agent designed to run on-device.
The paper explores methods for building highly capable AI systems that can operate on mobile and edge devices with limited computational resources.
Key innovations include efficient multimodal fusion, mobile-optimized model architectures, and unsupervised pretraining techniques.

Plain English Explanation

The researchers have developed a new type of artificial intelligence (AI) system called Octopus v3. Unlike many large AI models that require powerful computers to run, Octopus v3 is designed to work on smaller, mobile devices like smartphones and tablets.

Two earlier papers, Octopus v2: On-device language model for super agent and Octopus: On-device language model for function calling of software APIs, describe previous versions of this technology. The latest version, Octopus v3, builds on that work to create a more capable and efficient multimodal AI agent that can understand and generate text, images, and other data types.

The key innovations in Octopus v3 include:

Efficient multimodal fusion: A way to combine information from different data sources (like text and images) that is optimized for mobile devices.
Mobile-optimized model architectures: Model designs that are streamlined to run smoothly on smartphones, tablets, and other resource-constrained hardware.
Unsupervised pretraining techniques: Methods for training the AI system on large datasets without manual labeling, which can make it more adaptable and capable.

By developing this type of on-device AI, the researchers aim to unlock the potential of intelligent, autonomous agents that can operate directly on mobile devices, without needing to send data to powerful cloud servers. This could enable a wide range of new applications and services that are more private, responsive, and energy-efficient.

Technical Explanation

The core of Octopus v3 is a multimodal transformer-based architecture that can process and generate text, images, and other data types. Earlier and related work is described in Octopus: On-device language model for function calling of software APIs and Mobile Agent: Autonomous Multi-modal Mobile Device.

To enable efficient on-device operation, the researchers use a number of key techniques:

Multimodal Fusion: They introduce a novel multimodal fusion module that can effectively combine information from different input modalities, while maintaining a compact model size.
Mobile-Optimized Architectures: The overall model architecture is streamlined and optimized for mobile devices, with careful attention paid to parameters, compute, and memory usage.
Unsupervised Pretraining: Octopus v3 is pretrained on large unlabeled datasets using self-supervised techniques, which can imbue the model with rich knowledge and capabilities without the need for expensive manual annotations.
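
This summary does not describe the internals of the fusion module. As a rough illustration of one common fusion pattern, the sketch below projects image patch features to the width of the text embeddings and concatenates the two sequences so a single transformer can attend over both. All sizes, the random weights, and the function names are assumptions made for the example; this is not the authors' module.

```python
import random

def linear_project(vectors, weights):
    """Multiply each input vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]

def fuse(image_patches, text_embeddings, weights):
    """Project image patches to the text width, then prepend them to the text tokens."""
    image_tokens = linear_project(image_patches, weights)
    return image_tokens + text_embeddings

random.seed(0)
D_IMAGE, D_TEXT = 6, 4  # toy feature widths, chosen only for illustration

# Three image patches and five text tokens filled with random toy values.
image_patches = [[random.random() for _ in range(D_IMAGE)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(D_TEXT)] for _ in range(5)]
# Stand-in for a learned projection matrix.
weights = [[random.gauss(0.0, 0.1) for _ in range(D_TEXT)] for _ in range(D_IMAGE)]

fused = fuse(image_patches, text_embeddings, weights)
print(len(fused), len(fused[0]))  # 8 4
```

The only parameters this kind of fusion adds are those of the projection matrix, which is one reason the pattern suits compact models.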

These innovations allow Octopus v3 to achieve strong performance on a variety of multimodal tasks while keeping the total model size under 1 billion parameters, small enough to run efficiently on modern mobile devices.
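
To see why the 1 billion parameter ceiling matters on memory-constrained hardware, the back-of-the-envelope calculation below estimates the memory needed for the weights alone at several numeric precisions. The precisions listed are assumptions, since this summary does not say which one Octopus v3 uses. The estimate also ignores activations, caches, and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights only, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Upper bound from the paper: fewer than 1B parameters.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(1_000_000_000, dtype):.2f} GiB")
# float32: 3.73 GiB
# float16: 1.86 GiB
# int8: 0.93 GiB
# int4: 0.47 GiB
```

Because the model has fewer than 1 billion parameters, these figures are upper bounds for each precision.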

Critical Analysis

The researchers have made some compelling advancements in on-device multimodal AI with Octopus v3. Review of Multi-modal Large Language and Vision Models provides helpful context on the broader landscape of these types of models.

One potential limitation is that the paper does not provide a detailed quantitative comparison of Octopus v3's performance against other state-of-the-art on-device models. It would be useful to see how it fares on standardized benchmarks to better understand its capabilities relative to the competition.

Additionally, the authors note that Octopus v3 is still a sub-billion parameter model, which means it may not be able to match the sheer scale and breadth of capabilities offered by the largest multimodal AI systems. Further research would be needed to understand the practical tradeoffs and use cases for this type of compact, on-device agent versus more resource-intensive cloud-based models.

Overall, the work presented in this technical report represents an important step forward in making powerful AI systems more accessible and practical for real-world mobile and edge computing applications. The innovations around multimodal fusion, mobile-optimized architectures, and unsupervised pretraining are insightful and worthy of further exploration and development.

Conclusion

The Octopus v3 technical report describes a new sub-billion parameter multimodal AI agent designed to run efficiently on mobile and edge devices. Key innovations include efficient multimodal fusion, mobile-optimized model architectures, and unsupervised pretraining techniques.

By developing this type of on-device AI capability, the researchers aim to unlock new applications and services that are more private, responsive, and energy-efficient than traditional cloud-based approaches. While there are still some open questions and potential limitations, Octopus v3 represents an important advancement in making powerful AI more accessible and practical for a wide range of real-world use cases.

Omnifusion: Technical Report provides additional context on related work in multimodal AI fusion techniques.

Related Papers

Octopus v2: On-device language model for super agent

Wei Chen, Zhiyuan Li

Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method improves latency 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.

4/17/2024

cs.CL

Octopus v4: Graph of language models

Wei Chen, Zhiyuan Li

Language models have been effective in a wide range of applications, yet the most sophisticated models are often proprietary. For example, GPT-4 by OpenAI and various models by Anthropic are expensive and consume substantial energy. In contrast, the open-source community has produced competitive models, like Llama3. Furthermore, niche-specific smaller language models, such as those tailored for legal, medical or financial tasks, have outperformed their proprietary counterparts. This paper introduces a novel approach that employs functional tokens to integrate multiple open-source models, each optimized for particular tasks. Our newly developed Octopus v4 model leverages functional tokens to intelligently direct user queries to the most appropriate vertical model and reformat the query to achieve the best performance. Octopus v4, an evolution of the Octopus v1, v2, and v3 models, excels in selection and parameter understanding and reformatting. Additionally, we explore the use of graph as a versatile data structure that effectively coordinates multiple open-source models by harnessing the capabilities of the Octopus model and functional tokens. Use our open-sourced GitHub (https://www.nexa4ai.com/) to try Octopus v4 models (https://huggingface.co/NexaAIDev/Octopus-v4), and contribute to a larger graph of language models. By activating models of less than 10B parameters, we achieved a SOTA MMLU score of 74.8 among models of the same level.

5/1/2024

cs.CL

Octopus: On-device language model for function calling of software APIs

Wei Chen, Zhiyuan Li, Mingyuan Ma

In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs. We meticulously compile a dataset derived from software API documentation and apply fine-tuning to LLMs with capacities of 2B, 3B and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models' grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose conditional masking techniques to ensure outputs in the desired formats and reduce error rates while maintaining inference speeds. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is shown to outperform GPT-4 at calling software APIs. This research aims to advance automated software development and API integration, representing substantial progress in aligning LLM capabilities with the demands of practical software engineering applications.

4/3/2024

cs.CL cs.SE

An Embodied Generalist Agent in 3D World

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models show notable successes in building generalist agents that exhibit the capability of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains as these models exhibit limited ability in understanding and interacting with the 3D world. We argue this limitation significantly hinders the current models from performing real-world tasks and further achieving general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset comprising object-level and scene-level multi-modal tasks with exceeding scale and complexity, necessitating a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.

4/22/2024

cs.CV cs.AI cs.CL cs.LG