Aria: An Open Multimodal Native Mixture-of-Experts Model

Read original: arXiv:2410.05993 - Published 10/10/2024 by Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li

Overview

Aria is a multimodal machine learning model that handles text, images, and other modalities within a single model.
It is an "open" model: its weights and codebase are publicly released.
Aria uses a "mixture-of-experts" architecture, in which only a subset of the neural network components ("experts") is activated for each input token.
This lets Aria perform well across a wide range of multimodal, language, and coding tasks.

Plain English Explanation

Aria is a type of artificial intelligence (AI) system that can work with different kinds of data, like text and images. It's called a "multimodal" model because it can handle multiple types of information at once.

What makes Aria special is that it has a mixture-of-experts architecture. This means the model is made up of several specialized "expert" components, each focused on a particular subtask. For example, one expert might be really good at understanding text, while another is better at analyzing images.

By combining these specialized experts, Aria can tackle a wide variety of multimodal challenges, like generating captions for images or answering questions about the content of a document. And since Aria's architecture and code are publicly available, researchers and developers can explore and build upon this open model.

Technical Explanation

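Aria is a large, open-source multimodal model built on a mixture-of-experts architecture. The model is composed of multiple neural network "experts," and only some of them run for any given token: according to the paper's abstract, Aria activates 3.9B parameters per visual token and 3.5B parameters per text token.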

A "gating network" (often called a router) dynamically sends each input token to the most suitable expert(s) and combines their outputs. This allows Aria to draw on the specialized capabilities of its individual experts while keeping the per-token compute well below what running every expert would cost.
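The routing idea described above can be illustrated with a minimal top-k gating sketch. This is a generic mixture-of-experts toy in plain Python, not Aria's actual implementation: the names `route_token`, `moe_layer`, and `make_expert`, the dimensions, the number of experts, and the choice of `top_k=2` are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, top_k=2):
    """Pick the top_k experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # One router score per expert: dot product of the token with that expert's row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_weights, experts, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = route_token(token, router_weights, top_k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routes


def make_expert(scale):
    """Hypothetical toy expert: scales its input by a constant."""
    return lambda token: [scale * x for x in token]


random.seed(0)
DIM, NUM_EXPERTS = 4, 8  # illustrative sizes, not Aria's
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(s) for s in range(1, NUM_EXPERTS + 1)]

token = [0.5, -1.0, 0.25, 2.0]
output, routes = moe_layer(token, router_weights, experts, top_k=2)
print("selected experts and weights:", routes)
print("layer output:", output)
```

Only two of the eight toy experts run for this token, which is the property that keeps the activated parameter count far below the total parameter count in a sparse mixture-of-experts model.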

Aria's modular design enables it to be extended and customized for a wide range of multimodal applications, such as image captioning, visual question answering, and document question answering. The open-source release of the weights and codebase also facilitates collaborative research and development within the broader AI community.

Critical Analysis

The researchers behind Aria acknowledge several potential limitations and areas for future work. For example, they note that the mixture-of-experts approach can be computationally intensive, especially as the number of experts grows. Additionally, the gating network responsible for routing inputs to the experts may not always make optimal decisions, which could impact the model's overall performance.
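To make the cost trade-off concrete, the sketch below compares total and per-token activated parameter counts as the number of experts grows. It is a simplified back-of-the-envelope model, and every number in it is hypothetical: `SHARED`, `PER_EXPERT`, and `TOP_K` are illustrative values, not figures from the Aria paper.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, activated) parameter counts for a simplified sparse MoE.

    total:     everything that must be stored in memory
    activated: what a single token actually passes through
    """
    total = shared_params + num_experts * params_per_expert
    activated = shared_params + top_k * params_per_expert
    return total, activated


# Hypothetical sizes chosen only for illustration.
SHARED = 1_000_000_000      # attention, embeddings, and other always-on weights
PER_EXPERT = 300_000_000    # weights in one expert
TOP_K = 4                   # experts selected per token

for num_experts in (8, 16, 32, 64):
    total, activated = moe_param_counts(SHARED, PER_EXPERT, num_experts, TOP_K)
    print(f"{num_experts:>3} experts: total={total / 1e9:.1f}B  activated={activated / 1e9:.1f}B")
```

Under these assumptions, the memory footprint grows with the number of experts while the per-token compute stays fixed by `TOP_K`, which is why adding experts raises serving cost even when each token's compute does not change.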

Another area for further research is the interpretability and explainability of the Aria model. As a large, complex system, it may be challenging to understand the reasoning behind its decisions and outputs. Developing techniques to improve the model's transparency could be an important step in building trust and ensuring responsible development of such powerful multimodal AI systems.

Despite these challenges, Aria represents a significant advancement in the field of multimodal machine learning. By embracing an open and modular architecture, the researchers have created a flexible platform that can be further refined and tailored to meet the evolving needs of the AI community and society at large.

Conclusion

Aria is a multimodal model that uses a mixture-of-experts architecture to achieve strong performance across a wide range of tasks. According to its authors, it outperforms Pixtral-12B and Llama3.2-11B and is competitive with the best proprietary models on various multimodal tasks. Its open-source release and modular design make it a valuable tool for researchers and developers working to push the boundaries of what's possible with AI.

While Aria faces some technical challenges, such as computational efficiency and model interpretability, the researchers have laid the groundwork for a highly capable and customizable multimodal system. As the field of AI continues to evolve, models like Aria will play a crucial role in unlocking new applications and driving impactful breakthroughs.

Related Papers

Aria: An Open Multimodal Native Mixture-of-Experts Model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li

Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoptions, let alone adaptations. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-expert model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoptions and adaptations of Aria in real-world applications.

10/10/2024

MAIRA-1: A specialised large multimodal model for radiology report generation

Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, Javier Alvarez-Valle

We present a radiology-specific multimodal model for the task for generating radiological reports from chest X-rays (CXRs). Our work builds on the idea that large language model(s) can be equipped with multimodal capabilities through alignment with pre-trained vision encoders. On natural images, this has been shown to allow multimodal models to gain image understanding and description capabilities. Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality. In particular, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review of model outputs demonstrates promising fluency and accuracy of generated reports while uncovering failure modes not captured by existing evaluation practices. More information and resources can be found on the project website: https://aka.ms/maira.

4/29/2024

A Multimodal Automated Interpretability Agent

Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba

This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.

4/23/2024

MIO: A Foundation Model on Multimodal Tokens

Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.

9/27/2024