Wings: Learning Multimodal LLMs without Text-only Forgetting

Read original: arXiv:2406.03496 - Published 6/6/2024 by Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye


Overview

This paper explores text-only forgetting in multimodal large language models (MLLMs): after fine-tuning on mixed image-text inputs, a model built on a trained LLM loses much of its ability to follow text-only instructions. The authors relate this forgetting to an "attention shift", in which attention moves from pre-image text to post-image text.
The researchers introduce "Wings", an MLLM that adds visual and textual learner modules in parallel with the main attention to compensate for this shift, so the model can learn from visual data without forgetting its text-only capabilities.
The paper presents empirical results showing that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks, addressing a key challenge in the field of multimodal AI.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human language. Some of these models, known as multimodal LLMs, are trained on both text and visual data, allowing them to process and generate content that combines language and images.

One problem with multimodal LLMs is that training on image-text data changes where the model directs its attention, or focus. According to the paper, attention shifts from the text that comes before an image to the text that comes after it, and this shift is linked to the model losing some of its ability to understand and generate text-only content. This "attention shift" issue can limit the usefulness of these models in real-world applications.
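To make the idea of an attention shift concrete, here is a minimal sketch of one way to measure it. The paper's abstract relates text-only forgetting to attention moving from pre-image to post-image text; the function name `attention_mass_by_segment`, the toy weights, and the simple "post minus pre" score are illustrative assumptions, not the paper's actual metric.

```python
import numpy as np

def attention_mass_by_segment(attn_row, image_start, image_end):
    """Split one query's attention weights into three segments.

    attn_row: 1-D array of attention weights over the context (sums to 1).
    image_start, image_end: half-open range [image_start, image_end) of image tokens.
    NOTE: hypothetical helper for illustration, not taken from the paper.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    return {
        "pre_text": attn_row[:image_start].sum(),    # text before the image
        "image": attn_row[image_start:image_end].sum(),
        "post_text": attn_row[image_end:].sum(),     # text after the image
    }

# Toy context: 3 text tokens, 4 image tokens, 3 text tokens (made-up weights).
weights = np.array([0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10, 0.15, 0.15, 0.10])
mass = attention_mass_by_segment(weights, image_start=3, image_end=7)

# A positive value means more attention lands on post-image text than pre-image text.
shift = mass["post_text"] - mass["pre_text"]
print(mass, shift)
```

Comparing this kind of score for a base LLM and its multimodal fine-tune is one plausible way to see whether the balance between pre-image and post-image text has changed.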

The Wings approach introduced in this paper aims to address this problem. By adding small extra modules that work alongside the model's attention layers, Wings helps multimodal LLMs learn effectively from both text and visual data without forgetting their text-only capabilities. This allows the models to maintain strong performance across both text-only and multimodal tasks.

The researchers provide experimental results demonstrating the benefits of the Wings approach, showing that it can improve the multimodal performance of LLMs while preserving their text-only performance. This is an important step forward in developing more versatile and capable multimodal AI systems.

Technical Explanation

The paper begins by analyzing MLLM attention on multimodal instructions and finds that text-only forgetting is related to attention shifting from pre-image to post-image text. To compensate, the researchers propose Wings, which adds extra learner modules connected in parallel within each layer's attention block, like wings on either side. The design has three key components:

Visual Learners: Image and text inputs are first aligned while visual learners operate alongside the main attention, balancing the model's focus on visual elements.
Textual Learners with Attention-based Routing: Textual learners are integrated later, and attention-based routing blends the outputs of the visual and textual learners.
Low-Rank Residual Attention (LoRRA): The learners are built on a low-rank residual attention design intended to keep them highly efficient.
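The paper's abstract describes Wings as adding visual and textual learners in parallel with the main attention and blending them with attention-based routing. The following is a rough NumPy sketch of that general pattern, not the authors' implementation: the class `LowRankLearner`, the function `wings_like_block`, the router feature (how much attention lands on image tokens), and the zero-initialized up-projection are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LowRankLearner:
    """Hypothetical low-rank cross-attention branch (NOT the paper's exact LoRRA)."""
    def __init__(self, d_model, rank):
        s = 1.0 / np.sqrt(d_model)
        self.wq = rng.normal(0, s, (d_model, rank))  # down-project queries
        self.wk = rng.normal(0, s, (d_model, rank))  # down-project keys
        self.wv = rng.normal(0, s, (d_model, rank))  # down-project values
        self.wo = np.zeros((rank, d_model))          # assumed zero-init: branch starts as a no-op

    def __call__(self, hidden, visual):
        q, k, v = hidden @ self.wq, visual @ self.wk, visual @ self.wv
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return (scores @ v) @ self.wo                # (T, d_model)

def wings_like_block(main_out, attn, hidden, visual, image_slice,
                     vis_learner, txt_learner, router_w):
    # Assumed router feature: share of each token's main attention on image tokens.
    image_mass = attn[:, image_slice].sum(axis=-1, keepdims=True)    # (T, 1)
    feats = np.concatenate([image_mass, 1.0 - image_mass], axis=-1)  # (T, 2)
    gate = softmax(feats @ router_w)                                  # (T, 2), rows sum to 1
    # Residual combination: main attention output plus gated learner outputs.
    return (main_out
            + gate[:, :1] * vis_learner(hidden, visual)
            + gate[:, 1:] * txt_learner(hidden, visual))

# Toy usage with made-up sizes: 6 tokens, of which tokens 2..3 are image tokens.
T, d_model, rank = 6, 16, 4
image_slice = slice(2, 4)
hidden = rng.normal(size=(T, d_model))
attn = softmax(rng.normal(size=(T, T)))   # stand-in for the main attention weights
main_out = attn @ hidden                  # stand-in for the main attention output
vis, txt = LowRankLearner(d_model, rank), LowRankLearner(d_model, rank)
router_w = rng.normal(size=(2, 2))
out = wings_like_block(main_out, attn, hidden, hidden[image_slice],
                       image_slice, vis, txt, router_w)
```

With the zero-initialized up-projections assumed here, `out` equals `main_out` before any training, so the added branches cannot disturb the original LLM at the start. That is a common choice for residual adapters and is used here only as a plausible design, not as a claim about the paper.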

The researchers evaluate Wings on text-only and visual question-answering tasks, as well as on a newly constructed Interleaved Image-Text (IIT) benchmark that ranges from text-only-rich to multimodal-rich question answering. The results show that Wings outperforms equally-scaled MLLMs on both text-only and visual question answering, indicating that it mitigates text-only forgetting while retaining strong multimodal performance.

Critical Analysis

The paper provides a well-designed and thorough investigation of the attention shift problem in multimodal LLMs, offering a novel solution in the form of the Wings approach. The experimental results are promising and demonstrate the potential of Wings to enable more robust and versatile multimodal AI systems.

However, the paper does not discuss certain limitations or potential concerns. For example, the effectiveness of Wings may be dependent on the specific architecture and training regime of the underlying LLM, and it's unclear how well the approach would generalize to a wider range of multimodal tasks and datasets.

Additionally, although LoRRA is designed to keep the learners efficient, adding parallel learners and a router to every attention block still introduces some computational and memory overhead, which could be a concern for deploying these models in resource-constrained environments.
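As a rough feel for the scale of that overhead, the arithmetic below compares the parameter count of one full square projection with a low-rank pair of projections. The hidden size of 4096 and rank of 16 are illustrative assumptions, not values from the paper, and the helper `projection_params` is hypothetical.

```python
def projection_params(d_model, rank=None):
    """Parameter count of one projection matrix (bias terms ignored).

    rank=None: a full d_model x d_model matrix.
    rank=r:    a low-rank pair, d_model x r followed by r x d_model.
    """
    if rank is None:
        return d_model * d_model
    return 2 * d_model * rank

d_model, rank = 4096, 16                 # assumed sizes, for illustration only
full = projection_params(d_model)        # 16,777,216 parameters
low = projection_params(d_model, rank)   # 131,072 parameters
ratio = low / full                       # 0.0078125, i.e. 1/128 of the full matrix
print(full, low, ratio)
```

Under these assumed sizes a low-rank projection adds under 1% of the parameters of a full one, though the total cost still grows with the number of layers and learners.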

Further research could explore the scalability of Wings, its performance on a more diverse set of multimodal benchmarks, and potential trade-offs or edge cases that may arise when applying the approach to different types of multimodal LLMs or tasks.

Conclusion

The Wings approach presented in this paper offers a promising solution to the attention shift problem in multimodal LLMs. By preserving the text-only capabilities of these models while enhancing their multimodal performance, Wings represents an important step forward in the development of more versatile and capable multimodal AI systems.

The empirical results demonstrate the effectiveness of the Wings approach, and the underlying ideas behind it could inspire further advancements in the field of multimodal machine learning. As the research in this area continues to evolve, addressing challenges like attention shift will be crucial for unlocking the full potential of multimodal AI in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Wings: Learning Multimodal LLMs without Text-only Forgetting

Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like wings on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.

6/6/2024


AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

Jun Gao, Qian Qiao, Ziqiang Cao, Zili Wang, Wenjie Li

In-context learning (ICL) facilitates Large Language Models (LLMs) exhibiting emergent ability on downstream tasks without updating billions of parameters. However, in the area of multi-modal Large Language Models (MLLMs), two problems hinder the application of multi-modal ICL: (1) Most primary MLLMs are only trained on single-image datasets, making them unable to read multi-modal demonstrations. (2) With the demonstrations increasing, thousands of visual tokens highly challenge hardware and degrade ICL performance. During preliminary explorations, we discovered that the inner LLM tends to focus more on the linguistic modality within multi-modal demonstrations to generate responses. Therefore, we propose a general and light-weighted framework AIM to tackle the mentioned problems through Aggregating Image information of Multimodal demonstrations to the dense latent space of the corresponding linguistic part. Specifically, AIM first uses the frozen backbone MLLM to read each image-text demonstration and extracts the vector representations on top of the text. These vectors naturally fuse the information of the image-text pair, and AIM transforms them into fused virtual tokens acceptable for the inner LLM via a trainable projection layer. Ultimately, these fused tokens function as variants of multi-modal demonstrations, fed into the MLLM to direct its response to the current query as usual. Because these fused tokens stem from the textual component of the image-text pair, a multi-modal demonstration is nearly reduced to a pure textual demonstration, thus seamlessly applying to any MLLMs. With its de facto MLLM frozen, AIM is parameter-efficient and we train it on public multi-modal web corpora which have nothing to do with downstream test tasks.

7/2/2024

Parrot: Multilingual Visual Instruction Tuning

Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available. Code is available at: https://github.com/AIDC-AI/Parrot.

8/13/2024

The Revolution of Multimodal Large Language Models: A Survey

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.

6/7/2024