TextSquare: Scaling up Text-Centric Visual Instruction Tuning

arXiv:2404.12803

Published 4/22/2024 by Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao and 6 others

cs.CV cs.LG

Abstract

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
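One way to read the third finding is as a log-linear relationship: if the data volume must grow exponentially for the score to grow by a fixed amount, then the score is roughly linear in the logarithm of the dataset size. The sketch below fits score = a + b * log10(size) by ordinary least squares. The function name `fit_log_linear` and the data points are illustrative assumptions; the numbers are made up to lie exactly on a line and are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(sizes: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical points generated from score = 20 + 5 * log10(size).
# They are NOT measurements from the paper.
sizes = [1e5, 1e6, 1e7]
scores = [45.0, 50.0, 55.0]
a, b = fit_log_linear(sizes, scores)
```

Under this reading, the slope `b` is the number of points gained for every tenfold increase in instruction-tuning data, so a constant `b` across scales corresponds to the pattern the abstract describes.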

Overview

This paper introduces Square-10M, a large instruction-tuning dataset for text-centric visual question answering (VQA), together with TextSquare, a model trained on it.
The dataset is generated with closed-source MLLMs through a four-step process called Square: Self-Questioning, Answering, Reasoning, and Evaluation.
TextSquare reaches 62.2% on OCRBench, outperforms GPT4V and Gemini on 6 of 10 text-centric benchmarks, and averages 75.1% across four general VQA and hallucination evaluation datasets.

Plain English Explanation

The paper introduces a new system called TextSquare that aims to make it easier to train large AI models to perform specific tasks using both text instructions and visual inputs. This is an important area of AI research, as being able to combine language and vision can unlock powerful capabilities for tasks like image captioning, visual question answering, and more.

The key insight behind TextSquare is that a large part of the gap between open-source models and leading models like GPT4V and Gemini comes from training data: open-source models lack extensive, high-quality instruction-tuning data. The authors therefore use closed-source MLLMs to generate such a dataset, Square-10M, in four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Training on this data lets TextSquare surpass previous open-source text-centric models and outperform GPT4V and Gemini on 6 of 10 text-centric benchmarks.

In other words, TextSquare shows that scaling up high-quality training data is an effective way to build AI systems that can read and reason about text in images. The reasoning data in particular improves accuracy and reduces hallucinations, and the authors report that performance keeps improving as the amount of instruction-tuning data grows.

Technical Explanation

The authors build TextSquare by instruction-tuning on Square-10M, a dataset generated with closed-source MLLMs. The abstract reports the following:

Square Data Construction: The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation.
VQA Reasoning Data: Reasoning data gives the model comprehensive contextual insight for specific questions, which improves accuracy and significantly mitigates hallucinations. TextSquare averages 75.1% across four general VQA and hallucination evaluation datasets.
Data Scaling: Model performance improves in direct proportion to exponential increases in instruction-tuning data volume, which the authors take as validation of both the scale and the quality of Square-10M.

With this data, TextSquare surpasses previous open-source state-of-the-art text-centric MLLMs, sets a new standard on OCRBench (62.2%), and outperforms GPT4V and Gemini on 6 of 10 text-centric benchmarks.
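The abstract names four steps in the Square data construction process: Self-Questioning, Answering, Reasoning, and Evaluation. The following is a minimal sketch of how such a pipeline could be wired together, not the authors' implementation. The `MLLM` callable, the `Sample` record, the prompt wording, and the `fake_mllm` stub are all hypothetical; a real pipeline would call a closed-source MLLM and would likely use far more careful prompts and filtering.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an MLLM call: (image reference, prompt) -> text.
MLLM = Callable[[str, str], str]


@dataclass
class Sample:
    image: str
    question: str
    answer: str
    reasoning: str


def square_pipeline(image: str, mllm: MLLM) -> List[Sample]:
    """Run the four named steps on one image; keep samples that pass evaluation."""
    # 1) Self-Questioning: the model proposes questions about the image.
    raw = mllm(image, "Propose questions about the text in this image, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    kept: List[Sample] = []
    for question in questions:
        # 2) Answering: the model answers its own question.
        answer = mllm(image, f"Answer the question: {question}")
        # 3) Reasoning: the model explains the answer.
        reasoning = mllm(image, f"Explain why the answer to '{question}' is '{answer}'.")
        # 4) Evaluation: the model judges the pair; rejected pairs are dropped.
        verdict = mllm(image, f"Is this QA pair valid? Reply yes or no. Q: {question} A: {answer}")
        if verdict.strip().lower().startswith("yes"):
            kept.append(Sample(image, question, answer, reasoning))
    return kept


def fake_mllm(image: str, prompt: str) -> str:
    """Deterministic stub used only to exercise the pipeline (assumption, not a real model)."""
    if prompt.startswith("Propose"):
        return "What is the title?\nWhat is the price?"
    if prompt.startswith("Answer"):
        return "unknown" if "price" in prompt else "TextSquare"
    if prompt.startswith("Explain"):
        return "The text appears in the image."
    # Evaluation prompt: reject pairs whose answer is 'unknown'.
    return "no" if "A: unknown" in prompt else "yes"


samples = square_pipeline("receipt.png", fake_mllm)
```

With the stub above, the question whose answer is "unknown" is filtered out in the evaluation step, which illustrates why an explicit evaluation stage matters when the same kind of model both generates and answers the questions.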

Critical Analysis

The authors acknowledge several limitations and avenues for future work in the paper. For example, they note that TextSquare's performance may degrade on tasks requiring more complex reasoning or multi-step problem-solving. Additionally, the authors suggest that further research is needed to better understand the interplay between text and visual encoding, as well as to explore more efficient cross-modal fusion mechanisms.

One potential area of concern is the reliance on pre-training on large-scale datasets, which can introduce biases and limit the model's generalization to more diverse or niche applications. It would be valuable to investigate techniques for making TextSquare more robust and adaptable to different domains and data distributions.

Furthermore, the authors do not provide a detailed analysis of the computational efficiency and inference latency of TextSquare compared to other models. This information would be important for understanding the practical implications and real-world deployment potential of the system.

Overall, the TextSquare approach represents a promising step forward in the development of efficient and scalable multimodal AI models. However, continued research and careful consideration of the potential limitations and societal implications will be crucial to ensure the responsible and impactful deployment of such technologies.

Conclusion

The TextSquare paper offers a data-centric approach to scaling up text-centric visual instruction tuning, a key challenge in the field of multimodal AI. By generating Square-10M with a four-step process of Self-Questioning, Answering, Reasoning, and Evaluation, the authors demonstrate state-of-the-art performance among open-source models on text-centric benchmarks, including 62.2% on OCRBench.

While the paper highlights several promising avenues for further research, such as improving the model's reasoning capabilities and exploring more diverse applications, the TextSquare approach represents an important step forward in the development of powerful and efficient multimodal AI systems. As the field continues to evolve, innovations like TextSquare will be crucial for unlocking new capabilities and paving the way for real-world applications that can truly leverage the combined power of language and vision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring the Capabilities of Large Multimodal Models on Dense Text

Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai

While large multi-modal models (LMM) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMM in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMM: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMM in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.

5/14/2024

cs.CL cs.AI

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but there is limited research focusing on their ability to generate data by converting unlabeled images into visual instruction tuning data. To this end, this paper is the first to explore the potential of empowering MLLM to generate data rather than prompting GPT-4. We introduce Genixer, a holistic data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer.

5/21/2024

cs.CV cs.AI

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial visual-textual misalignment problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still considerable room for performance improvement, underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We hope that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.

6/12/2024

cs.CV

MAmmoTH2: Scaling Instructions from the Web

Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.

5/24/2024

cs.CL