CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

2402.04236

Published 5/24/2024 by Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong and 1 other

cs.CV cs.CL

Abstract

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn multi-image, and (4) a model training process for versatile capabilities. With the design, we also manually annotate 6K high-quality samples for the challenging graphical mathematical problems. Our trained model, CogCoM, equipped with this mechanism with 17B parameters achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating the effectiveness while preserving the interpretability. Our code, model weights, and collected data are publicly available at https://github.com/THUDM/CogCoM.

Overview

Vision-Language Models (VLMs) have shown impressive performance by aligning visual instructions to responses through extensive training.
However, this training approach can lead VLMs to ignore essential visual reasoning, resulting in failures on complex visual problems and unreliable responses.
This paper introduces a new mechanism called Chain of Manipulations, inspired by how humans solve visual problems step-by-step.
The mechanism lets VLMs solve visual problems by invoking intrinsic manipulations (e.g., grounding, zooming in) that return intermediate results (e.g., boxes, images) without external tools, and it lets users trace the causes of errors.

Plain English Explanation

Vision-Language Models (VLMs) are AI systems that can understand and generate language while also processing visual information. These models have become quite effective at completing tasks that involve both text and images, thanks to their extensive training on aligning visual instructions to the appropriate responses.

However, this training approach can lead VLMs to overlook the reasoning that complex visual problems require. As a result, these models may fail at visual tasks that demand close attention to detail, or give responses that are not faithful to what the image actually shows.

Inspired by how humans solve visual problems, the researchers in this paper introduce a new mechanism called Chain of Manipulations. This mechanism enables VLMs to tackle various visual problems by breaking them down into a series of intrinsic manipulations, such as grounding visual elements or zooming in on specific areas. The model can then provide step-by-step results, while also allowing users to trace the causes of any errors.

The key idea is to give VLMs the ability to actively manipulate and reason about visual information, rather than relying solely on aligning visual instructions to responses. This could lead to more robust and interpretable performance on challenging visual tasks.
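
To make the idea of a traceable, step-by-step solution more concrete, here is a minimal sketch of how a chain of manipulations might be recorded. It is an illustration only, not the paper's data format: the class names `Step` and `Chain`, the `ok` flag, and the example question are all hypothetical, and only the manipulation names "grounding" and "zoom_in" are taken from the examples mentioned above.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    """One reasoning step: a manipulation and the result it returned."""
    manipulation: str   # e.g. "grounding" or "zoom_in"
    argument: str       # what the manipulation was applied to (hypothetical format)
    result: Any         # e.g. a box (x0, y0, x1, y1) or a reference to a new image
    ok: bool = True     # hypothetical flag set by whoever reviews the trace


@dataclass
class Chain:
    """A question, the ordered manipulation steps, and the final answer."""
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None

    def first_error(self) -> Optional[int]:
        """Return the index of the first step marked wrong, or None."""
        for i, step in enumerate(self.steps):
            if not step.ok:
                return i
        return None


chain = Chain(question="What is written on the sign?")
chain.steps.append(Step("grounding", "the sign", (40, 10, 90, 30)))
chain.steps.append(Step("zoom_in", "box (40, 10, 90, 30)", "image_1", ok=False))
chain.answer = "EXIT"

print(chain.first_error())  # 1
```

Because every intermediate result is kept alongside the manipulation that produced it, a reviewer can locate the step where the reasoning went wrong instead of only seeing a wrong final answer.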

Technical Explanation

The paper outlines a comprehensive approach to implementing the Chain of Manipulations mechanism for Vision-Language Models (VLMs):

Flexible Design of Manipulations: The researchers conducted extensive analysis to define a flexible set of intrinsic manipulations (e.g., grounding, zooming in) that VLMs can perform to solve visual problems step-by-step.
Automated Data Generation Pipeline: The team developed an efficient automated pipeline to generate training data for the Chain of Manipulations approach. Separately, using the same manipulation design, they manually annotated 6,000 high-quality samples for challenging graphical mathematical problems.
Compatible VLM Architecture: The paper introduces a VLM architecture capable of handling multi-turn, multi-image interactions, which is essential for the Chain of Manipulations mechanism.
Versatile Capability Training: The researchers describe a model training process that enables the VLM to develop a range of versatile capabilities for solving various visual problems using the Chain of Manipulations.

The resulting model, called CogCoM, is a 17-billion-parameter VLM equipped with the Chain of Manipulations mechanism. The paper demonstrates that CogCoM achieves state-of-the-art performance across 9 benchmarks spanning 4 different categories, while also preserving interpretability through the step-by-step manipulation process.
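
The link between a zoom-in manipulation and the multi-turn, multi-image architecture can be illustrated with a toy example. The sketch below assumes a zoom-in is implemented as a crop followed by nearest-neighbour upscaling, and that the new image is appended to a running list of images the model sees in later turns. Both choices are assumptions made for illustration; the function `crop_and_zoom`, the `history` list, and the nested-list image format are hypothetical and are not taken from the CogCoM code.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def crop_and_zoom(image: Image, box: Box, factor: int = 2) -> Image:
    """Crop `box` out of `image`, then enlarge it by repeating pixels."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in image[y0:y1]]
    zoomed: Image = []
    for row in cropped:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend([list(wide) for _ in range(factor)])
    return zoomed


image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]

# The image history grows as manipulations produce new images.
history = [image]
history.append(crop_and_zoom(image, (1, 1, 3, 3)))

print(len(history))   # 2
print(history[1][0])  # [1, 1, 2, 2]
```

In this toy setup each manipulation that returns an image adds one entry to `history`, which is why a model using such manipulations needs to accept more than one image across turns.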

Critical Analysis

The paper presents a promising approach to improving the visual reasoning capabilities of Vision-Language Models, but it also acknowledges several caveats and potential areas for further research:

The researchers note that the Chain of Manipulations mechanism may not be suitable for all types of visual problems, and there may be trade-offs between the interpretability gained and the overall task performance.
The paper does not provide a detailed analysis of the computational and memory requirements of the Chain of Manipulations approach, which could be an important consideration for real-world deployment.
The authors suggest that further research is needed to better understand the generalization capabilities of the Chain of Manipulations mechanism and its ability to transfer to new visual domains or tasks.

Additionally, while the paper demonstrates impressive results on the chosen benchmarks, it would be valuable to see more extensive evaluation on a broader range of visual reasoning tasks, including those that may require more nuanced or contextual understanding.

Conclusion

The introduction of the Chain of Manipulations mechanism represents a significant step forward in enhancing the visual reasoning capabilities of Vision-Language Models. By enabling these models to actively manipulate and reason about visual information in a step-by-step manner, the researchers have found a way to improve their performance on complex visual problems while also preserving interpretability.

The practical implementation details and the state-of-the-art results presented in this paper suggest that the Chain of Manipulations approach could have important implications for the development of more capable and trustworthy AI systems that can effectively assist humans in a wide range of visual tasks, from scientific analysis to creative problem-solving.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li

Building models that comprehend videos and respond to specific user instructions is a practical and challenging topic, as it requires mastery of both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem as existing studies train models on massive sparse videos paired with brief descriptions. In this paper, we introduce VidCoM, a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. Specifically, we reveal that the key to responding to specific instructions is focusing on relevant video events, and utilize two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. Thus, an LLM enriched with world knowledge is adopted as the reasoning agent to achieve the responses by performing multiple reasoning steps on specific video events. To address the difficulty of LLMs identifying video events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm. This algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events, thereby enabling LLMs to interact effectively with extended videos. Extensive experiments on two typical video comprehension tasks show that the proposed tuning-free framework outperforms the pre-trained models including Flamingo-80B, to achieve the state-of-the-art performance. Our source code and system will be publicly available.

4/30/2024

cs.CV cs.CL

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed at https://yxxxb.github.io/VoCo-LLaMA-page/.

6/19/2024

cs.CV

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang, Rabiul Awal, Aishwarya Agrawal

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in bag-of-words representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.

4/26/2024

cs.CV

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

Kaifeng Zhang, Zhao-Heng Yin, Weirui Ye, Yang Gao

Defining reward functions for skill learning has been a long-standing challenge in robotics. Recently, vision-language models (VLMs) have shown promise in defining reward signals for teaching robots manipulation skills. However, existing works often provide reward guidance that is too coarse, leading to inefficient learning processes. In this paper, we address this issue by implementing more fine-grained reward guidance. We decompose tasks into simpler sub-tasks, using this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a 5.4× higher average success rate compared to the best baseline, RoboCLIP, across a series of manipulation tasks.

6/4/2024

cs.RO