Benchmarking and Improving Detail Image Caption

2405.19092

Published 6/3/2024 by Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

Benchmarking and Improving Detail Image Caption

Abstract

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) research discusses model's image captioning performance because of the outdated short-caption benchmarks and unreliable evaluation metrics. In this work, we propose to benchmark detail image caption task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements, e.g., objects, attributes and relations from captions, and then matches these elements through three stages, achieving the highest consistency with expert judgements over other rule-based or model-based caption metrics. The proposed benchmark and metric provide reliable evaluation for LVLM's detailed image captioning ability. Guided by this evaluation, we further explore to unleash LVLM's detail caption capabilities by synthesizing high-quality data through a five-stage data construction pipeline. Our pipeline only uses a given LVLM itself and other open-source tools, without any human or GPT-4V annotation in the loop. Experiments show that the proposed data construction strategy significantly improves model-generated detail caption data quality for LVLMs with leading performance, and the data quality can be further improved in a self-looping paradigm. All code and dataset will be publicly available at https://github.com/foundation-multimodal-models/CAPTURE.

Create account to get full access

Overview

• This paper introduces a new benchmark for evaluating the quality of detailed image captions, and proposes improvements to existing image captioning models to better capture fine-grained details.

Plain English Explanation

• The paper focuses on the task of image captioning, which is the process of automatically generating textual descriptions of the contents of an image. While most image captioning models today can produce high-level summaries of images, the authors argue that they often miss important fine-grained details.

• To address this, the paper introduces a new benchmark dataset called Visual Fact Checker that contains images with associated detailed captions covering a wide range of visual attributes and relationships. This dataset is designed to evaluate how well captioning models can describe intricate visual details.

• The paper also proposes several architectural and training modifications to existing image captioning models, such as incorporating external visual knowledge and cultural awareness, in order to improve their ability to generate detailed and accurate image descriptions.

Technical Explanation

• The authors first introduce the Visual Fact Checker dataset, which contains over 200,000 images with associated captions that describe fine-grained visual details. This dataset is used to benchmark the performance of different image captioning models.

• To improve the captioning quality, the paper proposes several model enhancements. This includes integrating retrieval-augmented captioning to incorporate external visual knowledge, and cultural-aware pretraining to improve the model's understanding of visual concepts.

• The authors also introduce a novel distillation-based training approach to transfer knowledge from larger, more capable vision-language models to their captioning model, further boosting its performance on detailed image description.

Critical Analysis

• The authors acknowledge that their proposed approaches rely on access to additional external datasets and model components, which may not always be available in practical deployment scenarios.

• While the Visual Fact Checker benchmark represents an important step forward in evaluating detailed image captioning, the authors note that it may still not capture the full complexity and diversity of real-world visual scenes.

• Future work could explore ways to further improve the generalization capabilities of these captioning models, such as by incorporating more robust multimodal reasoning or leveraging cross-modal retrieval techniques.

Conclusion

• This paper presents a new benchmark for detailed image captioning and proposes several model enhancements to improve the ability of captioning systems to generate accurate and comprehensive descriptions of visual content.

• The work highlights the importance of moving beyond high-level, coarse-grained image captioning towards more fine-grained and contextually-aware visual understanding, which has significant implications for applications like visual fact-checking, culturally-aware image analysis, and other domains that require a deeper understanding of visual scenes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.

6/18/2024

cs.CV

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual--name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can adapt to out-of-domain without requiring additional fine-tuning or re-training. Our experiments conducted on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive to specialist SOTAs that require extensive training.

4/9/2024

cs.CV

The Solution for the CVPR2024 NICE Image Captioning Challenge

Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li

This report introduces a solution to the Topic 1 Zero-shot Image Captioning of 2024 NICE : New frontiers for zero-shot Image Captioning Evaluation. In contrast to NICE 2023 datasets, this challenge involves new annotations by humans with significant differences in caption style and content. Therefore, we enhance image captions effectively through retrieval augmentation and caption grading methods. At the data level, we utilize high-quality captions generated by image caption models as training data to address the gap in text styles. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. Subsequently, we propose caption-level strategy for the high-quality caption data generated by the image caption models and integrate them with retrieval augmentation strategy into the template to compel the model to generate higher quality, more matching, and semantically enriched captions based on the retrieval augmentation prompts. Our approach achieves a CIDEr score of 234.11.

4/30/2024

cs.CV

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.

5/1/2024

cs.CV