DOCCI: Descriptions of Connected and Contrasting Images

Read original: arXiv:2404.19753 - Published 5/1/2024 by Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer and 2 others

👁️

Overview

This paper introduces a new dataset called Descriptions of Connected and Contrasting Images (DOCCI), which contains long, detailed descriptions for 15,000 images.
The authors argue that current vision-language datasets lack the level of detail needed to train models for more advanced tasks like understanding spatial relationships, counting objects, and applying world knowledge.
DOCCI was created by a single researcher to capture these challenging aspects, with human annotators providing comprehensive descriptions averaging 136 words per image.
The authors demonstrate that DOCCI can be an effective training resource for image-to-text generation, with a PaLI 5B model performing competitively against larger models.
They also show that DOCCI can be used as a testbed to expose the limitations of current text-to-image generation models in capturing long, detailed descriptions.

Plain English Explanation

Researchers use datasets of images and their textual descriptions to train artificial intelligence (AI) models that can generate captions or understand images. However, the authors argue that existing datasets lack the level of detail needed to create truly capable models.

To address this, the researchers created a new dataset called DOCCI, which contains 15,000 images with very detailed, human-written descriptions. These descriptions are longer and more comprehensive than typical captions, covering a wide range of challenges like spatial relationships, counting, and applying real-world knowledge.

The researchers show that a relatively small AI model can be trained on DOCCI to perform well on image-to-text tasks, matching or even outperforming much larger models. This suggests that the detailed descriptions in DOCCI provide a valuable training resource.

Additionally, the authors use DOCCI to test the capabilities of text-to-image generation models, finding that current systems struggle to capture the full depth and nuance of the long, complex descriptions. This highlights an important limitation in the state of the art for text-to-image AI.

Overall, the DOCCI dataset represents an important contribution to the field of vision-language AI, offering a new benchmark to push the boundaries of what these models can accomplish.

Technical Explanation

The paper introduces the Descriptions of Connected and Contrasting Images (DOCCI) dataset, which contains 15,000 images with human-written descriptions that are significantly longer and more detailed than typical image captions.

The authors argue that existing vision-language datasets, such as COCO and Flickr30k, lack the level of detail needed to train models for advanced tasks that require understanding spatial relationships, counting, text rendering, and the application of world knowledge.

To address this gap, the researchers instructed human annotators to create comprehensive descriptions for each image in DOCCI, with the goal of clearly distinguishing each image from related or similar ones. The resulting descriptions average 136 words in length and are highly compositional, typically encompassing multiple challenges.

The authors demonstrate the value of DOCCI through both quantitative and qualitative analyses. They show that a PaLI 5B model finetuned on DOCCI performs equally or better than larger models like LLaVA-1.5 7B and InstructBLIP 7B on image-to-text generation tasks.

Furthermore, the researchers use DOCCI as a testbed to highlight the limitations of current text-to-image generation models. They find that these models struggle to capture the full depth and nuance of the long, detailed descriptions in the dataset, suggesting an important area for future advancement in the field of vision-language AI.

Critical Analysis

The DOCCI dataset represents a valuable contribution to the field of vision-language AI, as the authors have identified an important gap in existing datasets and sought to address it.

One potential limitation of the dataset is that it was curated and annotated by a single researcher, which could introduce biases or inconsistencies. It would be interesting to see if the dataset could be expanded or replicated by other researchers to validate the findings.

Additionally, while the authors demonstrate the usefulness of DOCCI for image-to-text generation, they do not delve deeply into its potential applications for other vision-language tasks, such as visual question answering or multimodal reasoning. Further research in these areas could help to fully realize the dataset's potential.

The authors also acknowledge that the current text-to-image models struggle to capture the level of detail and nuance in the DOCCI descriptions, which highlights an important limitation in the state of the art. Addressing this challenge could be a fruitful area for future research, potentially leading to more advanced and capable text-to-image generation systems.

Overall, the DOCCI dataset represents an exciting development in the field of vision-language AI, offering a new benchmark to push the boundaries of what these models can accomplish.

Conclusion

The Descriptions of Connected and Contrasting Images (DOCCI) dataset introduced in this paper fills an important gap in existing vision-language datasets by providing long, detailed descriptions that capture a wide range of visual challenges.

The authors demonstrate that DOCCI can be an effective training resource for image-to-text generation, with a relatively small model performing competitively against larger, more powerful systems. Additionally, they use DOCCI to expose the limitations of current text-to-image generation models, highlighting an important area for future research and development.

Overall, the DOCCI dataset represents a valuable contribution to the field of vision-language AI, offering a new benchmark to advance the state of the art and push the boundaries of what these models can achieve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

👁️

DOCCI: Descriptions of Connected and Contrasting Images

Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.

5/1/2024

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.

6/18/2024

Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang

Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, which maximally convert the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verifies the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquire improved capability to generate richer image descriptions, substantially increasing the length and detail of their output with less hallucination.

6/12/2024

ImageInWords: Unlocking Hyper-Detailed Image Descriptions

Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut

Despite the longstanding adage an image is worth a thousand words, creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.

5/7/2024