HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Read original: arXiv:2407.18589 - Published 7/29/2024 by Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen

Overview

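A new reference-free, hierarchical metric for evaluating image captions, called HICEScore (HICE-S)
Addresses limitations of reference-based metrics like BLEU, METEOR, and CIDEr, and of CLIP-based reference-free metrics like CLIP-S and PAC-S
Scores captions at more than one scale by detecting local visual regions and textual phrases, so it can flag mistaken phrases and image regions the caption leaves undescribed

Plain English Explanation

The paper presents a new way to evaluate the quality of image captions, called HICEScore. Reference-based metrics like BLEU, METEOR, and CIDEr compare a caption against a small set of human-written references, so they struggle with the long, detailed captions that multimodal large language models now produce. Reference-free metrics built on CLIP avoid that dependence, but they judge the image and the caption as a whole, which makes them insensitive to small visual objects and weak at detecting locally hallucinated details.

HICEScore takes a hierarchical approach: in addition to the overall image-caption match, it compares individual regions of the image with individual phrases of the caption. This lets it point to which part of a caption is wrong and which parts of the image were never described, rather than returning a single opaque number.

Technical Explanation

According to the paper's abstract, HICE-S is a reference-free metric with three main characteristics:

Local detection: Detects local visual regions in the image and textual phrases in the caption, instead of treating the image and the caption each as a single unit.
Hierarchical scoring: Builds a multi-scale scoring mechanism on top of these local detections, in contrast to the single-scale, global image-text compatibility used by CLIP-based metrics.
Interpretability: Can pinpoint the position of caption mistakes, such as hallucinated phrases, and identify visual regions that have not been described.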
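
To make the idea of a hierarchical, reference-free score more concrete, here is a minimal sketch that combines a global image-caption similarity with local region-phrase matching. It is not the authors' implementation: the function names, the precision/recall aggregation, the harmonic mean, and the `alpha` weight are all assumptions chosen for illustration, and the toy vectors stand in for embeddings that a model such as CLIP would produce.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_score(
    image_vec: Vector,
    caption_vec: Vector,
    region_vecs: List[Vector],
    phrase_vecs: List[Vector],
    alpha: float = 0.5,  # hypothetical weight between global and local levels
) -> Dict[str, object]:
    """Illustrative two-level score: global similarity plus local region-phrase matching."""
    global_sim = max(cosine(image_vec, caption_vec), 0.0)

    if not region_vecs or not phrase_vecs:
        # Nothing to match locally, so fall back to the global level only.
        return {"global": global_sim, "precision": 0.0, "recall": 0.0,
                "phrase_support": [], "region_coverage": [],
                "score": alpha * global_sim}

    # sim[i][j] = similarity between phrase i and region j, clipped at zero.
    sim = [[max(cosine(p, r), 0.0) for r in region_vecs] for p in phrase_vecs]

    # How well each phrase is supported by its best-matching region
    # (a low value suggests a hallucinated phrase).
    phrase_support = [max(row) for row in sim]
    # How well each region is covered by its best-matching phrase
    # (a low value suggests an undescribed region).
    region_coverage = [max(sim[i][j] for i in range(len(phrase_vecs)))
                       for j in range(len(region_vecs))]

    precision = sum(phrase_support) / len(phrase_support)
    recall = sum(region_coverage) / len(region_coverage)
    local = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)

    return {"global": global_sim, "precision": precision, "recall": recall,
            "phrase_support": phrase_support, "region_coverage": region_coverage,
            "score": alpha * global_sim + (1 - alpha) * local}


# Toy 3-d "embeddings": axis 0 = dog, axis 1 = ball, axis 2 = frisbee.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # the image shows a dog and a ball
faithful = hierarchical_score([1, 1, 0], [1, 1, 0], regions,
                              [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
hallucinated = hierarchical_score([1, 1, 0], [1, 0, 1], regions,
                                  [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(faithful["score"], hallucinated["score"])
print(hallucinated["phrase_support"], hallucinated["region_coverage"])
```

In this toy setup the per-phrase and per-region lists are what make the score inspectable: the unsupported "frisbee" phrase and the uncovered "ball" region show up as low entries instead of being folded into a single number.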

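The authors report experiments on several benchmarks in which HICEScore achieves state-of-the-art performance, outperforming existing reference-free metrics like CLIP-S and PAC-S as well as reference-based metrics like METEOR and CIDEr. Case studies on detailed captions further suggest that its assessment process closely resembles interpretable human judgments.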
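
Agreement between a metric and human judgments is usually quantified with a rank correlation. The summary does not say which statistic the authors use, so the sketch below assumes Kendall's tau-b, a common choice for this kind of comparison. The function name and all numbers are made up for illustration and are not results from the paper.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b rank correlation; returns NaN when either input is constant."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x = ties_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (possibly also tied in y)
        if dy == 0:
            ties_y += 1  # pair tied in y (possibly also tied in x)
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    denom = math.sqrt((total - ties_x) * (total - ties_y))
    return (concordant - discordant) / denom if denom else math.nan


# Hypothetical human ratings for five captions, and two hypothetical metrics.
human_ratings = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # ranks the captions exactly like the humans
metric_b = [0.5, 0.1, 0.4, 0.2, 0.3]  # mostly disagrees with the humans

tau_a = kendall_tau_b(human_ratings, metric_a)
tau_b = kendall_tau_b(human_ratings, metric_b)
print(f"metric_a tau: {tau_a:.2f}, metric_b tau: {tau_b:.2f}")
```

A metric whose tau is closer to 1 orders captions more like the human raters do, which is the sense in which one metric is said to correlate better with human judgment than another.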

Critical Analysis

The HICEScore metric addresses important limitations of existing evaluation approaches, particularly for long and detailed captions. By considering multiple aspects of caption quality, it provides a more nuanced and comprehensive assessment.

However, the metric may still have limitations. For example, it may not fully capture the creative or emotional aspects of captions, and its performance could depend on the specific models and datasets used for the underlying evaluations.

Additionally, the hierarchical nature of the metric adds complexity, and there may be tradeoffs or dependencies between its components that are not fully explored.

Conclusion

The HICEScore metric represents an important step forward in image captioning evaluation, providing a more holistic and nuanced assessment of caption quality. As image captioning systems continue to advance, robust and comprehensive evaluation tools like this will be crucial for driving progress and ensuring the development of high-quality, meaningful descriptions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen

Image captioning evaluation metrics can be divided into two categories, reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their solution of global image-text compatibility, often have a deficiency in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, breaking through the barriers of the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics like CLIP-S and PAC-S, and reference-based metrics like METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments. Our code is available at https://github.com/joeyz0z/HICE.

7/29/2024

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

7/31/2024

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann

The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers. We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human judgements. Instead, we propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements across 5 image captioning datasets, showing there is a lot of room for improvement by leveraging a diverse set of metrics.

8/12/2024

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

Yebin Lee, Imseong Park, Myungjoo Kang

Most existing image captioning evaluation metrics focus on assigning a single numerical score to a caption by comparing it with reference captions. However, these methods do not provide an explanation for the assigned score. Moreover, reference captions are expensive to acquire. In this paper, we propose FLEUR, an explainable reference-free metric to introduce explainability into image captioning evaluation metrics. By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions, and provide the explanation for the assigned score. We introduce score smoothing to align as closely as possible with human judgment and to be robust to user-defined grading criteria. FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks and reaches state-of-the-art results on Flickr8k-CF, COMPOSITE, and Pascal-50S within the domain of reference-free evaluation metrics. Our source code and results are publicly available at: https://github.com/Yebin46/FLEUR.

6/11/2024