ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Read original: arXiv:2403.01306 - Published 6/12/2024 by Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes

Overview

This paper proposes image caption concreteness (ICC), a metric that scores how concrete and relevant a caption is for multimodal learning using the caption text alone, without an image reference.
ICC relies on strong foundation models to measure visual-semantic information loss in multimodal representations.
The authors report that ICC correlates strongly with human judgments of concreteness, and that curating web-scale datasets with ICC supports efficient training in resource-constrained settings.

Plain English Explanation

Multimodal models are increasingly trained on very large collections of text-image pairs gathered from the web, and those collections are highly noisy. Standard data filtering does a good job of removing pairs where the text and the image do not match. It still lets through captions that are semantically related to the image but highly abstract or subjective, and such captions give a model little to learn from.

The paper's answer is a score for how concrete a caption is. As an illustration of the distinction (not an example taken from the paper), "a red bicycle leaning against a brick wall" describes something that can be seen, while "memories of a perfect day" could accompany almost any photo. ICC is designed to rank the first kind of caption above the second, and it does so by looking only at the text.

With that score in hand, a dataset builder can keep the most concrete samples, which the authors argue provide the strongest learning signal in a noisy dataset. This matters most when compute or data budgets are limited and every training sample has to count.

Technical Explanation

The paper starts from a gap in existing filtering approaches: they lack the fine-grained ability to isolate the most concrete samples in a noisy text-image dataset.

According to the abstract, the proposed metric has the following properties:

It evaluates caption text without an image reference
It measures both concreteness and relevancy for use in multimodal learning
It leverages strong foundation models to measure visual-semantic information loss in multimodal representations

The authors evaluate ICC in two ways:

Agreement with human evaluation of concreteness, for both single-word and sentence-level texts
Dataset curation, where ICC is used to select the highest-quality samples from multimodal web-scale datasets

The abstract reports a strong correlation with human judgments and states that ICC-based curation complements existing filtering approaches. It does not name the specific foundation models or describe the scoring pipeline in detail, so readers should consult the paper for those specifics.

Critical Analysis

The paper targets a real gap. Filters that check whether text and image match cannot tell a concrete caption from a vague but loosely related one, and a text-only concreteness score addresses exactly that case.

Because ICC does not look at the image, it cannot by itself detect a caption that is concrete but attached to the wrong image. This is consistent with the authors presenting ICC as a complement to existing filtering approaches rather than a replacement for them.

The metric depends on foundation models, so its scores may inherit whatever blind spots those models have. The abstract does not discuss this, nor does it quantify the training gains from ICC-based curation. Both are worth checking in the full paper before adopting the metric.

Conclusion

ICC gives dataset builders a way to score caption concreteness from text alone. The authors report that the score correlates strongly with human evaluation of concreteness at both the word and sentence level.

Used alongside existing filters, ICC selects the highest-quality samples from noisy web-scale text-image datasets, which allows for efficient training in resource-constrained settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes

Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.

6/12/2024

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

7/31/2024

Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations

Xu Zhang, Zhedong Zheng, Linchao Zhu, Yi Yang

Composed image retrieval extends content-based image retrieval systems by enabling users to search using reference images and captions that describe their intention. Despite great progress in developing image-text compositors to extract discriminative visual-linguistic features, we identify a hitherto overlooked issue, triplet ambiguity, which impedes robust feature extraction. Triplet ambiguity refers to a type of semantic ambiguity that arises between the reference image, the relative caption, and the target image. It is mainly due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair (i.e., a reference image + a relative caption). To address this challenge, we propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings, fostering complementary feature extraction and mitigating dependence on any single, potentially biased compositor; (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions to promote consensual outputs. During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing overall agreement. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its competitiveness in addressing the fundamental limitations of existing methods.

9/4/2024

The Solution for the CVPR2023 NICE Image Captioning Challenge

Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu

In this paper, we present our solution to the New frontiers for Zero-shot Image Captioning Challenge. Different from the traditional image captioning datasets, this challenge includes a larger variety of new visual concepts from many domains (such as COVID-19) as well as various image types (photographs, illustrations, graphics). At the data level, we collect external training data from LAION-5B, a large-scale CLIP-filtered image-text dataset. At the model level, we use OFA, a large-scale visual-language pre-training model based on handcrafted templates, to perform the image captioning task. In addition, we introduce contrastive learning to align image-text pairs to learn new visual concepts in the pre-training stage. Then, we propose a similarity-bucket strategy and incorporate this strategy into the template to force the model to generate higher quality and more matching captions. Finally, using a retrieval-augmented strategy, we construct a content-rich template, containing the most relevant top-k captions from other image-text pairs, to guide the model in generating semantic-rich captions. Our method ranks first on the leaderboard, achieving CIDEr scores of 105.17 and 325.72 in the validation and test phases, respectively.

7/8/2024