SPAN: Learning Similarity between Scene Graphs and Images with Transformers

2304.00590

Published 5/21/2024 by Yuren Cong, Wentong Liao, Bodo Rosenhahn, Michael Ying Yang


Abstract

Learning similarity between scene graphs and images aims to estimate a similarity score given a scene graph and an image. There is currently no research dedicated to this task, although it is critical for scene graph generation and downstream applications. Scene graph generation is conventionally evaluated by Recall$@K$ and mean Recall$@K$, which measure the ratio of predicted triplets that appear in the human-labeled triplet set. However, such triplet-oriented metrics fail to demonstrate the overall semantic difference between a scene graph and an image and are sensitive to annotation bias and noise. Using generated scene graphs in the downstream applications is therefore limited. To address this issue, for the first time, we propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images. Our novel framework consists of a graph Transformer and an image Transformer to align scene graphs and their corresponding images in the shared latent space. We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings. Based on our framework, we propose R-Precision measuring image retrieval accuracy as a new evaluation metric for scene graph generation. We establish new benchmarks on the Visual Genome and Open Images datasets. Extensive experiments are conducted to verify the effectiveness of SPAN, which shows great potential as a scene graph encoder.


Overview

This paper proposes a novel framework called SPAN (Scene graPh-imAge coNtrastive learning) to measure the similarity between scene graphs and images.
Existing metrics for evaluating scene graph generation, such as Recall@K, fail to capture the overall semantic difference between a scene graph and an image.
SPAN aligns scene graphs and their corresponding images in a shared latent space using a graph Transformer and an image Transformer.
The paper introduces a new graph serialization technique and a new evaluation metric called R-Precision for image retrieval accuracy.
Experiments on the Visual Genome and Open Images datasets demonstrate the effectiveness of SPAN as a scene graph encoder.

Plain English Explanation

The paper addresses a problem in computer vision, where researchers try to understand the relationships between objects in images. A scene graph describes an image as a set of objects (nodes) connected by relationships (edges), and producing such a graph from an image is known as "scene graph generation." It matters for applications such as image retrieval and zero-shot referring expression comprehension.

The traditional way of evaluating scene graph generation is to look at how many of the predicted relationships (called "triplets") match the ground truth, as labeled by humans. However, this metric doesn't capture the overall similarity between the generated scene graph and the actual image. It's also sensitive to errors and biases in the human annotations.
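To make the triplet-matching idea concrete, here is a minimal sketch of Recall@K for a single image. It assumes the convention commonly used in scene graph generation, where the score is the fraction of ground-truth triplets found among the top-K ranked predictions; the function name and the example triplets are illustrative and not taken from the paper.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


# Hypothetical annotations: (subject, predicate, object) triplets.
ground_truth = {
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
    ("man", "wearing", "hat"),
}

# Hypothetical model output, ranked from most to least confident.
predictions = [
    ("man", "on", "horse"),      # plausible, but not in the annotations
    ("horse", "on", "grass"),
    ("man", "wearing", "hat"),
    ("man", "riding", "horse"),
]

print(recall_at_k(predictions, ground_truth, 2))
print(recall_at_k(predictions, ground_truth, 4))
```

Note that ("man", "on", "horse") is counted as a miss even though it describes the image reasonably well. That is the kind of annotation sensitivity the paragraph above refers to.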

To address this issue, the researchers propose a new framework called SPAN. SPAN uses machine learning techniques to align the scene graphs and their corresponding images in a shared "latent space." This means that similar scene graphs and images will be represented by similar vectors, or points, in this space.
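As a rough illustration of what "similar vectors in a shared space" means, the sketch below scores a scene graph embedding against two image embeddings. Cosine similarity is an assumption here (it is a common choice in contrastive frameworks; this summary does not state which similarity function SPAN uses), and the three-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up embeddings standing in for encoder outputs.
graph_embedding = [0.9, 0.1, 0.4]
matching_image = [0.8, 0.2, 0.5]
unrelated_image = [-0.3, 0.9, -0.2]

score_match = cosine_similarity(graph_embedding, matching_image)
score_other = cosine_similarity(graph_embedding, unrelated_image)
print(score_match > score_other)
```

In a trained model, the matching image would be expected to score higher than unrelated ones, which is what makes the score usable for ranking.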

The key innovations in SPAN are:

A new way of transforming a scene graph into a sequence of tokens, which preserves the structural information
A new evaluation metric called R-Precision, which measures how well the system can retrieve the right image given a scene graph
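The first innovation can be pictured with a small sketch. This is a hypothetical serialization scheme, not the one from the paper (whose details are not given in this summary): each triplet is flattened into three tokens, and every token carries a simple structural encoding (its role, its node id, and its triplet index) so that repeated mentions of the same object can still be tied back to one node.

```python
def serialize_scene_graph(entities, relations):
    """Flatten a scene graph into a token sequence with structural encodings.

    entities:  list of object labels; the list index is the node id.
    relations: list of (subject_id, predicate, object_id) tuples.
    """
    sequence = []
    for triplet_index, (subj, predicate, obj) in enumerate(relations):
        sequence.append({"token": entities[subj], "role": "subject",
                         "node": subj, "triplet": triplet_index})
        sequence.append({"token": predicate, "role": "predicate",
                         "node": None, "triplet": triplet_index})
        sequence.append({"token": entities[obj], "role": "object",
                         "node": obj, "triplet": triplet_index})
    return sequence


# Hypothetical scene graph: a man riding a horse and wearing a hat.
entities = ["man", "horse", "hat"]
relations = [(0, "riding", 1), (0, "wearing", 2)]

sequence = serialize_scene_graph(entities, relations)
print([item["token"] for item in sequence])
```

The two "man" tokens share node id 0, so a sequence model could in principle recover that both relationships involve the same person. A plain string of words would lose that information.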

The researchers establish new benchmarks on the Visual Genome and Open Images datasets and report experiments that support SPAN's effectiveness, suggesting it could serve as a useful scene graph encoder for scene graph generation and related applications.

Technical Explanation

The paper proposes a novel framework called SPAN (Scene graPh-imAge coNtrastive learning) to measure the similarity between scene graphs and images. Scene graph generation is typically evaluated using Recall@K and mean Recall@K, which measure the ratio of predicted triplets that appear in the human-labeled triplet set. However, these triplet-oriented metrics fail to capture the overall semantic difference between a scene graph and an image, and they are sensitive to annotation bias and noise.
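The difference between the two conventional metrics can be shown with a short sketch. It assumes the usual definition of mean Recall@K, in which recall is computed separately for each predicate class and then averaged; in practice the per-predicate counts are typically accumulated over a whole dataset, whereas a single image is used here for brevity. All names and triplets are illustrative.

```python
from collections import defaultdict


def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return sum(1 for t in ground_truth if t in top_k) / len(ground_truth)


def mean_recall_at_k(ranked_predictions, ground_truth, k):
    """Average of per-predicate recalls, so rare predicates weigh as much as common ones."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:
        return 0.0
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Hypothetical annotations: three common "on" triplets and one rarer "riding" triplet.
ground_truth = [
    ("man", "on", "horse"),
    ("cup", "on", "table"),
    ("dog", "on", "sofa"),
    ("man", "riding", "bike"),
]
# Hypothetical predictions that only ever use the common predicate.
predictions = [("man", "on", "horse"), ("cup", "on", "table"), ("dog", "on", "sofa")]

print(recall_at_k(predictions, ground_truth, 3))
print(mean_recall_at_k(predictions, ground_truth, 3))
```

Here Recall@3 is 0.75 while mean Recall@3 is 0.5, because the missed "riding" class counts as much as the well-covered "on" class. Both numbers still depend entirely on exact matches against the annotated triplets.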

To address this issue, the SPAN framework consists of a graph Transformer and an image Transformer that align scene graphs and their corresponding images in a shared latent space. The paper introduces a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings, preserving the graph structure. This sequence is then fed into the graph Transformer, while the image is processed by the image Transformer.
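The alignment step can be sketched with a symmetric contrastive (InfoNCE-style) objective of the kind popularized by CLIP. This is an assumption for illustration: the summary only says the framework is contrastive and does not specify the loss, the temperature, or the batch construction. The input is a matrix of graph-to-image similarities in which matched pairs sit on the diagonal.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(similarity, temperature=0.07):
    """similarity[i][j] scores scene graph i against image j; pair (i, i) is the match.

    The temperature value is an illustrative default, not taken from the paper.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    # Each graph should pick out its own image, and each image its own graph.
    graph_to_image = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    image_to_graph = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (graph_to_image + image_to_graph)


# Made-up similarity matrices for a batch of two graph-image pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]      # matched pairs score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]   # mismatched pairs score highest

print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(misaligned))
```

Minimizing a loss of this shape pulls each scene graph toward its own image and pushes it away from the other images in the batch, which is one way the two encoders could end up sharing a latent space.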

The authors also propose a new evaluation metric called R-Precision, which measures the image retrieval accuracy based on the scene graph. This metric aims to better reflect the overall semantic similarity between the scene graph and the image.
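A minimal sketch of a retrieval-based score follows. It uses the general information-retrieval definition of R-Precision (precision among the top R results, where R is the number of relevant items) and assumes each scene graph has exactly one matching image, so the score reduces to how often the true image is ranked first. The paper's exact protocol, such as the size of the candidate pool, is not described in this summary, and the similarity values are made up.

```python
def r_precision(scores, relevant):
    """Precision among the top-R candidates, where R = number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sum(1 for j in ranked[:r] if j in relevant) / r


def mean_r_precision(similarity):
    """Average R-Precision, assuming image i is the single match for scene graph i."""
    if not similarity:
        return 0.0
    return sum(r_precision(row, {i}) for i, row in enumerate(similarity)) / len(similarity)


# Made-up scores: row i holds scene graph i's similarity to each of three images.
similarity = [
    [0.9, 0.2, 0.1],  # graph 0 ranks its own image first
    [0.3, 0.4, 0.8],  # graph 1 ranks image 2 above its own image
    [0.1, 0.2, 0.7],  # graph 2 ranks its own image first
]

print(mean_r_precision(similarity))
```

Two of the three graphs retrieve their own image, giving a score of 2/3. Unlike triplet matching, this score depends on whether the graph as a whole points to the right image, not on exact agreement with each annotated triplet.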

The effectiveness of SPAN is evaluated on the Visual Genome and Open Images datasets, where the authors establish new benchmarks. The reported experiments support the framework's effectiveness and suggest its potential as a scene graph encoder.

Critical Analysis

The paper introduces a novel and well-designed framework to address the limitations of existing scene graph generation evaluation metrics. The use of contrastive learning to align scene graphs and images in a shared latent space is a promising approach, and the proposed graph serialization technique is an interesting contribution.

However, the paper does not discuss the computational complexity and training time of the SPAN framework, which could be an important consideration for real-world applications. Additionally, the authors could have explored the robustness of SPAN to different types of scene graph noise or annotation errors.

Furthermore, the paper focuses on the task of measuring similarity between scene graphs and images. Beyond using image retrieval for evaluation, it does not directly address practical applications of this technology, such as deployed image retrieval or 3D image matching. Exploring these use cases and their potential impact could have strengthened the paper's contribution.

Overall, the SPAN framework represents a significant step forward in scene graph generation evaluation, and the authors' insights and methodological innovations are valuable contributions to the field of computer vision.

Conclusion

This paper proposes a novel framework called SPAN (Scene graPh-imAge coNtrastive learning) to measure the similarity between scene graphs and images. The key innovations include a graph serialization technique and a new evaluation metric called R-Precision, which aim to better capture the overall semantic difference between a scene graph and an image.

The experimental results demonstrate the effectiveness of SPAN as a scene graph encoder, suggesting its potential for improving scene graph generation and enabling new applications such as image retrieval and zero-shot referring expression comprehension. The paper's contributions represent an important step forward in addressing the limitations of existing scene graph evaluation metrics and expanding the capabilities of computer vision systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen

Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.

5/27/2024

cs.CV


Semantic-embedded Similarity Prototype for Scene Recognition

Chuanxin Song, Hanbo Wu, Xin Ma, Yibin Li

Due to the high inter-class similarity caused by the complex composition and the co-existing objects across scenes, numerous studies have explored object semantic knowledge within scenes to improve scene recognition. However, a resulting challenge emerges as object information extraction techniques require heavy computational costs, thereby burdening the network considerably. This limitation often renders object-assisted approaches incompatible with edge devices in practical deployment. In contrast, this paper proposes a semantic knowledge-based similarity prototype, which can help the scene recognition network achieve superior accuracy without increasing the computational cost in practice. It is simple and can be plug-and-played into existing pipelines. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are used to explore correlations between scene classes, ultimately constructing a similarity prototype. Furthermore, we propose to leverage the similarity prototype to support network training from the perspective of Gradient Label Softening and Batch-level Contrastive Loss, respectively. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks, all while avoiding any additional computational burden in practical deployments. Code and the statistical similarity prototype will be available at https://github.com/ChuanxinSong/SimilarityPrototype

5/21/2024

cs.CV

Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

Senran Fan, Zhicheng Bao, Chen Dong, Haotai Liang, Xiaodong Xu, Ping Zhang

Semantic communication, as a revolutionary communication architecture, is considered a promising novel communication paradigm. Unlike traditional symbol-based error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity evaluation metrics, whether pixel-based MSE or PSNR or structure-based MS-SSIM, struggle to accurately measure the loss of semantic-level information of the source during system transmission. This presents challenges in evaluating the performance of visual semantic communication systems, especially when comparing them with traditional communication systems. To address this, we propose a semantic evaluation metric, SeSS (Semantic Similarity Score), based on Scene Graph Generation and graph matching, which shifts the similarity scores between images into semantic-level graph matching scores. Meanwhile, semantic similarity scores for tens of thousands of image pairs are manually annotated to fine-tune the hyperparameters in the graph matching algorithm, aligning the metric more closely with human semantic perception. The performance of the SeSS is tested on different datasets, including (1) images transmitted by traditional and semantic communication systems at different compression rates, (2) images transmitted by traditional and semantic communication systems at different signal-to-noise ratios, (3) images generated by large-scale model with different noise levels introduced, and (4) cases of images subjected to certain special transformations. The experiments demonstrate the effectiveness of SeSS, indicating that the metric can measure differences in the semantic-level information of images and can be used for evaluation in visual semantic communication systems.

6/7/2024

cs.CV cs.AI

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang

Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.

4/10/2024

cs.CV