H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model

2403.20213

Published 4/1/2024 by Chao Pang, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Xingxing Weng, Shuai Wang, Litong Feng, Gui-Song Xia and 1 other

cs.CV

H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model

Abstract

The generic large Vision-Language Models (VLMs) is rapidly developing, but still perform poorly in Remote Sensing (RS) domain, which is due to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing specific Vision Language Models (RSVLMs) still have considerable potential for improvement, primarily owing to the lack of large-scale, high-quality RS vision-language datasets. We constructed HqDC-1.4M, the large scale High quality and Detailed Captions for RS images, containing 1.4 million image-caption pairs, which not only enhance the RSVLM's understanding of RS images but also significantly improve the model's spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable hallucination problem in RSVLM, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness and reduces the hallucinations of the model's outputs, thereby enhancing the honesty of the RSVLM. Based on these datasets, we proposed the H2RSVLM, the Helpful and Honest Remote Sensing Vision Language Model. H2RSVLM has achieved outstanding performance on multiple RS public datasets and is capable of recognizing and refusing to answer the unanswerable questions, effectively mitigating the incorrect generations. We will release the code, data and model weights at https://github.com/opendatalab/H2RSVLM .

Get summaries of the top AI research delivered straight to your inbox:

Introduction

Figure 1: Examples of large-scale RS vision-language datasets. RS5M [70] consists of two subsets: RS5M-RS3 and RS5M-PUB11. (a) HqDC-1.4M (ours) captions provide detailed descriptions, including moving objects, rapidly changing information (i.e. the seasonal feature in vegetation) and rich attributes of objects. (b) RS5M-RS3’s captions are short and lack detail. (c) RS5M-PUB11’s images are not typical RS images. The captions are short, repetitive texts, often containing redundance (such as “two houses” in the figure). (d) SkyScript [52] captions are from OpenStreetMap, therefore lacking moving objects and seasonal sensitive information; the descriptions are relatively short. (a) and (b) share the same image. We selected images of similar scene for (c) and (d).

Figure 2: When encountering unanswerable questions, the prior VLMs are probably to produce erroneous outputs, known as “hallucination”. Our H2RSVLM possesses strong honesty and can refuse to answer unanswerable questions.

Related Work

Helpful and Honest Datasets for RSVLMs

H2RSVLM

Figure 5: Training stages of H2RSVLM.

Experiment

Figure 6: Conversations between user and H2RSVLM.

Conclusion

Appendix 0.A HqDC-1.4M Dataset

Figure 7: Examples of HqDC-1.4M. Black sentences represent descriptions that are entirely accurate, green sentences represent descriptions that are partially accurate, and red sentences represent descriptions that are entirely incorrect.

Appendix 0.B HqDC-Instruct Dataset

Figure 8: Examples of HqDC-Instruct Dataset.

Appendix 0.C RSSA Dataset

Appendix 0.D RS-Specialized-Instruct Dataset

Appendix 0.E RS-ClsQaGrd-Instruct Dataset

Appendix 0.F Evaluated Test Datasets

Appendix 0.G More Results of H2RSVLM

Figure 9: Examples of the results from multi-label landcover classification task.

Figure 10: Examples of the results from image ground sampling distance estimation task.

Figure 11: Examples of the results from counting task.

Figure 12: Examples of the results from image type identification task.

Figure 13: Examples of the results from image type identification task.

Figure 14: Examples of the results from visual grounding task.The red box represents the predicted result, while the blue box represents the ground truth.

Figure 15: Examples of the results from building footprint vectorization task. The red polygon represents the predicted result, while the blue polygon represents the ground truth.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

⛏️

Vision-Language Models in Remote Sensing: Current Progress and Future Trends

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, Xiao Xiang Zhu

The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide intelligent solutions close to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in remote sensing primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of the image. This makes them better suited for tasks requiring visual and textual understanding, such as image captioning, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting challenges, and identifying potential research opportunities.

4/3/2024

cs.CV cs.AI

🖼️

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas

Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries, retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face the challenges, such as vocabulary mismatch and the semantic gap, constraining their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.

4/30/2024

cs.MM cs.CV

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

4/10/2024

cs.CV

Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

Mamadou Keita, Wassim Hamidouche, Hassen Bougueffa, Abdenour Hadid, Abdelmalik Taleb-Ahmed

In recent years, the emergence of models capable of generating images from text has attracted considerable interest, offering the possibility of creating realistic images from text descriptions. Yet these advances have also raised concerns about the potential misuse of these images, including the creation of misleading content such as fake news and propaganda. This study investigates the effectiveness of using advanced vision-language models (VLMs) for synthetic image identification. Specifically, the focus is on tuning state-of-the-art image captioning models for synthetic image detection. By harnessing the robust understanding capabilities of large VLMs, the aim is to distinguish authentic images from synthetic images produced by diffusion-based models. This study contributes to the advancement of synthetic image detection by exploiting the capabilities of visual language models such as BLIP-2 and ViTGPT2. By tailoring image captioning models, we address the challenges associated with the potential misuse of synthetic images in real-world applications. Results described in this paper highlight the promising role of VLMs in the field of synthetic image detection, outperforming conventional image-based detection techniques. Code and models can be found at https://github.com/Mamadou-Keita/VLM-DETECT.

4/4/2024

cs.CV cs.CR cs.LG