WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar

2403.12686

Published 4/8/2024 by Runwei Guan, Liye Jia, Fengyufan Yang, Shanliang Yao, Erick Purwanto, Xiaohui Zhu, Eng Gee Lim, Jeremy Smith, Ka Lok Man, Xuming Hu and 1 other

cs.CV cs.MM cs.RO

Abstract

The perception of waterways based on human intent is significant for autonomous navigation and operations of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception based on human prompts. WaterVG encompasses prompts describing multiple targets, with annotations at the instance level including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics. This text-guided, two-sensor pattern gives the text prompts a finer granularity, covering both visual and radar features of the referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, which is a multi-task model with a well-designed Phased Heterogeneous Modality Fusion (PHMF) mode, including Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the required radar features to fuse with vision for prompt alignment. MHSCA is an efficient fusion module with a remarkably small parameter count and FLOPs, elegantly fusing scenario context captured by two sensors with linguistic features, which performs impressively on visual grounding tasks. Comprehensive experiments and evaluations have been conducted on WaterVG, where our Potamoi achieves state-of-the-art performance compared with counterparts.

Overview

The paper introduces WaterVG, a visual grounding dataset for waterway perception on Unmanned Surface Vehicles (USVs), and Potamoi, a low-power multi-task model that grounds text prompts using camera images and mmWave radar.
The key contributions are the dataset (11,568 samples with 34,987 referred targets, annotated with bounding boxes and masks) and a Phased Heterogeneous Modality Fusion (PHMF) design built from Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA).
The work aims to let USVs perceive waterway scenes according to human intent, supporting autonomous navigation and operations on water.

Plain English Explanation

The paper describes WaterVG, a new dataset, together with a model called Potamoi. Both are meant to help computers find the objects a person describes in a waterway scene (such as a river, lake, or canal) from a natural language prompt. This is an important task for self-driving boats, known as Unmanned Surface Vehicles (USVs), which need to navigate and operate in water environments.

The key idea is to combine two different types of sensors: a camera to see the visual scene, and an mmWave radar to sense objects on the water. By fusing the information from these two modalities under the guidance of the text prompt, the model can more accurately identify and locate the targets the prompt refers to.
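As a rough illustration of fusing the two sensors, the sketch below scales radar features with a sigmoid gate before adding them to image features. This is not the paper's Adaptive Radar Weighting module, whose details are not given in this summary. The function name `gated_radar_fusion`, the feature shapes, and the random stand-in weights are all assumptions.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def gated_radar_fusion(image_feat, radar_feat, w_gate, b_gate):
    """Fuse per-location features of shape (N, C) (hypothetical helper).

    A gate in (0, 1) is predicted from both modalities and scales the
    radar features before they are added to the image features.
    """
    joint = np.concatenate([image_feat, radar_feat], axis=-1)  # (N, 2C)
    gate = sigmoid(joint @ w_gate + b_gate)                    # (N, C)
    return image_feat + gate * radar_feat, gate


rng = np.random.default_rng(0)
n_locations, channels = 6, 8
image_feat = rng.normal(size=(n_locations, channels))
radar_feat = rng.normal(size=(n_locations, channels))
# Stand-ins for weights that would normally be learned during training.
w_gate = rng.normal(scale=0.1, size=(2 * channels, channels))
b_gate = np.zeros(channels)

fused, gate = gated_radar_fusion(image_feat, radar_feat, w_gate, b_gate)
print(fused.shape)
```

A gate close to 0 leaves the image feature almost unchanged, while a gate close to 1 adds nearly the full radar feature, so the network can decide per location how much radar evidence to use.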

The researchers also created a new dataset of waterway scenes paired with text prompts to train and evaluate their model. According to the abstract, it is the first visual grounding dataset designed for USV-based waterway perception.
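To make the dataset description concrete, here is a minimal sketch of how one sample might be represented in Python. The class names, field names, and the box and mask encodings are assumptions for illustration only; this summary does not specify the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReferredTarget:
    """One target referred to by the prompt (hypothetical layout)."""
    box_xyxy: Tuple[float, float, float, float]  # bounding box: x1, y1, x2, y2 in pixels
    mask_polygon: List[Tuple[float, float]]      # instance mask as polygon vertices


@dataclass
class GroundingSample:
    """One text, image, and radar sample (hypothetical layout)."""
    prompt: str                                  # free-form text describing the target(s)
    image_path: str                              # camera frame
    radar_points: List[Tuple[float, ...]]        # radar point cloud, one tuple per point
    targets: List[ReferredTarget] = field(default_factory=list)

    def num_targets(self) -> int:
        # A single prompt may refer to several targets.
        return len(self.targets)


# Made-up values, only to show the structure.
sample = GroundingSample(
    prompt="the two boats moving away on the left",
    image_path="frame_000123.jpg",
    radar_points=[(12.4, -3.1, 0.2, 1.5), (15.0, -4.0, 0.1, 1.7)],
    targets=[
        ReferredTarget((40.0, 120.0, 180.0, 220.0),
                       [(40.0, 120.0), (180.0, 120.0), (180.0, 220.0)]),
        ReferredTarget((200.0, 130.0, 300.0, 210.0),
                       [(200.0, 130.0), (300.0, 130.0), (300.0, 210.0)]),
    ],
)
print(sample.num_targets())
```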

Overall, WaterVG and Potamoi represent an advancement in visual grounding: the ability of computers to connect natural language to specific regions of the visual world. This could have valuable applications in areas like autonomous navigation, assistive technology, and human-robot interaction.

Technical Explanation

The paper introduces a dataset, WaterVG, and a model, Potamoi, for grounding natural language prompts in waterway scenes. The key contributions include:

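WaterVG, a dataset of 11,568 samples with 34,987 referred targets (about three per sample), in which prompts can describe multiple targets and draw on both visual and radar characteristics. Annotations are at the instance level and include bounding boxes and masks.
Potamoi, a low-power multi-task visual grounding model that uses Phased Heterogeneous Modality Fusion (PHMF) to combine camera, mmWave radar, and text features.
Two PHMF components: Adaptive Radar Weighting (ARW), which extracts the radar features needed to fuse with vision for prompt alignment, and Multi-Head Slim Cross Attention (MHSCA), a fusion module with a small parameter count and low FLOPs that fuses the scene context from the two sensors with linguistic features.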
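The fusion of sensor features with text features can be pictured with ordinary multi-head cross-attention. The NumPy sketch below is the generic scaled dot-product form, in which fused sensor features act as queries and text features act as keys and values. It is not the paper's architecture, and in particular it does not reproduce any parameter-saving design. All names, shapes, and random weights are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, context, wq, wk, wv, wo, num_heads):
    """Generic cross-attention (hypothetical helper).

    queries: (Nq, D) fused sensor features; context: (Nc, D) text features.
    Returns the attended features (Nq, D) and attention weights (H, Nq, Nc).
    """
    nq, d = queries.shape
    nc = context.shape[0]
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = d // num_heads

    def split_heads(x, n):
        # (n, D) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(queries @ wq, nq)
    k = split_heads(context @ wk, nc)
    v = split_heads(context @ wv, nc)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, Nq, Nc)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    heads = weights @ v                                    # (H, Nq, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(nq, d)       # (Nq, D)
    return merged @ wo, weights


rng = np.random.default_rng(0)
dim, num_heads = 16, 4
sensor_feat = rng.normal(size=(10, dim))  # e.g. 10 spatial locations
text_feat = rng.normal(size=(5, dim))     # e.g. 5 prompt tokens
# Stand-ins for projection matrices that would normally be learned.
wq, wk, wv, wo = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4))

out, weights = multi_head_cross_attention(
    sensor_feat, text_feat, wq, wk, wv, wo, num_heads
)
print(out.shape, weights.shape)
```

Each spatial location ends up with a text-conditioned feature, which a downstream head could then use to predict boxes or masks for the referred targets.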

The authors evaluate Potamoi on the WaterVG dataset. According to the abstract, it achieves state-of-the-art performance compared with counterpart models.
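Visual grounding results are commonly scored by the overlap between predicted and ground-truth boxes. The sketch below computes intersection over union (IoU) and the fraction of predictions whose IoU reaches a threshold. The 0.5 threshold and the function names are assumptions, and this summary does not state which metrics the paper reports.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], targets: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
targets = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0)]
print(accuracy_at_iou(preds, targets))
```

In this toy case the first prediction matches its target exactly and the second overlaps it only slightly, so one of the two predictions counts as correct.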

Critical Analysis

The paper makes a valuable contribution by addressing the important problem of waterway understanding and navigation using multimodal sensing. The authors' novel dataset and radar-augmented visual grounding approach represent a significant step forward in this area.

However, the paper also acknowledges several limitations and avenues for future work. For example, the dataset is relatively small and may not capture the full diversity of waterway environments. Additionally, the current system relies on pre-trained vision and language models, which could limit its performance on out-of-domain inputs.

Further research could explore ways to expand the dataset, develop more robust and generalizable multimodal fusion techniques, and investigate the system's performance in real-world autonomous navigation scenarios. Incorporating additional sensor modalities, such as sonar or lidar, could also be a promising direction to explore.

Overall, WaterVG and Potamoi demonstrate the potential of combining multiple sensory inputs for improved language grounding and understanding of complex visual environments. The work lays a solid foundation for future research in this area.

Conclusion

The WaterVG dataset and the Potamoi model presented in this paper advance visual grounding for waterway scenes, with a specific focus on localizing the targets that natural language prompts refer to.

The key contributions are the WaterVG dataset, the Potamoi model with its Phased Heterogeneous Modality Fusion of text-guided vision and mmWave radar, and experiments on WaterVG in which Potamoi achieves state-of-the-art performance compared with counterparts.

While the paper acknowledges several limitations and areas for future work, this work represents an important step towards more robust and accurate perception of waterway environments by USVs. This could have valuable applications in autonomous vehicles, robotics, and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.

4/23/2024

cs.CV

WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmarks for Autonomous Driving on Water Surfaces

Shanliang Yao, Runwei Guan, Zhaodong Wu, Yi Ni, Zile Huang, Ryan Wen Liu, Yong Yue, Weiping Ding, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Jieming Ma, Xiaohui Zhu, Yutao Yue

Autonomous driving on water surfaces plays an essential role in executing hazardous and time-consuming missions, such as maritime surveillance, survivors rescue, environmental monitoring, hydrography mapping and waste cleaning. This work presents WaterScenes, the first multi-task 4D radar-camera fusion dataset for autonomous driving on water surfaces. Equipped with a 4D radar and a monocular camera, our Unmanned Surface Vehicle (USV) proffers all-weather solutions for discerning object-related information, including color, shape, texture, range, velocity, azimuth, and elevation. Focusing on typical static and dynamic objects on water surfaces, we label the camera images and radar point clouds at pixel-level and point-level, respectively. In addition to basic perception tasks, such as object detection, instance segmentation and semantic segmentation, we also provide annotations for free-space segmentation and waterline segmentation. Leveraging the multi-task and multi-modal data, we conduct benchmark experiments on the uni-modality of radar and camera, as well as the fused modalities. Experimental results demonstrate that 4D radar-camera fusion can considerably improve the accuracy and robustness of perception on water surfaces, especially in adverse lighting and weather conditions. WaterScenes dataset is public on https://waterscenes.github.io.

6/18/2024

cs.CV cs.RO

A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions

Daizong Liu, Yang Liu, Wencan Huang, Wei Hu

Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene, has drawn increasing attention in the 3D research community over the past few years. Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing. In this survey, we attempt to provide a comprehensive overview of the T-3DVG progress, including its fundamental elements, recent research advances, and future research directions. To the best of our knowledge, this is the first systematic survey on the T-3DVG task. Specifically, we first provide a general structure of the T-3DVG pipeline with detailed components in a tutorial style, presenting a complete background overview. Then, we summarize the existing T-3DVG approaches into different categories and analyze their strengths and weaknesses. We also present the benchmark datasets and evaluation metrics to assess their performances. Finally, we discuss the potential limitations of existing T-3DVG and share some insights on several promising research directions. The latest papers are continually collected at https://github.com/liudaizong/Awesome-3D-Visual-Grounding.

6/11/2024

cs.CV

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Since users usually prefer to provide intention-based expression for the desired object instead of covering all the details, it is necessary for the agents to interpret the intention-driven instructions. Thus, in this work, we take a step further to the intention-driven visual-language (V-L) understanding. To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset termed IntentionVG with free-form intention expressions. Considering that practical agents need to move and find specific targets among various scenarios to realize the grounding task, our IVG task and IntentionVG dataset have taken the crucial properties of both multi-scenario perception and egocentric view into consideration. Besides, various types of models are set up as the baselines to realize our IVG task. Extensive experiments on our IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available at https://github.com/Rubics-Xuan/IVG.

5/27/2024

cs.CV