360+x: A Panoptic Multi-modal Scene Understanding Dataset

2404.00989

Published 4/9/2024 by Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini, Xiaohan Hong, Jianbo Jiao

Abstract

Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a certain perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views with rich modalities including video, multi-channel audio, directional binaural delay, location data and textual scene descriptions within each scene captured, presenting comprehensive observation of the world. Figure 1 offers a glimpse of all 28 scene categories of our 360+x dataset. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we presented 5 different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset could broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.

Overview

Introduces a new dataset called "360+x" for panoptic multi-modal scene understanding
Provides a benchmark of five scene understanding tasks to evaluate the impact and benefit of each data modality and viewpoint
Combines third-person panoramic and front views with egocentric monocular and binocular views, and includes video, multi-channel audio, directional binaural delay, location data, and textual scene descriptions

Plain English Explanation

The "360+x" dataset is a new resource for training and testing AI models that aim to understand the world around us. It captures each scene from several viewpoints at once: third-person panoramic and front views, plus egocentric (first-person) views. Alongside video, it includes multi-channel audio, directional binaural delay, location data, and textual scene descriptions. This allows AI systems to learn from the full context of a scene, rather than from a single camera with a limited field of view.

The goal of this dataset is to push the boundaries of what scene understanding models can do by challenging them to make sense of these rich, multi-view, multi-modal recordings. Here "panoptic" means multiple viewpoints combined with multiple data modalities, which mimics how people access information in daily life. This could lead to AI systems that can better navigate, interact with, and assist humans in the real world.

Technical Explanation

The "360+x" dataset was created to address the limitations of existing scene understanding datasets, which typically focus on a single perspective (for example, egocentric or third-person views only). By pairing third-person panoramic and front views with egocentric monocular and binocular views, and by recording video, multi-channel audio, directional binaural delay, location data, and textual scene descriptions for each scene, the dataset supports the development of more robust and comprehensive scene understanding models.
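The abstract lists directional binaural delay among the modalities, but this summary does not say how the authors compute it. As a rough illustration only, the sketch below estimates the delay between two audio channels with a brute-force cross-correlation. The function names, the `max_lag` parameter, and the synthetic impulse signals are all assumptions made for this example and are not taken from the paper or its code.

```python
def estimate_delay_samples(left, right, max_lag):
    """Return the lag (in samples) at which `right` best aligns with `left`.

    A positive result means the right channel lags behind the left channel.
    This is a generic cross-correlation search, not the paper's method.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, left_value in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += left_value * right[j]
        # Keep the first lag that achieves the highest correlation score.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_seconds(left, right, sample_rate, max_lag):
    """Convert the estimated sample lag into seconds."""
    return estimate_delay_samples(left, right, max_lag) / sample_rate


# Synthetic example: the same impulse arrives 3 samples later on the right.
left_channel = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right_channel = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
lag = estimate_delay_samples(left_channel, right_channel, max_lag=4)
```

The sign of the estimated lag indicates which side the sound reached first, which is why a delay signal of this kind can carry directional information. Real recordings would need a faster FFT-based correlation and some handling of noise.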

The dataset covers 28 scene categories, illustrated in Figure 1 of the paper. On top of this data, the authors run a benchmark analysis of five scene understanding tasks, which measures how much each data modality and each viewpoint contributes to panoptic scene understanding.
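To make the multi-view, multi-modal organization concrete, here is a minimal sketch of how one scene sample could be represented in code. The class names, field names, view labels, and file names are hypothetical and chosen only for illustration; the actual file layout and loader of the released dataset are not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Viewpoints named in the abstract; the string labels are assumptions.
VIEWS = ("panoramic", "front", "ego_monocular", "ego_binocular")


@dataclass
class ViewRecording:
    """One viewpoint of a scene (hypothetical structure)."""
    video_path: str
    audio_path: Optional[str] = None           # multi-channel audio, if any
    binaural_delay_path: Optional[str] = None  # directional binaural delay


@dataclass
class SceneSample:
    """One captured scene with its shared metadata and per-view recordings."""
    scene_category: str
    description: str                               # textual scene description
    location: Optional[Tuple[float, float]] = None  # e.g. latitude, longitude
    views: Dict[str, ViewRecording] = field(default_factory=dict)

    def add_view(self, name: str, recording: ViewRecording) -> None:
        if name not in VIEWS:
            raise ValueError(f"unknown view: {name}")
        self.views[name] = recording

    def missing_views(self) -> List[str]:
        """List the viewpoints not yet attached to this sample."""
        return [view for view in VIEWS if view not in self.views]


sample = SceneSample(scene_category="example_category",
                     description="A placeholder scene description.")
sample.add_view("panoramic", ViewRecording(video_path="pano.mp4"))
```

Keeping the viewpoints in a dictionary keyed by name makes it easy to run ablations that drop one view or modality at a time, which is the kind of comparison the benchmark is built around.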

Critical Analysis

The "360+x" dataset represents a significant advancement in the field of multi-modal scene understanding, addressing many of the limitations of previous datasets. However, some potential concerns and areas for further research include:

Scalability: While the dataset is large, there may be a need for even more diverse and comprehensive data to fully capture the complexity of real-world environments, especially in edge cases or rare scenarios.
Annotation Quality: The accuracy and consistency of the dataset's annotations, particularly for complex and ambiguous scenes, will be crucial for the reliable training and evaluation of AI models.
Real-World Applicability: It remains to be seen how well the models trained on this dataset will generalize to actual deployment scenarios, where the environment, lighting, and sensor capabilities may differ from the dataset's characteristics.

Addressing these concerns and further expanding the dataset's capabilities could lead to more advanced multi-view scene understanding solutions, with implications for a wide range of applications, from autonomous navigation to immersive virtual reality experiences.

Conclusion

The "360+x" dataset provides a benchmark for evaluating how well AI models comprehend real-world scenes when given multiple viewpoints and multiple data modalities. By combining panoramic, front, and egocentric views with video, audio, location data, and textual descriptions, the dataset supports the development of more robust scene understanding systems, with potential applications in areas like autonomous navigation, augmented reality, and intelligent assistants.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

360 in the Wild: Dataset for Depth Prediction and View Synthesis

Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon

The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large-scale 360$^\circ$ videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

6/28/2024

cs.CV cs.AI

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, Sai-Kit Yeung

Portable 360$^\circ$ cameras are becoming a cheap and efficient tool to establish large visual databases. By capturing omnidirectional views of a scene, these cameras could expedite building environment models that are essential for visual localization. However, such an advantage is often overlooked due to the lack of valuable datasets. This paper introduces a new benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth poses for visual localization. We present a practical implementation of 360$^\circ$ mapping combining 360$^\circ$ images with lidar data to generate the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that explores the challenge of cross-device visual positioning, involving 360$^\circ$ reference frames, and query frames from pinhole, ultra-wide FoV fisheye, and 360$^\circ$ cameras. We propose a virtual camera approach to generate lower-FoV query frames from 360$^\circ$ images, which ensures a fair comparison of performance among different query types in visual localization tasks. We also extend this virtual camera approach to feature matching-based and pose regression-based methods to alleviate the performance loss caused by the cross-device domain gap, and evaluate its effectiveness against state-of-the-art baselines. We demonstrate that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures. These results provide new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries.

6/3/2024

cs.CV

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Ruiyuan Lyu, Tai Wang, Jingli Lin, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involve humans' correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

6/14/2024

cs.CV cs.AI cs.RO

A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.

6/6/2024

cs.CV cs.MM