EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

2403.16182

Published 6/6/2024 by Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang and 1 other

cs.CV

Abstract

Being able to map the activities of others into one's own point of view is one fundamental human skill even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing on the potential applications in daily assistance and professional support, EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end, we present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. We expect EgoExoLearn can serve as an important resource for bridging the actions across views, thus paving the way for creating AI agents capable of seamlessly learning by observing humans in the real world. Code and data can be found at: https://github.com/OpenGVLab/EgoExoLearn

Overview

The paper introduces a dataset called "EgoExoLearn" for studying procedural activities in the real world from both ego-centric (first-person) and exo-centric (third-person) perspectives.
The dataset pairs egocentric videos, recorded by participants as they carry out tasks, with the demonstration videos that guided them. The two views are asynchronous rather than recorded at the same time, and they come with gaze data and detailed multimodal annotations.
The paper presents benchmarks for cross-view association, cross-view action planning, and cross-view referenced skill assessment, with a focus on potential applications in daily assistance and professional support.

Plain English Explanation

The researchers who created the EgoExoLearn dataset wanted to study how people learn a task by following someone else's demonstration. To do this, they had participants record first-person (ego-centric) videos while carrying out tasks guided by demonstration videos, which show the activity from an outside (exo-centric) view. The two recordings are made at different times, so the views are asynchronous.

By pairing these two viewpoints even though they were recorded at different times, the researchers hoped to better understand how people map the actions they see others perform onto their own point of view. This could provide insights into things like how we learn new skills, how we communicate complex procedures to others, and how robots or intelligent systems might be able to observe and assist humans more effectively.

The EgoExoLearn dataset contains 120 hours of egocentric and demonstration video, along with gaze data and detailed annotations, captured in daily life scenarios and specialized laboratories. The researchers designed the data collection process to be as natural and unobtrusive as possible, so the activities captured would be representative of real-world situations.

Technical Explanation

The EgoExoLearn dataset was created to support research into bridging the gap between first-person (ego-centric) and third-person (exo-centric) perspectives on procedural activities. It pairs asynchronous egocentric and demonstration videos, together with gaze data and multimodal annotations, for tasks performed in daily life scenarios and specialized laboratories. In total it spans 120 hours of video.

In the data collection process, participants recorded egocentric videos while executing tasks guided by demonstration videos, emulating how people follow a demonstration. High-quality gaze data was recorded along with the videos, and the researchers provide detailed multimodal annotations.

On top of this data, the paper presents benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. The researchers expect the dataset to help in building AI agents that learn by observing humans in the real world. The dataset could also be used to develop more realistic egocentric video synthesis models and to study the differences between how humans and machines perceive and understand procedural activities.
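
To make the idea of cross-view association more concrete, here is a minimal sketch that treats it as nearest-neighbor retrieval: given a feature vector for an egocentric clip, pick the demonstration clip whose feature vector is most similar. This is an illustrative assumption, not the benchmark protocol or model used in the paper. The clip names, the three-dimensional toy embeddings, and the `cosine` and `associate` helpers are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def associate(ego_embedding, demo_embeddings):
    """Return the name of the demonstration clip most similar to the ego clip."""
    return max(
        demo_embeddings,
        key=lambda name: cosine(ego_embedding, demo_embeddings[name]),
    )


# Toy, made-up embeddings; a real system would obtain these from a video encoder.
demo_clips = {
    "demo_pour": [0.9, 0.1, 0.0],
    "demo_stir": [0.1, 0.9, 0.1],
    "demo_cut": [0.0, 0.2, 0.9],
}
ego_clip = [0.2, 0.8, 0.0]

print(associate(ego_clip, demo_clips))  # prints "demo_stir"
```

Because the two views are recorded at different times, a sketch like this matches clips by content similarity rather than by timestamp; how the features are learned is the hard part and is left out here.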

Critical Analysis

The EgoExoLearn dataset represents an important step forward in bridging the gap between ego-centric and exo-centric perspectives on human activities. By pairing asynchronous recordings from both viewpoints, the researchers have created a valuable resource for exploring how people learn, communicate, and perform complex skills.

One potential limitation of the dataset is the relatively small number of participants and tasks included. While the researchers aimed for a diverse set of activities, the dataset may not capture the full range of variability in how people approach and execute different procedures. Additionally, the dataset is focused on relatively short-term, discrete tasks, rather than longer-term, continuous activities.

Further research could explore ways to scale up the dataset, both in terms of the number of participants and the diversity of tasks covered. Longitudinal studies tracking skill acquisition and development over time could also provide valuable insights. Additionally, incorporating sensor modalities beyond the gaze data already recorded, such as physiological measurements, could yield additional perspectives on the cognitive and experiential aspects of procedural activities.

Overall, the EgoExoLearn dataset represents an important contribution to the field of activity understanding and skill learning. By providing a rich, multimodal dataset that bridges ego-centric and exo-centric perspectives, the researchers have opened up new avenues for exploring how humans and machines can better perceive, learn, and interact with the world around them.

Conclusion

The EgoExoLearn dataset introduces a novel approach to studying procedural activities in the real world by pairing first-person recordings with the demonstration videos that guided them, recorded asynchronously. This dataset has the potential to drive significant advancements in areas like activity recognition, skill learning, and human-robot interaction, by enabling researchers to better understand the differences between how people experience and communicate complex tasks.

While the dataset has some limitations in terms of scale and scope, it represents an important step forward in bridging the gap between ego-centric and exo-centric views of the world. By providing this valuable resource to the research community, the creators of EgoExoLearn have laid the groundwork for further exploration and innovation in this important field of study.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel expert commentary done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/

4/30/2024

cs.CV cs.AI

EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng

We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement--including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across dimensions of what, when, and how well. To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Code and data will be available at https://github.com/iSEE-Laboratory/EgoExo-Fitness/tree/main.

6/14/2024

cs.CV

Retrieval-Augmented Egocentric Video Captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: https://jazzcharles.github.io/Egoinstructor/

6/21/2024

cs.CV

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Amir Bar, Arya Bakhtiar, Danny Tran, Antonio Loquercio, Jathushan Rajasegaran, Yann LeCun, Amir Globerson, Trevor Darrell

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource to robotic quadruped locomotion, showing that models trained from EgoPet outperform those trained from prior datasets.

4/16/2024

cs.RO cs.CV