SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos






Published 4/9/2024 by Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.


  • This research paper explores a novel approach to learning the sounds associated with different actions by analyzing narrated egocentric (first-person) videos.
  • The proposed embedding, multimodal contrastive-consensus coding (MC3), strengthens the associations between audio, language, and vision when all modality pairs agree and weakens them when any one pair does not. The resulting action-sound associations could support AI applications such as audio-visual conversational agents, sound-based vision-language models, and text-guided sound source localization.

Plain English Explanation

The researchers wanted to understand the typical sounds associated with different human actions, such as pouring water or opening a door. To do this, they analyzed a large collection of in-the-wild first-person videos of people performing everyday tasks, each paired with narration describing the action being performed. Not every narrated action makes a sound, and not every sound in a clip comes from the narrated action, so the method trusts a clip most when its audio, video, and narration all agree. The learned associations could improve AI models that perceive the world through both vision and sound, and could help robots better understand their surroundings.

Technical Explanation

The researchers developed the SoundingActions framework, which consists of three main components:

  1. Video-Audio Alignment: The method aligns the narration in the egocentric videos with the corresponding audio signals, allowing the researchers to associate specific sounds with the described actions.

  2. Action-Sound Prediction: Using the aligned video-audio data, the researchers trained deep learning models to predict the characteristic sounds of different actions, based on visual cues.

  3. Action-Sound Database: The researchers compiled a comprehensive database of action-sound associations, which can be used to enhance various AI applications that involve understanding the auditory aspects of the physical world.
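The consensus idea from the abstract (strengthen audio, language, and vision associations when all pairs agree, weaken them otherwise) can be illustrated with a small NumPy sketch. This is not the paper's MC3 formulation: the function names, the min-over-pairs agreement score, and the temperature value are all assumptions chosen for illustration. In a real training loop the weights would typically be treated as constants rather than differentiated through.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce_per_sample(a, b, temperature=0.1):
    """Per-sample InfoNCE loss where (a[i], b[i]) is the positive pair."""
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)

def consensus_weights(audio, video, text):
    """Hypothetical agreement score: the weakest of the three pairwise
    cosine similarities, mapped from [-1, 1] to [0, 1]."""
    av = np.sum(audio * video, axis=1)
    at = np.sum(audio * text, axis=1)
    vt = np.sum(video * text, axis=1)
    agreement = np.minimum(np.minimum(av, at), vt)
    return np.clip((agreement + 1.0) / 2.0, 0.0, 1.0)

def consensus_weighted_loss(audio, video, text, temperature=0.1):
    """Weight each clip's pairwise contrastive losses by its consensus score,
    so clips where any modality pair disagrees contribute less."""
    audio, video, text = map(l2_normalize, (audio, video, text))
    w = consensus_weights(audio, video, text)
    pair_losses = (info_nce_per_sample(audio, video, temperature)
                   + info_nce_per_sample(audio, text, temperature)
                   + info_nce_per_sample(video, text, temperature))
    loss = float(np.sum(w * pair_losses) / (np.sum(w) + 1e-8))
    return loss, w

# Usage with random stand-in embeddings (batch of 4 clips, 8 dimensions).
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 8))
video_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
loss, weights = consensus_weighted_loss(audio_emb, video_emb, text_emb)
```

Taking the minimum over the three pairs means a single disagreeing pair is enough to down-weight a clip, which mirrors the "any one pair does not agree" wording of the abstract; other aggregation choices are equally plausible.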

The experiments show that the approach can discover how a long tail of human actions sound from egocentric video. It outperforms an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) across multiple cross-modal tasks.
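Cross-modal results of this kind are commonly scored with retrieval metrics. The sketch below computes recall@k for retrieving the matching audio clip given a video (or text) query embedding. It is a generic illustration, not the paper's evaluation protocol; the function name and the convention that query `i` matches gallery item `i` are assumptions.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item (same row index)
    is ranked within the top k by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    # Rank of the correct item = number of gallery items scoring strictly higher.
    correct = np.diag(sims)[:, None]
    ranks = (sims > correct).sum(axis=1)
    return float((ranks < k).mean())

# Usage: random stand-in embeddings for 5 video queries and 5 audio clips.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(5, 16))
audio_emb = video_emb + 0.1 * rng.normal(size=(5, 16))  # noisy matches
score = recall_at_k(video_emb, audio_emb, k=1)
```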

Critical Analysis

The paper presents a compelling approach to learning the auditory aspects of human actions, which could have significant implications for various AI applications. However, the research is limited to a specific set of everyday tasks and actions captured in the egocentric videos. Further work may be needed to expand the scope of the action-sound database and to address potential biases or limitations in the underlying data.

Additionally, the paper does not delve into the potential ethical considerations or privacy concerns that may arise from using first-person video recordings to build such a comprehensive database of human actions and their associated sounds. These are important factors to consider as the technology continues to develop.

Conclusion

The SoundingActions framework represents a novel and promising approach to understanding the auditory aspects of human actions. By leveraging narrated egocentric videos, the researchers have been able to build a comprehensive database of action-sound associations that can enhance various AI applications, from audio-visual conversational agents to sound-based vision-language models. As the field of AI continues to advance, technologies like SoundingActions may play an increasingly important role in helping machines better understand and interact with the physical world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman





Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.


Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Sagnik Majumder, Ziad Al-Halah, Kristen Grauman





We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project:



Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi Kalayeh





Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.


Retrieval-Augmented Egocentric Video Captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie





Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at:
