Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

2404.05578

Published 4/9/2024 by Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi

Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

Abstract

For a complete comprehension of multi-person scenes, it is essential to go beyond basic tasks like detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data efficient representations of motion in human crowded scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder which operates on multi-person joints' trajectory in the frequency domain. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model combined with our pre-training approach achieves the state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both human 2D and 3D body pose.

Create account to get full access

Overview

Introduces a new approach called "Social-MAE" for learning multi-person motion representations in an unsupervised manner
Leverages a masked autoencoder architecture to capture the social interactions and dynamics between people
Aims to improve the quality of motion synthesis and transfer for various applications like robotics and animation

Plain English Explanation

The paper presents a new technique called "Social-MAE" (Social Masked Autoencoder) for learning how people move and interact with each other in an unsupervised way. The key idea is to take motion data of multiple people moving together, mask out parts of the motion data, and then train an autoencoder model to try to predict the missing parts.

This encourages the model to learn meaningful representations of how people's movements are related and influenced by the movements of others around them. The authors show that this approach leads to better performance on tasks like motion synthesis and transfer compared to models trained without this social awareness.

The motivation is to develop AI systems that can better understand and generate natural human-like motion, which could be useful for applications like robotics, animation, and interactions between humans and AI agents.

Technical Explanation

The core of the Social-MAE approach is a masked autoencoder architecture that takes in multi-person motion data as input. The model randomly masks out portions of the input motion sequences, and then tries to predict the missing parts based on the surrounding context.

This encourages the autoencoder to learn a rich latent representation that captures the social dynamics and interdependencies between the people in the scene. The authors experiment with different masking strategies and show that a "social-aware" masking scheme, which accounts for the relative spatial positions of the people, leads to better performance.

The trained Social-MAE model can then be used as a feature extractor, with the learned latent representations serving as a robust and informative encoding of the multi-person motion. The authors demonstrate the effectiveness of this approach on a variety of downstream tasks, including motion synthesis, transfer, and human-robot interaction.

Critical Analysis

The Social-MAE paper makes a compelling case for the importance of modeling social interactions when learning representations of human motion. The authors provide thorough experiments and ablation studies to validate their approach, and the results show clear improvements over prior methods.

However, one potential limitation is that the paper focuses primarily on short-term, relatively simple motion sequences. It remains to be seen how well the Social-MAE approach would scale to more complex, long-term motion patterns involving larger groups of people.

Additionally, the paper does not explore the interpretability or explainability of the learned representations. It would be interesting to understand how the model is capturing the social dynamics and what specific relational cues it is learning to identify.

Despite these minor caveats, the Social-MAE paper represents a significant contribution to the field of human motion modeling and has the potential to enable more natural and socially aware AI systems across a variety of applications.

Conclusion

The Social-MAE paper introduces a novel approach for learning multi-person motion representations in an unsupervised manner. By leveraging a masked autoencoder architecture to capture the social dynamics between people, the model is able to learn more robust and informative encodings of human motion.

The authors demonstrate the effectiveness of this approach on several downstream tasks, showcasing its potential to enhance the realism and social awareness of AI systems in areas like robotics, animation, and human-machine interaction. While the paper has some limitations, it represents an important step forward in the field of human motion modeling and highlights the importance of considering social context when learning representations of human behavior.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏷️

Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders

Simon Dahan, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Emma C. Robinson

The development of robust and generalisable models for encoding the spatio-temporal dynamics of human brain activity is crucial for advancing neuroscientific discoveries. However, significant individual variation in the organisation of the human cerebral cortex makes it difficult to identify population-level trends in these signals. Recently, Surface Vision Transformers (SiTs) have emerged as a promising approach for modelling cortical signals, yet they face some limitations in low-data scenarios due to the lack of inductive biases in their architecture. To address these challenges, this paper proposes the surface Masked AutoEncoder (sMAE) and video surface Masked AutoEncoder (vsMAE) - for multivariate and spatio-temporal pre-training of cortical signals over regular icosahedral grids. These models are trained to reconstruct cortical feature maps from masked versions of the input by learning strong latent representations of cortical structure and function. Such representations translate into better modelling of individual phenotypes and enhanced performance in downstream tasks. The proposed approach was evaluated on cortical phenotype regression using data from the young adult Human Connectome Project (HCP) and developing HCP (dHCP). Results show that (v)sMAE pre-trained models improve phenotyping prediction performance on multiple tasks by $ge 26%$, and offer faster convergence relative to models trained from scratch. Finally, we show that pre-training vision transformers on large datasets, such as the UK Biobank (UKB), supports transfer learning to low-data regimes. Our code and pre-trained models are publicly available at https://github.com/metrics-lab/surface-masked-autoencoders .

6/12/2024

eess.IV cs.CV

👁️

MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named as the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). The MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, the MultiMAEDER is accomplished through simple, straightforward finetuning. The performance of the MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on the CREMAD. Furthermore, when compared with the state-of-the-art model of multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.

5/17/2024

cs.CV

MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition

Weichao Zhao, Hezhen Hu, Wengang Zhou, Yunyao Mao, Min Wang, Houqiang Li

Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: 1) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. 2) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masked strategy to reconstruct motion residuals of masked frames, thereby explicitly exploring dynamic motion cues among sign pose sequences. Moreover, in SA, we embed our framework with global semantic awareness by aligning the embeddings of different augmented samples from the input sequence in the shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks.

6/3/2024

cs.CV

SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

Kejia Yin, Varshanth R. Rao, Ruowei Jiang, Xudong Liu, Parham Aarabi, David B. Lindell

Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. To tackle this task, existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms, which neglect the dense prediction nature of the task, (2) aggregate them into memory-intensive hypercolumn formations, and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper, we introduce SCE-MAE, a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task, (2) operates on the vanilla feature map instead of on expensive hypercolumns, and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust, outperforming existing SOTA methods by large margins of approximately 20%-44% on the landmark matching and approximately 9%-15% on the landmark detection tasks.

5/29/2024

cs.CV cs.AI