Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy

arXiv: 2405.01337

Published 5/3/2024 by Hoang-Quan Nguyen, Thanh-Dat Truong, Khoa Luu

Abstract

Action recognition has become one of the popular research topics in computer vision. Various methods based on Convolutional Networks and self-attention mechanisms such as Transformers address both the spatial and temporal dimensions of action recognition tasks and achieve competitive performance. However, these methods lack a guarantee of the correctness of the action subject that the models give attention to, i.e., how to ensure an action recognition model focuses on the proper action subject to make a reasonable action prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos using Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets. The contributions of this work are therefore three-fold. Firstly, we introduce the multi-view attention consistency to solve the problem of reasonable prediction in action recognition. Secondly, we define a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy. Thirdly, we build an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to the recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.

Overview

This paper proposes a multi-view attention consistency method for action recognition, built on a Directed Gromov-Wasserstein (DGW) Discrepancy.
The method aims to ensure that the model attends to the proper action subject by comparing the attention obtained from two different views of the same action video.
The authors report state-of-the-art results on three large-scale datasets: Jester, Something-Something V2, and Kinetics-400.

Plain English Explanation

The paper focuses on recognizing human actions in video while making sure the model looks at the right thing: the subject actually performing the action. This is hard to guarantee because the same action can look quite different depending on the camera position, or "view". The authors' proposed method addresses this by requiring the model's attention to stay consistent across different views of the same action.

At a high level, the key idea is to use a mathematical technique called Gromov-Wasserstein discrepancy to compare the attention from two views in a way that is insensitive to the camera position. This encourages the model to learn a more robust representation of the action that works across views. Because many datasets contain only a single view, the approach borrows the idea of Neural Radiance Fields to implicitly render features from novel views during training. The authors report state-of-the-art results on Jester, Something-Something V2, and Kinetics-400 compared to recent action recognition methods.

The intuition behind using Gromov-Wasserstein discrepancy is that it can find a way to match the underlying structure of the action representations across views, even if the raw visual features look quite different. This helps the model focus on the essential elements of the action that are common across views, rather than getting distracted by view-specific details.
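For readers who want the intuition in symbols, the standard squared-loss Gromov-Wasserstein discrepancy can be written as follows. This is the textbook (undirected) definition, not the paper's directed variant, whose exact form is not given in this summary. The notation is an assumption made here for illustration: C^(1) and C^(2) are pairwise structure matrices (for example, distances) within each view, p and q are weight vectors over the items in each view, and T is a coupling that softly matches items across views.

```latex
\[
\mathrm{GW}\bigl(C^{(1)}, C^{(2)}, p, q\bigr)
  = \min_{T \in \Pi(p, q)}
    \sum_{i,k=1}^{n} \sum_{j,l=1}^{m}
    \bigl( C^{(1)}_{ik} - C^{(2)}_{jl} \bigr)^{2} \, T_{ij} \, T_{kl},
\qquad
\Pi(p, q) = \bigl\{ T \in \mathbb{R}_{\ge 0}^{n \times m} :
    T \mathbf{1}_m = p,\; T^{\top} \mathbf{1}_n = q \bigr\}.
\]
```

Only within-view relationships (C^(1) and C^(2)) enter the cost, which is why the comparison does not require the raw features of the two views to live in the same space.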

Technical Explanation

The core of the authors' approach is a multi-view attention consistency framework based on the Directed Gromov-Wasserstein (DGW) Discrepancy. DGW is used to measure the similarity between the attention produced for the same action video under two different views.

Specifically, the action recognition model is built on Video Transformers, whose self-attention indicates which parts of the video the model focuses on. Because many action recognition datasets provide only a single view, the approach applies the idea of Neural Radiance Fields to implicitly render features from novel views during training.

The key innovation is the use of DGW as a metric for multi-view consistent attention. Gromov-Wasserstein-style discrepancies compare the internal structure of two sets of points rather than the points themselves, so the attention from two views can be compared even when the views differ in appearance. Keeping this discrepancy small encourages the model to attend to the same action subject regardless of view.
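To make this kind of structural comparison concrete, here is a minimal NumPy sketch of the standard (undirected) squared-loss Gromov-Wasserstein objective evaluated for a fixed coupling. It is not the authors' DGW implementation: the function names, the toy "views", and the use of Euclidean distance matrices as structure matrices are all assumptions chosen for illustration.

```python
import numpy as np


def gw_cost(C1, C2, T):
    """Squared-loss Gromov-Wasserstein objective for a fixed coupling T.

    C1: (n, n) pairwise structure matrix of the first view (hypothetical input).
    C2: (m, m) pairwise structure matrix of the second view (hypothetical input).
    T:  (n, m) coupling; T[i, j] is the mass matched between item i and item j.
    """
    # diff[i, j, k, l] = C1[i, k] - C2[j, l]
    diff = C1[:, None, :, None] - C2[None, :, None, :]
    return float(np.einsum("ijkl,ij,kl->", diff ** 2, T, T))


def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 3))   # toy "view 1" features (illustrative only)
perm = rng.permutation(n)
Y = X[perm] + 5.0             # toy "view 2": same items, reordered and shifted

C1, C2 = pairwise_distances(X), pairwise_distances(Y)

# Coupling that matches each item to its reordered copy: Y[a] is X[perm[a]].
T_matched = np.zeros((n, n))
T_matched[perm, np.arange(n)] = 1.0 / n

# Uninformative coupling that spreads mass evenly over all pairs.
T_uniform = np.full((n, n), 1.0 / n**2)

print("matched coupling:", gw_cost(C1, C2, T_matched))
print("uniform coupling:", gw_cost(C1, C2, T_uniform))
```

In this toy setup the matched coupling gives a cost of essentially zero (up to floating-point rounding), even though the second view is shifted and reordered, while the uniform coupling gives a strictly positive cost. A full discrepancy would additionally minimize over the coupling, which this sketch deliberately leaves out.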

The authors evaluate their approach on three large-scale action recognition datasets: Jester, Something-Something V2, and Kinetics-400. They report state-of-the-art results compared to recent action recognition methods.

Critical Analysis

The authors provide a thorough experimental evaluation of their method, including comparisons to several strong baselines. The results are compelling and suggest that the DGW-based approach can indeed learn more effective cross-view representations for action recognition.

However, the paper does not delve deeply into the limitations or potential failure cases of the proposed technique. For example, it's unclear how the method would perform in scenarios with highly diverse camera angles or significant occlusions. Additionally, the computational complexity of the DGW alignment process is not discussed, which could be an important practical consideration.
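As a rough point of reference for the complexity question, consider the standard squared-loss Gromov-Wasserstein objective (not the paper's directed variant, whose cost is not described in this summary). Evaluated naively for an n-by-m coupling, it sums over all index quadruples, which costs on the order of n²m² operations. A well-known expansion of the square reduces this to a few matrix products. The sketch below is illustrative only, and the function name and inputs are assumptions.

```python
import numpy as np


def gw_cost_fast(C1, C2, T):
    """Squared-loss GW objective for a fixed coupling, without a 4-index tensor.

    Expanding (C1[i, k] - C2[j, l])**2 gives two terms that depend only on the
    marginals of T and one cross term that is a product of three matrices.
    """
    p = T.sum(axis=1)                    # mass on each item of the first set
    q = T.sum(axis=0)                    # mass on each item of the second set
    term1 = p @ (C1 ** 2) @ p            # sum_ik C1[i,k]^2 p[i] p[k]
    term2 = q @ (C2 ** 2) @ q            # sum_jl C2[j,l]^2 q[j] q[l]
    cross = np.sum((C1 @ T @ C2.T) * T)  # sum_ijkl C1[i,k] C2[j,l] T[i,j] T[k,l]
    return float(term1 + term2 - 2.0 * cross)


rng = np.random.default_rng(0)
n, m = 5, 7
C1 = rng.random((n, n))   # arbitrary structure matrices for the check
C2 = rng.random((m, m))
T = rng.random((n, m))
T /= T.sum()              # normalize so the coupling has total mass 1

# Brute-force reference that builds the full (n, m, n, m) tensor.
diff = C1[:, None, :, None] - C2[None, :, None, :]
brute = np.einsum("ijkl,ij,kl->", diff ** 2, T, T)

print(np.isclose(gw_cost_fast(C1, C2, T), brute))
```

The matrix products cost on the order of n²m + nm² operations instead of n²m². How many such evaluations a solver needs, and whether the same trick carries over to the directed formulation, would have to be checked against the paper itself.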

Furthermore, the paper does not explore potential extensions or generalizations of the DGW framework. It would be interesting to see if the same principles could be applied to other multi-view learning tasks, such as object recognition or human pose estimation.

Overall, this is a strong technical contribution that advances the state of the art in multi-view action recognition. But there is still room for further research to better understand the strengths, weaknesses, and broader applicability of the DGW-based approach.

Conclusion

This paper presents a multi-view attention consistency method for action recognition that uses a Directed Gromov-Wasserstein Discrepancy to compare attention across two views of the same action video. By encouraging the model to focus on the proper action subject regardless of view, the approach achieves improved performance compared to recent action recognition methods.

The key insight is that Gromov-Wasserstein discrepancy provides a principled way to compare the underlying structure of action representations, rather than just their surface-level visual similarities. This allows the model to focus on the essential elements of the action that are common across views, rather than being distracted by view-specific details.

The authors demonstrate the effectiveness of their method on three large-scale datasets (Jester, Something-Something V2, and Kinetics-400), reporting state-of-the-art results. While the paper does not extensively explore the limitations of the approach, it represents an important step forward in multi-view action recognition and could inspire further research into cross-view learning techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

Ming Xu, Stephen Gould

We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.

4/9/2024

cs.CV cs.LG eess.IV

Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

Thanh-Dat Truong, Khoa Luu

Understanding action recognition in egocentric videos has emerged as a vital research topic with numerous practical applications. With the limitation in the scale of egocentric data collection, learning robust deep learning-based action recognition models remains difficult. Transferring knowledge learned from the large-scale exocentric data to the egocentric data is challenging due to the difference in videos across views. Our work introduces a novel cross-view learning approach to action recognition (CVAR) that effectively transfers knowledge from the exocentric to the egocentric view. First, we introduce a novel geometric-based constraint into the self-attention mechanism in Transformer based on analyzing the camera positions between two views. Then, we propose a new cross-view self-attention loss learned on unpaired cross-view data to enforce the self-attention mechanism learning to transfer knowledge across views. Finally, to further improve the performance of our cross-view learning approach, we present the metrics to measure the correlations in videos and attention maps effectively. Experimental results on standard egocentric action recognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and EPIC-Kitchens-100, have shown our approach's effectiveness and state-of-the-art performance.

5/16/2024

cs.CV

A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection

Matthew Korban, Peter Youngs, Scott T. Acton

This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.

5/15/2024

cs.CV

A Survey on Backbones for Deep Video Action Recognition

Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu

Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have also achieved great advancement. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Streams networks and their variants, which, specifically in this paper, use RGB video frame and optical flow modality as input; 2) 3D convolutional networks, which take advantage of the RGB modality directly, so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective insights in this review and hopefully provide a reference for future research.

5/10/2024

cs.CV cs.AI