AdaFPP: Adapt-Focused Bi-Propagating Prototype Learning for Panoramic Activity Recognition

Read original: arXiv:2405.02538 - Published 5/7/2024 by Meiqi Cao, Rui Yan, Xiangbo Shu, Guangzhao Dai, Yazhou Yao, Guo-Sen Xie

Overview

This paper introduces a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework for panoramic activity recognition.
The goal is to identify individual, group, and global activities performed by multiple people in panoramic scenes.
The key innovations are a panoramic adapt-focuser for size-adapting detection of occluded individuals, and a bi-propagation prototyper for closed-loop interaction across activity granularities.

Plain English Explanation

The paper tackles the challenge of panoramic activity recognition: automatically identifying the actions and behaviors of multiple people within a wide panoramic camera view. This matters for applications such as smart surveillance, robot navigation, and activity analysis.

Previous approaches have struggled with two key issues. First, many rely heavily on manually annotated bounding boxes to locate people, during both training and inference, which is time-consuming and limits real-world deployment. Second, those that use standard object detectors cope poorly with the varying sizes and occlusions of people in a panoramic scene, which limits recognition performance.

To address these problems, the researchers propose the AdaFPP framework. The core ideas are:

Panoramic Adapt-Focuser: This module automatically detects people of different sizes, even when they are partially obscured by others in a crowded panoramic view. It does this by selecting the densest sub-regions of the scene, as indicated by an initial round of detections, and running finer-grained detection on them.
Bi-Propagation Prototyper: This component promotes information sharing across the recognition of individual, group, and overall activities. It allows the model to learn comprehensive activity representations, even when individual people are not perfectly localized.
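To make the adapt-focuser idea more concrete, here is a minimal Python sketch of one way to pick dense sub-regions from a first round of detections. It is an illustration only, not the authors' implementation: the fixed grid, the `select_dense_regions` name, and the `grid` and `top_k` parameters are all assumptions introduced for this example.

```python
def select_dense_regions(boxes, image_size, grid=(2, 4), top_k=2):
    """Return the top_k grid cells containing the most detection centers.

    boxes: list of (x1, y1, x2, y2) boxes from an initial detector pass.
    image_size: (width, height) of the panoramic frame.
    grid: (rows, cols) of a hypothetical fixed grid defining candidate sub-regions.
    """
    width, height = image_size
    rows, cols = grid
    cell_w, cell_h = width / cols, height / rows

    # Count how many box centers fall into each grid cell.
    counts = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        col = max(0, min(int(cx // cell_w), cols - 1))
        row = max(0, min(int(cy // cell_h), rows - 1))
        counts[row][col] += 1

    # Keep non-empty cells, densest first (ties keep row-major order).
    cells = [(r, c) for r in range(rows) for c in range(cols) if counts[r][c] > 0]
    cells.sort(key=lambda rc: -counts[rc[0]][rc[1]])

    regions = [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r, c in cells[:top_k]
    ]
    return regions, counts


# Toy example: three people cluster in the third quarter of an 800x200 frame.
boxes = [(410, 50, 450, 150), (460, 60, 500, 160), (520, 40, 560, 140), (20, 50, 60, 150)]
regions, counts = select_dense_regions(boxes, (800, 200), grid=(1, 4), top_k=1)
print(regions)  # [(400.0, 0.0, 600.0, 200.0)]
```

In a full pipeline, each selected region would presumably be cropped, enlarged, and passed through the detector again, with the new detections merged back into the original set. Those steps are omitted here because this summary does not describe how the paper performs them.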

By combining these innovations, the AdaFPP framework can recognize a diverse range of activities at multiple levels of granularity without relying on manually annotated bounding boxes. This makes it a promising tool for practical panoramic activity analysis applications.

Technical Explanation

The core technical contribution of this work is the AdaFPP framework, which jointly learns:

Panoramic Adapt-Focuser: This is a specialized object detector that can handle the varying sizes and occlusions of people in panoramic scenes. It first generates initial detections using a standard detector, then selectively focuses on and refines the detections in the most crowded sub-regions of the image.
Bi-Propagation Prototyper: This module facilitates bidirectional information propagation between the recognition of individual, group, and global activities. It learns prototypical representations that capture the essential features of each activity type at different granularities, and propagates these across the levels to improve overall performance.
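The bidirectional flow between granularities can be illustrated with a small pure-Python sketch. This is a simplified stand-in under stated assumptions, not the paper's architecture: mean pooling for the bottom-up pass, a fixed mixing weight `alpha` for the top-down pass, and cosine-similarity matching against fixed prototype vectors are all hypothetical choices made for this example, as are the function names.

```python
import math


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mix(a, b, alpha):
    """Blend vector a with context vector b; alpha is the weight given to b."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def bottom_up(individual, group_ids):
    """Pool individual features into group features, then into a global feature.

    Assumes group ids are 0..G-1 and every group has at least one member.
    """
    n_groups = max(group_ids) + 1
    groups = [
        mean([f for f, g in zip(individual, group_ids) if g == k])
        for k in range(n_groups)
    ]
    return groups, mean(groups)


def top_down(individual, groups, global_feat, group_ids, alpha=0.5):
    """Feed coarse context back: global -> groups, refined groups -> individuals."""
    groups_refined = [mix(g, global_feat, alpha) for g in groups]
    individual_refined = [
        mix(f, groups_refined[g], alpha) for f, g in zip(individual, group_ids)
    ]
    return individual_refined, groups_refined


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_prototype(feature, prototypes):
    """Index of the prototype most similar to the feature."""
    return max(range(len(prototypes)), key=lambda k: cosine(feature, prototypes[k]))


# Toy example: three people, two groups, 2-D features, two activity prototypes.
individual = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
group_ids = [0, 0, 1]
groups, global_feat = bottom_up(individual, group_ids)
refined, _ = top_down(individual, groups, global_feat, group_ids)
labels = [nearest_prototype(f, [[1.0, 0.0], [0.0, 1.0]]) for f in refined]
print(labels)  # [0, 0, 1]
```

The point of the sketch is the loop structure: fine-grained features inform coarser levels, and the coarser context then flows back to refine the finer ones, which can help when an individual's own feature is noisy because of imprecise localization. In a learned model the pooling, mixing, and prototypes would be trained rather than fixed.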

According to the authors' experiments, AdaFPP delivers significant performance gains on panoramic activity recognition. Importantly, it achieves this without relying on manual bounding box annotations, which makes it more scalable and practical for real-world deployment.

Critical Analysis

A key strength of this work is its ability to handle partial occlusions of people in panoramic scenes. The adapt-focuser module effectively deals with the challenge of varying person sizes and locations, which is crucial for reliable activity recognition in complex, crowded environments.

However, the paper does not extensively evaluate the model's robustness to other common challenges in activity recognition, such as viewpoint changes, lighting variations, or unusual activities. Further experiments in these areas would help demonstrate the broader applicability of the AdaFPP framework.

Additionally, the paper does not provide much insight into the computational efficiency or real-time performance of the approach. These factors are crucial for many practical applications of panoramic activity recognition, such as surveillance or robotics, and should be considered in future work.

Conclusion

This paper presents a novel AdaFPP framework that advances the state of the art in panoramic activity recognition. By jointly learning an adaptive person detector and multi-granularity activity prototypes, the framework can recognize individual, group, and global activities in complex, crowded panoramic scenes.

The key innovations - the panoramic adapt-focuser and bi-propagation prototyper - demonstrate the potential for robust and scalable panoramic activity analysis. This work paves the way for more practical applications of this technology in domains like smart cities, autonomous systems, and human behavior understanding.

Related Papers

AdaFPP: Adapt-Focused Bi-Propagating Prototype Learning for Panoramic Activity Recognition

Meiqi Cao, Rui Yan, Xiangbo Shu, Guangzhao Dai, Yazhou Yao, Guo-Sen Xie

Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) heavily rely on manually annotated detection boxes in training and inference, hindering further practical deployment; or 2) directly employ normal detectors to detect multiple persons with varying size and spatial occlusion in panoramic scenes, blocking the performance gain of PAR. To this end, we consider learning a detector adapting varying-size occluded persons, which is optimized along with the recognition module in the all-in-one framework. Therefore, we propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework to jointly recognize individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as the pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, achieving the size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through original detections. In addition, to mitigate information loss due to inaccurate individual localizations, we introduce a bi-propagation prototyper that promotes closed-loop interaction and informative consistency across different granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the significant performance of AdaFPP and emphasize its powerful applicability for PAR.

5/7/2024

MPT-PAR: Mix-Parameters Transformer for Panoramic Activity Recognition

Wenqing Gan, Yan Sun, Feiran Liu, Xiangfeng Luo

The objective of the panoramic activity recognition task is to identify behaviors at various granularities within crowded and complex environments, encompassing individual actions, social group activities, and global activities. Existing methods generally use either parameter-independent modules to capture task-specific features or parameter-sharing modules to obtain common features across all tasks. However, there is often a strong interrelatedness and complementary effect between tasks of different granularities that previous methods have yet to notice. In this paper, we propose a model called MPT-PAR that considers both the unique characteristics of each task and the synergies between different tasks simultaneously, thereby maximizing the utilization of features across multi-granularity activity recognition. Furthermore, we emphasize the significance of temporal and spatial information by introducing a spatio-temporal relation-enhanced module and a scene representation learning module, which integrate the spatio-temporal context of action and global scene into the feature map of each granularity. Our method achieved an overall F1 score of 47.5% on the JRDB-PAR dataset, significantly outperforming all the state-of-the-art methods.

8/2/2024

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Zhengcen Li, Xinle Chang, Yueran Li, Jingyong Su

Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employ Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms RGB-based approaches by using only estimated 2D keypoints as input. Code is available at https://github.com/mgiant/MP-GCN

7/30/2024

360SFUDA++: Towards Source-free UDA for Panoramic Segmentation by Learning Reliable Category Prototypes

Xu Zheng, Pengyuan Zhou, Athanasios V. Vasilakos, Lin Wang

In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++ that effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP) as it has less distortion and meanwhile split the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, as the distinct projections make it less possible to directly transfer knowledge between domains, we then propose Reliable Panoramic Prototype Adaptation Module (RP2AM) to transfer knowledge at both prediction and prototype levels. RP2AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.

4/26/2024