Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

Read original: arXiv:2310.10352 - Published 4/23/2024 by Yifei Qian, Xiaopeng Hong, Zhongliang Guo, Ognjen Arandjelović, Carl R. Donovan

Overview

Presents a semi-supervised method for training a crowd counting model with less labeled data
Proposes a "subitizing" approach that leverages the model's understanding of crowd scenes
Incorporates a fine-grained density classification task to aid feature learning
Achieves state-of-the-art performance on challenging benchmarks such as ShanghaiTech A and UCF-QNRF

Plain English Explanation

Training a reliable crowd counting model typically requires a large amount of annotated data, which can be time-consuming and costly to obtain. To address this challenge, the paper introduces a new semi-supervised approach based on the mean teacher framework.

When there is a limited amount of labeled data available, the model may overfit to local patches and struggle to generalize. The conventional approach of solely improving the accuracy of local patch predictions using unlabeled data proves inadequate in such cases.

Instead, the paper proposes a more nuanced approach: fostering the model's intrinsic "subitizing" capability. Subitizing refers to the human ability to judge the number of objects in a scene at a glance, without counting them one by one. The researchers apply masking to unlabeled data, guiding the model to make predictions for the masked patches based on holistic cues, similar to how humans use their understanding of the overall scene to estimate crowd numbers.

Furthermore, the paper incorporates a fine-grained density classification task to help with feature learning. This general approach can be applied to most existing crowd counting methods without strict structural or loss constraints.

The researchers observe that the model trained with their framework exhibits a "subitizing-like" behavior. It can accurately predict low-density regions with just a "glance," while incorporating local details to predict high-density regions. This results in state-of-the-art performance on challenging benchmarks, surpassing previous approaches by a large margin.

Technical Explanation

The paper presents a semi-supervised method based on the mean teacher framework to train a reliable crowd counting model while alleviating the heavy annotation burden. When labeled data is scarce, the model is prone to overfitting to local patches. In such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate.
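The mean teacher framework mentioned above can be sketched in a few lines. This is a minimal illustration of the general technique, not the authors' implementation: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency term penalizes disagreement between the two on unlabeled inputs. The function names, the dictionary representation of weights, and the decay value are all assumptions made for this example.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved toward the student's by an EMA step.

    `teacher` and `student` are plain dicts of name -> float, standing in
    for real network parameters (an assumption for illustration).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def consistency_loss(student_pred, teacher_pred):
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(diffs) / len(diffs)


# Toy usage: one EMA step with a hypothetical decay of 0.9.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)

# Disagreement on two unlabeled predictions (values are made up).
loss = consistency_loss([1.0, 2.0], [1.0, 4.0])
```

In a real training loop the student would be updated by gradient descent on the labeled loss plus the consistency term, and `ema_update` would run after each optimizer step.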

To address this, the researchers propose a more nuanced approach: fostering the model's intrinsic "subitizing" capability. This ability allows the model to accurately estimate the count in a region by leveraging its understanding of the crowd scene, mirroring the human cognitive process. To achieve this goal, the authors apply masking to unlabeled data, guiding the model to make predictions for the masked patches based on holistic cues.
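To make the masking step concrete, here is a hedged sketch of random patch masking on a single-channel image stored as a list of rows. The summary does not state the patch size, mask ratio, or fill value the paper uses, so `patch`, `ratio`, and `fill` below are hypothetical parameters, and `mask_patches` is an illustrative helper rather than a function from the authors' code.

```python
import random


def mask_patches(image, patch, ratio, rng, fill=0.0):
    """Fill a random subset of non-overlapping patches with `fill`.

    Returns a masked copy of `image` and the sorted (row, col) indices of
    the masked patches. The input image is left unchanged.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch, width // patch
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    masked = rng.sample(cells, round(ratio * len(cells)))

    out = [row[:] for row in image]
    for r, c in masked:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                out[y][x] = fill
    return out, sorted(masked)


# Toy usage: an 8x8 image split into four 4x4 patches, half of them masked.
rng = random.Random(0)
image = [[1.0] * 8 for _ in range(8)]
masked_image, masked_cells = mask_patches(image, patch=4, ratio=0.5, rng=rng)
```

One plausible arrangement, assumed here rather than confirmed by the summary, is that the student network receives `masked_image` and is trained to match a prediction made from the unmasked image at the positions listed in `masked_cells`, so that it must rely on the surrounding scene.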

Furthermore, the paper incorporates a fine-grained density classification task to help with feature learning. This general approach can be applied to most existing crowd counting methods as it doesn't have strict structural or loss constraints.
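A density classification target of this kind can be derived from a density map by summing it over patches and assigning each patch count to a bin. The sketch below shows that idea under stated assumptions: the bin edges, the patch size, and the helper names `patch_counts` and `density_class` are hypothetical, and the paper's actual class definitions are not given in this summary.

```python
from bisect import bisect_right


def patch_counts(density, patch):
    """Sum a 2D density map over non-overlapping patch x patch cells."""
    rows, cols = len(density) // patch, len(density[0]) // patch
    return [
        [
            sum(
                density[y][x]
                for y in range(r * patch, (r + 1) * patch)
                for x in range(c * patch, (c + 1) * patch)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def density_class(count, edges):
    """Map a patch count to a class index using ascending bin edges."""
    return bisect_right(edges, count)


# Hypothetical bin edges: four classes from "empty" to "dense".
edges = [0.5, 2.0, 5.0]
density = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 0.0, 2.0, 2.0],
]
counts = patch_counts(density, patch=2)
labels = [[density_class(c, edges) for c in row] for row in counts]
```

The resulting `labels` grid could serve as the target for an auxiliary classification head trained alongside the regression output; finer bin edges give a more fine-grained task.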

The researchers observe that the model trained with their framework exhibits a "subitizing-like" behavior. It accurately predicts low-density regions with only a "glance," while incorporating local details to predict high-density regions. This results in state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF.

Critical Analysis

The paper presents a novel and promising approach to reducing the heavy annotation burden of training reliable crowd counting models. By leveraging the model's "subitizing" capability and incorporating a fine-grained density classification task, the researchers achieve state-of-the-art performance on challenging benchmarks with less labeled data.

However, the paper does not provide a detailed analysis of the model's performance across different crowd density scenarios. It would be interesting to understand how the "subitizing-like" behavior of the model holds up in extreme cases, such as extremely sparse or dense crowds. Additionally, the researchers could explore the transferability of the learned features to other crowd counting datasets or tasks, as highlighted in CrowdDiff and Learning to Count Without Annotations.

Furthermore, the paper could delve deeper into the interpretability of the model's decision-making process, as discussed in FUSS-Free Network. Understanding how the model leverages the holistic cues and local details to arrive at its predictions could provide valuable insights for further improving the model's performance and robustness.

Conclusion

This paper presents a semi-supervised method for training a crowd counting model with less labeled data, leveraging the model's intrinsic "subitizing" capability and incorporating a fine-grained density classification task. The proposed approach achieves state-of-the-art performance on challenging benchmarks, demonstrating its potential to alleviate the heavy annotation burden and make crowd counting models more practical and accurate. Further research on the model's performance in extreme scenarios, transferability, and interpretability could lead to even more robust and versatile crowd counting solutions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

Yifei Qian, Xiaopeng Hong, Zhongliang Guo, Ognjen Arandjelović, Carl R. Donovan

To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.

4/23/2024

Learning Discriminative Features for Crowd Counting

Yuehai Chen, Qingzhong Wang, Jing Yang, Badong Chen, Haoyi Xiong, Shaoyi Du

Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small, and the high-level features extracted by convolutional neural networks are not discriminative enough to represent small objects. To address these problems, we propose a discriminative feature learning framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model's ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from the background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The two proposed modules are plug-and-play, and incorporating them into existing models can potentially boost their performance in these scenarios.

6/19/2024

Multi-modal Crowd Counting via Modal Emulation

Chenhao Wang, Xiaopeng Hong, Zhiheng Ma, Yupeng Wei, Yabin Wang, Xiaopeng Fan

Multi-modal crowd counting is a crucial task that uses multi-modal cues to estimate the number of people in crowded scenes. To overcome the gap between different modalities, we propose a modal emulation-based two-pass multi-modal crowd-counting framework that enables efficient modal emulation, alignment, and fusion. The framework consists of two key components: a multi-modal inference pass and a cross-modal emulation pass. The former utilizes a hybrid cross-modal attention module to extract global and local information and achieve efficient multi-modal fusion. The latter uses attention prompting to coordinate different modalities and enhance multi-modal alignment. We also introduce a modality alignment module that uses an efficient modal consistency loss to align the outputs of the two passes and bridge the semantic gap between modalities. Extensive experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods. Code available at https://github.com/Mr-Monday/Multi-modal-Crowd-Counting-via-Modal-Emulation.

7/30/2024

Robust Zero-Shot Crowd Counting and Localization With Adaptive Resolution SAM

Jia Wan, Qiangqiang Wu, Wei Lin, Antoni B. Chan

Existing crowd counting models require extensive training data, which is time-consuming to annotate. To tackle this issue, we propose a simple yet effective crowd counting method by utilizing the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segment Anything Model (SAM), to generate pseudo-labels for training crowd counting models. However, our initial investigation reveals that SEEM's performance in dense crowd scenes is limited, primarily due to the omission of many persons in high-density areas. To overcome this limitation, we propose an adaptive resolution SEEM to handle the scale variations, occlusions, and overlapping of people within crowd scenes. Alongside this, we introduce a robust localization method, based on Gaussian Mixture Models, for predicting the head positions in the predicted people masks. Given the mask and point pseudo-labels, we propose a robust loss function, which is designed to exclude uncertain regions based on SEEM's predictions, thereby enhancing the training process of the counting networks. Finally, we propose an iterative method for generating pseudo-labels. This method aims at improving the quality of the segmentation masks by identifying more tiny persons in high-density regions, which are often missed in the first pseudo-labeling stage. Overall, our proposed method achieves the best unsupervised performance in crowd counting, while also achieving results comparable to some supervised methods. This makes it a highly effective and versatile tool for crowd counting, especially in situations where labeled data is not available.

8/16/2024