Learning Discriminative Features for Crowd Counting

Read original: arXiv:2311.04509 - Published 6/19/2024 by Yuehai Chen, Qingzhong Wang, Jing Yang, Badong Chen, Haoyi Xiong, Shaoyi Du


Overview

Crowd counting models in highly congested areas face challenges with weak localization ability and difficulty differentiating foreground and background, leading to inaccurate estimations.
This is because objects in dense crowds are often small, and high-level features extracted by convolutional neural networks are less effective at representing them.
To address these problems, the paper proposes a "learning discriminative features" framework for crowd counting, consisting of two modules: a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM).

Plain English Explanation

The paper tackles a problem in crowd counting, which is the task of estimating the number of people in a crowded scene. In highly congested areas, current crowd counting models struggle with two main issues:

Weak Localization Ability: The models have difficulty accurately identifying the location of individual people in the crowd. This is because people in dense crowds are often quite small in the image, and the high-level features used by the models are not well-suited for representing these small objects.
Difficulty Differentiating Foreground and Background: The models have trouble distinguishing between the people in the crowd (the foreground) and the background of the scene. This also contributes to inaccurate crowd counts.

To address these challenges, the researchers propose a new framework that includes two key components:

Masked Feature Prediction Module (MPM): This module randomly hides or "masks" parts of the feature map used by the model and then tries to reconstruct the missing information. This helps the model learn better representations of what is present in the crowded scenes, improving its ability to localize individual people.
Supervised Pixel-level Contrastive Learning Module (CLM): This module pulls the features of people in the crowd closer together in the feature space, while pushing the background features further away. This enables the model to better differentiate the people (foreground) from the background, leading to more accurate crowd counts.

The researchers show that incorporating these two modules into existing crowd counting models can boost their performance in dense, cluttered environments. The proposed framework and techniques could also be beneficial for other computer vision tasks, such as object detection, where dealing with crowded or cluttered scenes is a challenge.

Technical Explanation

The paper proposes a "learning discriminative features" framework for crowd counting, which consists of two key components:

Masked Feature Prediction Module (MPM): This module randomly masks or hides parts of the feature map generated by the convolutional neural network (CNN) and then tries to reconstruct the missing information. By learning to predict the masked features, the model is forced to understand what is present in the crowded scenes, which helps improve its ability to localize individual people.
Supervised Pixel-level Contrastive Learning Module (CLM): This module uses a contrastive learning approach to pull the features of people in the crowd closer together in the feature space, while pushing the background features further away. This enables the model to better differentiate the foreground (people) from the background, leading to more accurate crowd counts.
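
The masked feature prediction idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `mask_ratio` parameter, and the neighbour-averaging predictor are all assumptions made for illustration, and a real model would use a learned network to reconstruct the hidden vectors. The sketch only shows the general pattern of hiding feature vectors, predicting them from what remains visible, and scoring the prediction on the hidden positions.

```python
import random


def mask_positions(num_positions, mask_ratio, rng):
    """Choose which feature-map positions to hide (mask_ratio is an assumed knob)."""
    num_masked = max(1, round(num_positions * mask_ratio))
    return set(rng.sample(range(num_positions), num_masked))


def predict_masked(features, masked, height, width):
    """Hypothetical stand-in predictor: fill each hidden vector with the mean
    of its visible 4-neighbours. A real module would use a learned network."""
    dim = len(features[0])
    predicted = {}
    for idx in masked:
        row, col = divmod(idx, width)
        visible = []
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + d_row, col + d_col
            if 0 <= r < height and 0 <= c < width:
                neighbour = r * width + c
                if neighbour not in masked:
                    visible.append(features[neighbour])
        if visible:
            predicted[idx] = [
                sum(vec[d] for vec in visible) / len(visible) for d in range(dim)
            ]
        else:
            # No visible neighbour: fall back to a zero vector.
            predicted[idx] = [0.0] * dim
    return predicted


def reconstruction_loss(features, predicted):
    """Mean squared error computed only over the hidden positions."""
    total, count = 0.0, 0
    for idx, guess in predicted.items():
        for target_value, guess_value in zip(features[idx], guess):
            total += (target_value - guess_value) ** 2
            count += 1
    return total / count


height, width = 4, 4
# Toy feature map: one 2-d vector per position, stored in row-major order.
feature_map = [[float(r), float(c)] for r in range(height) for c in range(width)]
masked = mask_positions(height * width, mask_ratio=0.25, rng=random.Random(0))
predicted = predict_masked(feature_map, masked, height, width)
print("masked positions:", sorted(masked))
print("reconstruction loss:", reconstruction_loss(feature_map, predicted))
```

Computing the loss only on hidden positions is the design choice that matters here: the model gets no credit for copying visible features and must infer what lies in the masked regions from their surroundings.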

The researchers evaluated the proposed framework on several crowd counting datasets, including some with highly congested scenes. They report that incorporating the MPM and CLM modules into existing crowd counting models improves performance in these challenging scenarios.
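
The contrastive module described earlier in this section can likewise be sketched in a few lines. The code below uses a generic InfoNCE-style loss with cosine similarity, which is an assumption rather than the paper's exact formulation. The names `pixel_contrastive_loss`, `is_foreground`, and `temperature` are hypothetical, and the foreground mask is assumed to come from ground-truth annotations, which is what makes the setup supervised.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pixel_contrastive_loss(features, is_foreground, temperature=0.1):
    """InfoNCE-style loss (an assumed formulation): each foreground pixel is an
    anchor, other foreground pixels are positives, background pixels negatives."""
    foreground = [f for f, flag in zip(features, is_foreground) if flag]
    background = [f for f, flag in zip(features, is_foreground) if not flag]
    if len(foreground) < 2 or not background:
        return 0.0  # Nothing to contrast against.
    total = 0.0
    for i, anchor in enumerate(foreground):
        positives = sum(
            math.exp(cosine(anchor, other) / temperature)
            for j, other in enumerate(foreground)
            if j != i
        )
        negatives = sum(
            math.exp(cosine(anchor, other) / temperature) for other in background
        )
        total += -math.log(positives / (positives + negatives))
    return total / len(foreground)


labels = [True, True, False, False]  # Assumed to come from annotations.

# Foreground pixels cluster together and sit far from the background.
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Foreground pixels are scattered and each sits next to a background pixel.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = pixel_contrastive_loss(separated, labels)
loss_mixed = pixel_contrastive_loss(mixed, labels)
print("loss when foreground is well separated:", loss_separated)
print("loss when foreground and background mix:", loss_mixed)
```

Minimizing a loss of this shape rewards features where foreground pixels resemble each other more than they resemble background pixels, which is the pull and push behavior the module is described as providing.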

Critical Analysis

The paper presents a novel and promising approach to addressing the key challenges in crowd counting for highly congested areas. The proposed MPM and CLM modules seem to effectively tackle the issues of weak localization ability and difficulty in differentiating foreground and background.

However, the paper does not provide much discussion on the potential limitations or caveats of the proposed framework. For example, it would be useful to understand how the framework might perform in more diverse or harder-to-annotate datasets, or how sensitive it is to factors like camera angle, occlusion, or environmental conditions.

Additionally, the paper could have explored the broader applicability of the proposed techniques beyond fully supervised crowd counting, such as their potential use in semi-supervised crowd counting or single-domain generalization for crowd counting.

Overall, the research presents a compelling approach to improving crowd counting in dense scenes, and the proposed modules could also prove valuable for representation learning more broadly.

Conclusion

The paper addresses a crucial challenge in crowd counting for highly congested areas, where current models struggle with weak localization ability and difficulty differentiating foreground and background. By proposing a "learning discriminative features" framework with a masked feature prediction module and a supervised pixel-level contrastive learning module, the researchers have developed a promising approach to improving crowd counting accuracy in these challenging scenarios.

The techniques presented in this paper could have broader implications for other computer vision tasks that involve dealing with dense, cluttered environments. As the researchers suggest, incorporating these modules into existing models has the potential to boost performance in a variety of applications beyond crowd counting.


Related Papers


Learning Discriminative Features for Crowd Counting

Yuehai Chen, Qingzhong Wang, Jing Yang, Badong Chen, Haoyi Xiong, Shaoyi Du

Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small and high-level features extracted by convolutional neural networks are less discriminative for representing small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model's ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The two proposed modules are plug-and-play: incorporating them into existing models can potentially boost performance in these scenarios.

6/19/2024


Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

Yifei Qian, Xiaopeng Hong, Zhongliang Guo, Ognjen Arandjelović, Carl R. Donovan

To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.

4/23/2024

SMCL: Saliency Masked Contrastive Learning for Long-tailed Recognition

Sanglee Park, Seung-won Hwang, Jungmin So

Real-world data often follow a long-tailed distribution with a high imbalance in the number of samples between classes. The problem with training from imbalanced data is that some background features, common to all classes, can be unobserved in classes with scarce samples. As a result, this background correlates to biased predictions into "major" classes. In this paper, we propose saliency masked contrastive learning, a new method that uses saliency masking and contrastive learning to mitigate the problem and improve the generalizability of a model. Our key idea is to mask the important part of an image using saliency detection and use contrastive learning to move the masked image towards minor classes in the feature space, so that background features present in the masked image are no longer correlated with the original class. Experimental results show that our method achieves state-of-the-art level performance on benchmark long-tailed datasets.

6/5/2024

Beyond Dropout: Robust Convolutional Neural Networks Based on Local Feature Masking

Yunpeng Gong, Chuangliang Zhang, Yongjie Hou, Lifei Chen, Min Jiang

In contemporary deep learning, where models often grapple with the challenge of simultaneously achieving robustness against adversarial attacks and strong generalization capabilities, this study introduces an innovative Local Feature Masking (LFM) strategy aimed at fortifying the performance of Convolutional Neural Networks (CNNs) on both fronts. During the training phase, we strategically incorporate random feature masking in the shallow layers of CNNs, effectively alleviating overfitting issues, thereby enhancing the model's generalization ability and bolstering its resilience to adversarial attacks. LFM compels the network to adapt by leveraging remaining features to compensate for the absence of certain semantic features, nurturing a more elastic feature learning mechanism. The efficacy of LFM is substantiated through a series of quantitative and qualitative assessments, collectively showcasing a consistent and significant improvement in CNNs' generalization ability and resistance against adversarial attacks, a phenomenon not observed in current and prior methodologies. The seamless integration of LFM into established CNN frameworks underscores its potential to advance both generalization and adversarial robustness within the deep learning paradigm. Through comprehensive experiments, including robust person re-identification baseline generalization experiments and adversarial attack experiments, we demonstrate the substantial enhancements offered by LFM in addressing the aforementioned challenges. This contribution represents a noteworthy stride in advancing robust neural network architectures.

7/19/2024