Robust Zero-Shot Crowd Counting and Localization With Adaptive Resolution SAM

Read original: arXiv:2402.17514 - Published 8/16/2024 by Jia Wan, Qiangqiang Wu, Wei Lin, Antoni B. Chan

Overview

Presents a zero-shot crowd counting and localization method built on an adaptive resolution version of the Segment-Everything-Everywhere Model (SEEM), itself an adaptation of the Segment Anything Model (SAM)
Addresses the scale variations, occlusions, and overlapping people that cause SEEM to miss many persons in high-density regions
Localizes heads inside the predicted person masks with Gaussian Mixture Models, and uses the resulting mask and point pseudo-labels to train a counting network

Plain English Explanation

This research paper introduces a new method for counting and locating people in crowded environments, such as busy streets or events, without hand-labeled training data. The key idea is to let a general-purpose segmentation model, the Segment-Everything-Everywhere Model (SEEM), mark the people in unlabeled images, and then to train a crowd counting network on those automatically generated labels (pseudo-labels).

Used as-is, SEEM misses many people in dense regions, where individuals are tiny, occluded, or overlapping. The adaptive resolution version of SEEM is designed to cope with these scale variations, and a Gaussian Mixture Model then estimates a head position inside each predicted person mask. Pseudo-labeling is repeated iteratively so that tiny people missed in the first pass can be found later, and the training loss ignores regions where SEEM's predictions are uncertain.
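
To make the resolution argument concrete, here is a toy sketch of why enlarging an image can recover small people. It is not the paper's procedure: `toy_segmenter` is a hypothetical stand-in for a segmentation model (not SEEM) that simply ignores blobs below a minimum pixel area, and `upsample_nearest` and `segment_at_scale` are illustrative names introduced here.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]        # binary image: 1 = person pixel, 0 = background
Point = Tuple[float, float]   # (row, col) in original-image coordinates


def upsample_nearest(img: Grid, factor: int) -> Grid:
    """Enlarge a grid by an integer factor using nearest-neighbour repetition."""
    out: Grid = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def toy_segmenter(img: Grid, min_area: int = 4) -> List[Point]:
    """Hypothetical stand-in for a segmentation model: returns centroids of
    4-connected blobs, but misses any blob smaller than `min_area` pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    centres: List[Point] = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:  # flood fill one blob
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(pixels) >= min_area:
                centres.append((sum(p[0] for p in pixels) / len(pixels),
                                sum(p[1] for p in pixels) / len(pixels)))
    return centres


def segment_at_scale(img: Grid, factor: int,
                     segment_fn: Callable[[Grid], List[Point]] = toy_segmenter) -> List[Point]:
    """Run the segmenter on an enlarged copy and map centroids back to the
    original pixel grid (pixel centres are at integer coordinates)."""
    found = segment_fn(upsample_nearest(img, factor))
    return [((y + 0.5) / factor - 0.5, (x + 0.5) / factor - 0.5) for y, x in found]


if __name__ == "__main__":
    image = [[0] * 6 for _ in range(6)]
    for r, c in ((0, 0), (0, 1), (1, 0), (1, 1)):  # one large nearby person
        image[r][c] = 1
    image[4][4] = 1                                 # one tiny distant person
    print(len(segment_at_scale(image, 1)))  # 1: the tiny person is missed
    print(len(segment_at_scale(image, 2)))  # 2: both are found after enlarging
```

In this toy setting the one-pixel person falls below the detector's size threshold at the original scale and is only picked up once the image is enlarged, which is the intuition behind processing dense regions at a higher resolution.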

Overall, this unsupervised technique supports crowd analysis tasks such as monitoring events or managing public spaces, without requiring manual labeling of training data.

Technical Explanation

The paper uses an adaptive resolution version of the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segment Anything Model (SAM), to generate pseudo-labels for training crowd counting networks. The key components are:

Adaptive Resolution SEEM: Plain SEEM omits many persons in high-density areas. The adaptive resolution variant is proposed to handle the scale variations, occlusions, and overlapping of people within crowd scenes.
GMM-Based Localization: A robust localization method based on Gaussian Mixture Models predicts head positions inside the predicted person masks, turning each mask into a point pseudo-label.
Robust Loss: Given the mask and point pseudo-labels, the counting network is trained with a loss that excludes regions that are uncertain according to SEEM's predictions.
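
The paper's abstract mentions a localization step based on Gaussian Mixture Models that predicts head positions inside predicted person masks. The sketch below shows one way such a step could look, under stated assumptions: it fits a two-component mixture with spherical covariances to a mask's pixel coordinates using plain EM, and takes the component highest in the image as the head estimate. The number of components, the spherical covariance, the initialisation, and the "topmost component is the head" rule are all assumptions made for illustration, not details taken from the paper; `fit_spherical_gmm` and `estimate_head` are hypothetical names.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (row, col) coordinate of a pixel inside one person mask


def fit_spherical_gmm(points: List[Point], k: int = 2, iters: int = 50):
    """Fit a k-component Gaussian mixture with spherical covariances via EM."""
    pts = sorted(points)  # sort by row, then column
    n = len(pts)
    if n < k:
        raise ValueError("need at least k points")
    # Initialise each mean from one contiguous chunk of the row-sorted pixels.
    means = []
    for j in range(k):
        chunk = pts[j * n // k:(j + 1) * n // k]
        means.append((sum(p[0] for p in chunk) / len(chunk),
                      sum(p[1] for p in chunk) / len(chunk)))
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each pixel.
        resp = []
        for y, x in pts:
            dens = []
            for j in range(k):
                d2 = (y - means[j][0]) ** 2 + (x - means[j][1]) ** 2
                dens.append(weights[j] * math.exp(-d2 / (2 * variances[j]))
                            / (2 * math.pi * variances[j]))
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # component has lost all support; leave it unchanged
            my = sum(r[j] * p[0] for r, p in zip(resp, pts)) / nj
            mx = sum(r[j] * p[1] for r, p in zip(resp, pts)) / nj
            var = sum(r[j] * ((p[0] - my) ** 2 + (p[1] - mx) ** 2)
                      for r, p in zip(resp, pts)) / (2 * nj)
            means[j] = (my, mx)
            variances[j] = max(var, 1e-3)  # floor keeps the density finite
            weights[j] = nj / n
    return means, variances, weights


def estimate_head(mask_pixels: List[Point]) -> Point:
    """Assumed rule: the mixture component with the smallest row index
    (highest in the image) approximates the head position."""
    means, _, _ = fit_spherical_gmm(mask_pixels, k=2)
    return min(means)
```

For a mask shaped like a small head region on top of a larger body region, `estimate_head` returns a point in the upper part of the mask. A production version would more likely rely on an established mixture-model implementation than on hand-written EM.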

The counting network is trained without manual annotations: its only supervision is mask and point pseudo-labels produced by SEEM. Pseudo-labels are generated iteratively, so that later rounds improve the segmentation masks by identifying tiny persons in high-density regions that the first pseudo-labeling stage often misses.
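
The paper's abstract describes a robust loss designed to exclude uncertain regions when training on pseudo-labels. A minimal sketch of that idea follows, assuming a per-pixel squared-error loss between a predicted density map and a pseudo-label density map; the choice of squared error, the boolean `uncertain` mask, and the function name `masked_mse` are assumptions for illustration, not the paper's exact formulation.

```python
from typing import List

Map = List[List[float]]     # density map (predicted or pseudo-label)
Mask = List[List[bool]]     # True where the pseudo-label is considered uncertain


def masked_mse(pred: Map, target: Map, uncertain: Mask) -> float:
    """Mean squared error over pixels that are NOT flagged as uncertain.

    Pixels inside uncertain regions contribute nothing to the loss, so errors
    in unreliable pseudo-labels do not drive the training signal.
    """
    total, count = 0.0, 0
    for p_row, t_row, u_row in zip(pred, target, uncertain):
        for p, t, u in zip(p_row, t_row, u_row):
            if u:
                continue  # skip pixels whose pseudo-label cannot be trusted
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    pred = [[1.0, 0.0], [0.5, 2.0]]
    target = [[1.0, 0.0], [0.0, 0.0]]
    nothing_masked = [[False, False], [False, False]]
    corner_masked = [[False, False], [False, True]]
    print(masked_mse(pred, target, nothing_masked))  # 1.0625
    print(masked_mse(pred, target, corner_masked))   # 0.25 / 3
```

In the usage example, the bottom-right pixel carries a large disagreement between prediction and pseudo-label. Flagging it as uncertain removes it from the average, so a possibly wrong pseudo-label there no longer dominates the loss.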

Critical Analysis

The paper presents a promising approach to unsupervised crowd counting and localization. The key strengths are:

Adaptive Resolution: Adapting SEEM's resolution targets the model's main failure mode in crowds, the omission of small, occluded people in high-density regions.
Robust Pseudo-Labels: GMM-based head localization, a loss that excludes uncertain regions, and iterative pseudo-label generation all reduce the impact of errors in the automatically generated labels.
Unsupervised Learning: Training on pseudo-labels removes the need for costly manual annotation, and the abstract reports the best unsupervised counting performance, comparable to some supervised methods.

However, some limitations and areas for further research remain:

Generalization: The performance of the model on diverse crowd scenes, such as different environments, demographic compositions, or camera viewpoints, should be further evaluated.
Real-Time Performance: The computational efficiency of the model should be assessed to enable real-time applications, such as live event monitoring or surveillance.
Ethical Considerations: The use of crowd counting and localization techniques raises important ethical questions regarding privacy, bias, and potential misuse that should be carefully considered.

Conclusion

This research presents a robust zero-shot crowd counting and localization method built on an adaptive resolution version of SEEM, a SAM-based segmentation model. Its key components (adaptive resolution segmentation, GMM-based head localization, a loss that excludes uncertain regions, and iterative pseudo-labeling) address the challenges of crowd analysis in dense, real-world scenes without labeled data.

While further research is needed to address the limitations and ethical considerations, this work represents an important step forward in developing more efficient and scalable crowd monitoring solutions that can benefit applications in urban planning, event management, and public safety.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Robust Zero-Shot Crowd Counting and Localization With Adaptive Resolution SAM

Jia Wan, Qiangqiang Wu, Wei Lin, Antoni B. Chan

Existing crowd counting models require extensive training data, which is time-consuming to annotate. To tackle this issue, we propose a simple yet effective crowd counting method by utilizing the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segment Anything Model (SAM), to generate pseudo-labels for training crowd counting models. However, our initial investigation reveals that SEEM's performance in dense crowd scenes is limited, primarily due to the omission of many persons in high-density areas. To overcome this limitation, we propose an adaptive resolution SEEM to handle the scale variations, occlusions, and overlapping of people within crowd scenes. Alongside this, we introduce a robust localization method, based on Gaussian Mixture Models, for predicting the head positions in the predicted people masks. Given the mask and point pseudo-labels, we propose a robust loss function, which is designed to exclude uncertain regions based on SEEM's predictions, thereby enhancing the training process of the counting networks. Finally, we propose an iterative method for generating pseudo-labels. This method aims at improving the quality of the segmentation masks by identifying more tiny persons in high-density regions, which are often missed in the first pseudo-labeling stage. Overall, our proposed method achieves the best unsupervised performance in crowd counting, while also being comparable to some supervised methods. This makes it a highly effective and versatile tool for crowd counting, especially in situations where labeled data is not available.

8/16/2024

Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

Zhi Cai, Yingjie Gao, Yaoyan Zheng, Nan Zhou, Di Huang

In computer vision, object detection is an important task that finds its application in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Recently, the Segment Anything Model (SAM) has been proposed as a powerful zero-shot segmenter, offering a novel approach to instance segmentation tasks. However, the accuracy and efficiency of SAM and its variants are often compromised when handling objects in crowded and occluded scenes. In this paper, we introduce Crowd-SAM, a SAM-based framework designed to enhance SAM's performance in crowded and occluded scenes at the cost of few learnable parameters and minimal labeled images. We introduce an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net), enhancing mask selection and accuracy in crowded scenes. Despite its simplicity, Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks including CrowdHuman and CityPersons. Our code is available at https://github.com/FelixCaae/CrowdSAM.

7/22/2024

Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

Yifei Qian, Xiaopeng Hong, Zhongliang Guo, Ognjen Arandjelović, Carl R. Donovan

To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.

4/23/2024

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

Danni Yang, Jiayi Ji, Yiwei Ma, Tianyu Guo, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

In this paper, we introduce SemiRES, a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform referring expression segmentation (RES). A significant hurdle in applying semi-supervised techniques to RES is the prevalence of noisy pseudo-labels, particularly at the boundaries of objects. SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarcation, to improve the accuracy of these pseudo-labels. Within SemiRES, we offer two alternative matching strategies: IoU-based Optimal Matching (IOM) and Composite Parts Integration (CPI). These strategies are designed to extract the most accurate masks from SAM's output, thus guiding the training of the student model with enhanced precision. In instances where a precise mask cannot be matched from the available candidates, we develop the Pixel-Wise Adjustment (PWA) strategy, guiding the student model's training directly by the pseudo-labels. Extensive experiments on three RES benchmarks (RefCOCO, RefCOCO+, and G-Ref) reveal its superior performance compared to fully supervised methods. Remarkably, with only 1% labeled data, our SemiRES outperforms the supervised baseline by a large margin, e.g. +18.64% gains on RefCOCO val set. The project code is available at https://github.com/nini0919/SemiRES.

6/4/2024