Crowd Counting
Incorporating Side Information by Adaptive Convolution
Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information.
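The core idea of adaptive convolution — generating filter weights from the side information — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the tiny `side_to_weights` MLP, the single-channel setup, and all shapes are assumptions for the example.

```python
import numpy as np

def side_to_weights(side_info, W1, W2, k=3):
    """Hypothetical MLP: map side information (e.g. camera angle, height)
    to the weights of a single k x k convolution filter."""
    h = np.tanh(side_info @ W1)          # hidden layer
    return (h @ W2).reshape(k, k)        # filter adapted to the scene context

def adaptive_conv2d(image, side_info, W1, W2, k=3):
    """Convolve `image` with a filter generated from `side_info` (valid mode)."""
    f = side_to_weights(side_info, W1, W2, k)
    H, W = image.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * f)
    return out

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16))            # side-info dim (2) -> hidden (16)
W2 = rng.normal(size=(16, 9))            # hidden -> 3x3 filter weights
image = rng.normal(size=(8, 8))
side = np.array([0.3, 2.5])              # e.g. camera angle (rad), height (m)
out = adaptive_conv2d(image, side, W1, W2)
print(out.shape)                         # (6, 6)
```

The key point is that the same image convolved under different side information yields different responses, because the filter itself changes with the scene context.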
CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Wang, Zhiqiang, Feng, Pengbin, Lin, Yanbin, Cai, Shuzhang, Bian, Zongao, Yan, Jinghua, Zhu, Xingquan
1st Zhiqiang Wang, Florida Atlantic University, Boca Raton, USA (zwang2022@fau.edu); 2nd Pengbin Feng, University of Southern California, Los Angeles, USA (fengpengbin.apply@gmail.com)
Abstract -- We propose CrowdVLM-R1, which expands the R1 base model for accurate crowd counting, using a novel framework that integrates the fuzzy group relative policy optimization reward function (FGRPR) to enhance learning efficiency. Unlike the conventional binary (0/1) accuracy reward, our fuzzy reward model, FGRPR, which contains both format and precision rewards, provides nuanced incentives that encourage the R1 model to adjust its policy towards precise outputs. Supervised fine-tuning (SFT) is also integrated so that the CrowdVLM-R1 model can learn from a handful of inputs and perform both in-domain and out-of-domain counting. Experimental results demonstrate that GRPO with a standard binary accuracy reward underperforms compared to SFT. In contrast, FGRPR, applied to Qwen2.5-VL-(3B/7B), surpasses all baseline models, including GPT-4o, LLaMA2-70B, and SFT, on five in-domain datasets. For out-of-domain datasets, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical.
I. INTRODUCTION
Recently, DeepSeek R1 [1] has drawn much attention among advances in large language models (LLMs), as it demonstrates how reinforcement learning (RL) can be the primary driver of reasoning.
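The contrast between a binary accuracy reward and a fuzzy one can be sketched as below. This is a hypothetical illustration of the idea, assuming the combination weights, the `<answer>` tag format, and the linear decay in relative error; the paper's exact FGRPR formulation may differ.

```python
import re

def fuzzy_count_reward(response: str, target: int,
                       w_format: float = 0.2, w_precision: float = 0.8) -> float:
    """Sketch of an FGRPR-style reward: a format reward for emitting the
    answer in the expected tags, plus a fuzzy precision reward that decays
    with relative error instead of being a binary 0/1 hit."""
    m = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    if not m:
        return 0.0                                  # no format reward, no answer
    pred = int(m.group(1))
    rel_err = abs(pred - target) / max(target, 1)
    precision_reward = max(0.0, 1.0 - rel_err)      # 1 when exact, fades to 0
    return w_format * 1.0 + w_precision * precision_reward

# A near miss still earns a substantial reward; a binary reward would give 0.
print(fuzzy_count_reward("<answer>95</answer>", 100))   # high (close guess)
print(fuzzy_count_reward("<answer>500</answer>", 100))  # format reward only
print(fuzzy_count_reward("about 95", 100))              # 0.0 (bad format)
```

Because nearby answers receive graded credit, the policy gradient receives a learning signal even before the model produces exact counts — the nuance the abstract attributes to FGRPR over binary rewards.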
Distribution Matching for Crowd Counting Supplementary Material
We analyze DM-Count and investigate the robustness of different methods to noisy annotations. Assume that for all x ∈ D and g ∈ G we have |g(x)| ≤ B. We propose the following five lemmas, which are essential for proving the proposed theorems. Lemmas A, B, C, and D give the Lipschitz constants of different loss functions. Consider the dual form of Eq. (15): W(µ, ν) = max_α … The first inequality in Eq. (20) is achieved because … The second equality in Eq. (20) is achieved because … We restate Theorem 1 in the main paper below.
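The dual of Eq. (15) is truncated in this excerpt. For reference, the standard Kantorovich dual of discrete optimal transport with cost matrix c — presumably the form the supplementary material invokes, though the exact constraint set there may differ — reads:

```latex
W(\mu, \nu) \;=\; \max_{\alpha, \beta}\;
  \langle \alpha, \mu \rangle + \langle \beta, \nu \rangle
\quad \text{s.t.} \quad
  \alpha_i + \beta_j \le c_{ij} \;\; \forall i, j .
```

The boundedness assumption |g(x)| ≤ B is what lets the dual potentials be restricted to a bounded set, which is how Lipschitz constants for the loss terms typically enter such robustness proofs.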
Count2Density: Crowd Density Estimation without Location-level Annotations
Litrico, Mattia, Chen, Feng, Pound, Michael, Tsaftaris, Sotirios A, Battiato, Sebastiano, Giuffrida, Mario Valerio
Crowd density estimation is a well-known computer vision task aimed at estimating the density distribution of people in an image. The main challenge in this domain is the reliance on fine-grained location-level annotations (i.e., points placed on top of each individual) to train deep networks. Collecting such detailed annotations is tedious and time-consuming, and poses a significant barrier to scalability for real-world applications. To alleviate this burden, we present Count2Density: a novel pipeline designed to predict meaningful density maps containing quantitative spatial information using only count-level annotations (i.e., the total number of people) during training. To achieve this, Count2Density generates pseudo-density maps leveraging past predictions stored in a Historical Map Bank, thereby reducing confirmation bias. This bank is initialised using an unsupervised saliency estimator to provide an initial spatial prior and is iteratively updated with an EMA of predicted density maps. These pseudo-density maps are obtained by sampling locations from estimated crowd areas using a hypergeometric distribution, with the number of samplings determined by the count-level annotations. To further enhance the spatial awareness of the model, we add a self-supervised contrastive spatial regulariser to encourage similar feature representations within crowded regions while maximising dissimilarity with background regions. Experimental results demonstrate that our approach significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Additional analyses validate the effectiveness of each individual component of our pipeline, confirming the ability of Count2Density to effectively retrieve spatial information from count-level annotations and enabling accurate subregion counting.
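The pseudo-density-map construction can be sketched as follows. This is a simplified illustration under stated assumptions: locations are drawn without replacement from an estimated crowd mask and marked with unit impulses, so the map integrates to the count-level annotation. The paper's actual hypergeometric sampling scheme, Historical Map Bank, and the Gaussian smoothing step are not reproduced here.

```python
import numpy as np

def pseudo_density_map(crowd_mask, count, rng):
    """Sketch: sample `count` pixel locations (without replacement) from the
    estimated crowd region and place a unit impulse at each, so the resulting
    map sums to the count-level annotation. A Gaussian blur would normally
    be applied afterwards to obtain a smooth density map."""
    H, W = crowd_mask.shape
    probs = crowd_mask.ravel().astype(float)
    probs /= probs.sum()                         # sample only inside crowd area
    idx = rng.choice(H * W, size=count, replace=False, p=probs)
    dmap = np.zeros(H * W)
    dmap[idx] = 1.0
    return dmap.reshape(H, W)

rng = np.random.default_rng(0)
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                           # toy estimated "crowd area"
dmap = pseudo_density_map(mask, count=10, rng=rng)
print(dmap.sum())                                # 10.0 -- matches the annotation
```

Training against such a map gives the network a spatial target consistent with the known total count, which is the property Count2Density exploits in place of point annotations.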
Crowd Scene Analysis using Deep Learning Techniques
With recent advances in deep learning and computer vision, crowd scene analysis has gained significant attention. The UN predicts world population growth of 0.82% by 2035, with more people moving to cities for better lifestyles and attending social events such as concerts, shopping, political gatherings, and educational conferences. Crowd scene analysis is crucial for ensuring a safe environment in public spaces, but manual monitoring is laborious and risks missing important information, so an automatic solution is needed for efficient real-life applications. Our research focuses on two main applications of crowd scene analysis: crowd counting and anomaly detection.
A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals
Cui, Zhe, Li, Yuli, Tran, Le-Nam
Current crowd-counting models often rely on single-modal inputs, such as visual images or wireless signal data, which can result in significant information loss and suboptimal recognition performance. To address these shortcomings, we propose TransFusion, a novel multimodal fusion-based crowd-counting model that integrates Channel State Information (CSI) with image data. By leveraging the powerful capabilities of Transformer networks, TransFusion effectively combines these two distinct data modalities, enabling the capture of comprehensive global contextual information that is critical for accurate crowd estimation. However, while Transformers are well capable of capturing global features, they potentially fail to identify finer-grained, local details essential for precise crowd counting. To mitigate this, we incorporate Convolutional Neural Networks (CNNs) into the model architecture, enhancing its ability to extract detailed local features that complement the global context provided by the Transformer. Extensive experimental evaluations demonstrate that TransFusion achieves high accuracy with minimal counting errors while maintaining superior efficiency.
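The fusion idea can be sketched as single-head cross-attention in which image tokens query the CSI tokens. This toy numpy sketch is not the TransFusion architecture: the token counts, dimensions, random projections, and the mean-pool count head are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, csi_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: each image token (query) attends over the
    CSI tokens (keys/values), enriching every visual location with
    wireless-signal context."""
    Q, K, V = img_tokens @ Wq, csi_tokens @ Wk, csi_tokens @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (num_img, num_csi) weights
    return A @ V                                  # fused visual features

rng = np.random.default_rng(0)
d = 32
img_tokens = rng.normal(size=(49, d))   # e.g. a CNN feature map flattened 7x7
csi_tokens = rng.normal(size=(16, d))   # e.g. embedded CSI subcarrier frames
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(img_tokens, csi_tokens, Wq, Wk, Wv)
w_count = rng.normal(size=d)
count = float(fused.mean(axis=0) @ w_count)       # toy regression head
print(fused.shape)                                # (49, 32)
```

In the abstract's terms, the CNN supplies the local image tokens and the attention step supplies the global, cross-modal context; a real model would stack several such layers and train the count head on annotated data.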