Pattern Recognition
Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance
Liu, Hui, Wang, Wenya, Chen, Kecheng, Liu, Jie, Liu, Yibing, Qin, Tiexin, He, Peisong, Jiang, Xinghao, Li, Haoliang
In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and the inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts used in human image recognition as latent variables and formulates this task by summing across potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts, emphasizing inter-class differences. We further propose three heuristic approaches involving Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
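The summation over latent concepts described in this abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name `concept_marginal_posterior`, the toy numbers, and the assumption that a finite set of sampled concepts with normalized weights stands in for the infinite concept space are all hypothetical.

```python
import numpy as np

def concept_marginal_posterior(likelihood, prior):
    """Approximate p(y | x) = sum_c p(y | x, c) * p(c) over sampled concepts.

    likelihood: array of shape (num_concepts, num_classes); row c holds
                p(y | x, c) for one test image (each row sums to 1).
    prior:      array of shape (num_concepts,), non-negative weights p(c).
    """
    likelihood = np.asarray(likelihood, dtype=float)
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()      # normalize the concept weights
    return prior @ likelihood        # weighted sum over concepts

# Toy example (made-up numbers): 3 sampled concepts, 2 candidate classes.
likelihood = np.array([[0.9, 0.1],
                       [0.6, 0.4],
                       [0.2, 0.8]])
prior = np.array([0.5, 0.3, 0.2])

posterior = concept_marginal_posterior(likelihood, prior)
predicted_class = int(np.argmax(posterior))
```

In this toy setting the class posterior is a convex combination of the per-concept predictions, so a concept with a larger weight has more influence on the final decision.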
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
Maracani, Andrea, Ozkan, Savas, Cho, Sijun, Kim, Hyowon, Noh, Eunchung, Min, Jeongwon, Min, Cho Jung, Park, Dookun, Ozay, Mete
Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
Multi-Modal Gesture Recognition from Video and Surgical Tool Pose Information via Motion Invariants
Atoum, Jumanh, Johnston, Garrison L. H., Simaan, Nabil, Wu, Jie Ying
Recognizing surgical gestures in real-time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. The current robotic surgical systems provide us with rich multi-modal data such as video and kinematics. While some recent works in multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representation. Therefore, we propose combining motion invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between different data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion invariant signals coupled with position are better representations of gesture motion compared to traditional position and quaternion representations. Our results highlight the need for geometric-aware modeling of kinematics for gesture recognition.
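The two motion invariants named in this abstract, curvature and torsion, have standard closed forms for a 3D curve: κ = |r' × r''| / |r'|³ and τ = (r' × r'') · r''' / |r' × r''|². The sketch below estimates them from a sampled tool-tip trajectory with finite differences. It is only an illustration under assumptions: the function name `curvature_torsion`, the uniform sampling interval `dt`, and the helix test curve are hypothetical, and the paper may compute its invariants differently.

```python
import numpy as np

def curvature_torsion(positions, dt):
    """Estimate curvature and torsion along a sampled 3D trajectory.

    positions: array of shape (num_samples, 3), uniformly sampled in time.
    dt:        sampling interval.
    """
    positions = np.asarray(positions, dtype=float)
    d1 = np.gradient(positions, dt, axis=0)   # velocity r'
    d2 = np.gradient(d1, dt, axis=0)          # acceleration r''
    d3 = np.gradient(d2, dt, axis=0)          # jerk r'''
    cross = np.cross(d1, d2)
    cross_norm = np.linalg.norm(cross, axis=1)
    speed = np.linalg.norm(d1, axis=1)
    eps = 1e-12                               # guards against division by zero
    curvature = cross_norm / np.maximum(speed ** 3, eps)
    torsion = np.einsum("ij,ij->i", cross, d3) / np.maximum(cross_norm ** 2, eps)
    return curvature, torsion

# Sanity check on a helix (cos t, sin t, t), whose curvature and torsion
# are both 1/2 analytically.
dt = 0.01
t = np.arange(0.0, 10.0, dt)
helix = np.stack([np.cos(t), np.sin(t), t], axis=1)
curvature, torsion = curvature_torsion(helix, dt)
```

Both quantities are unchanged by rigid rotations and translations of the trajectory, which is what makes them independent of the reference frame. Finite differences amplify sensor noise, so real kinematic data would likely need smoothing first.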
EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera
Wang, Luming, Shi, Hao, Yin, Xiaoting, Yang, Kailun, Wang, Kaiwei
Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high interpersonal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets.
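The idea of a parameter-free shift along the temporal and bin dimensions can be sketched with a generic channel-shift operation. The abstract does not specify the exact design of the module, so everything here is an assumption for illustration: the function name `shift_along_axis`, the `(time, bins, channels)` layout, the `fold_div` fraction, and the zero padding at the boundaries.

```python
import numpy as np

def shift_along_axis(features, axis, fold_div=8):
    """Shift a fraction of channels by one step along `axis`, with no learned weights.

    features: array whose last axis is the channel axis, e.g. (time, bins, channels).
    axis:     the non-channel axis to shift along (0 = time, 1 = bins here).
    fold_div: 1/fold_div of the channels shift forward, 1/fold_div shift backward,
              and the rest stay in place.
    """
    features = np.asarray(features)
    out = np.zeros_like(features)
    fold = features.shape[-1] // fold_div
    src = np.moveaxis(features, axis, 0)   # view with the shifted axis first
    dst = np.moveaxis(out, axis, 0)        # view into `out`, so writes land in `out`
    dst[1:, ..., :fold] = src[:-1, ..., :fold]                   # shift forward
    dst[:-1, ..., fold:2 * fold] = src[1:, ..., fold:2 * fold]   # shift backward
    dst[..., 2 * fold:] = src[..., 2 * fold:]                    # untouched channels
    return out

# Toy tensor: 3 time steps, 2 event bins, 8 channels.
x = np.arange(48, dtype=float).reshape(3, 2, 8)
shifted_t = shift_along_axis(x, axis=0)           # mix information across time
shifted_tb = shift_along_axis(shifted_t, axis=1)  # then across event bins
```

Because the operation only moves existing values, it adds no parameters; neighboring time steps or bins exchange information through the shifted channels when a later layer mixes channels.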
Enhancing Hand Palm Motion Gesture Recognition by Eliminating Reference Frame Bias via Frame-Invariant Similarity Measures
Verduyn, Arno, Vochten, Maxim, De Schutter, Joris
The ability of robots to recognize human gestures facilitates natural and accessible human-robot collaboration. However, most work in gesture recognition remains rooted in reference frame-dependent representations. This poses a challenge when reference frames vary due to different work cell layouts, imprecise frame calibrations, or other environmental changes. This paper investigates the use of invariant trajectory descriptors for robust hand palm motion gesture recognition under reference frame changes. First, a novel dataset of recorded Hand Palm Motion (HPM) gestures is introduced. The motion gestures in this dataset were specifically designed to be distinguishable without dependence on specific reference frames or directional cues. Afterwards, multiple invariant trajectory descriptor approaches are benchmarked to assess how their performance generalizes to this novel HPM dataset. After this offline benchmarking, the best scoring approach is validated for online recognition by developing a real-time Proof of Concept (PoC). In this PoC, hand palm motion gestures are used to control the real-time movement of a manipulator arm. The PoC demonstrates high recognition reliability in real-time operation, achieving an $F_1$-score of 92.3%. This work demonstrates the effectiveness of the invariant descriptor approach as a standalone solution. Moreover, we believe that the invariant descriptor approach can also be utilized within other state-of-the-art pattern recognition and learning systems to improve their robustness against reference frame variations.
From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences
Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.
Moss: Proxy Model-based Full-Weight Aggregation in Federated Learning with Heterogeneous Models
Cai, Yifeng, Zhang, Ziqi, Li, Ding, Guo, Yao, Chen, Xiangqun
Modern Federated Learning (FL) has become increasingly essential for handling highly heterogeneous mobile devices. Current approaches adopt a partial model aggregation paradigm that leads to sub-optimal model accuracy and higher training overhead. In this paper, we challenge the prevailing notion of partial-model aggregation and propose a novel "full-weight aggregation" method named Moss, which aggregates all weights within heterogeneous models to preserve comprehensive knowledge. Evaluation across various applications demonstrates that Moss significantly accelerates training, reduces on-device training time and energy consumption, enhances accuracy, and minimizes network bandwidth utilization when compared to state-of-the-art baselines.
Event-Driven Implementation of a Physical Reservoir Computing Framework for superficial EMG-based Gesture Recognition
Ding, Yuqi, Donati, Elisa, Li, Haobo, Heidari, Hadi
Wearable health devices create a strong demand for real-time biomedical signal processing. However, traditional methods often require transmitting the data collected on edge devices to a centralized processing unit with substantial computational resources. Neuromorphic computing is an emerging field that seeks to design specialized hardware for computing systems inspired by the structure, function, and dynamics of the human brain, offering significant advantages in latency and power consumption. This paper explores a novel neuromorphic implementation approach for gesture recognition by extracting spatiotemporal spiking information from surface electromyography (sEMG) data in an event-driven manner. The network implements a simple-structured and hardware-friendly Physical Reservoir Computing (PRC) framework called Rotating Neuron Reservoir (RNR) within the domain of Spiking Neural Networks (SNNs). The spiking RNR (sRNR) is a promising route to compact embedded wearable systems, enabling low-latency, real-time processing directly at the sensor level. The proposed system was validated on an open-access large-scale sEMG database and achieved average classification accuracies of 74.6% and 80.3% using a classical machine learning classifier and a delta learning rule algorithm, respectively. Since the delta learning rule could be fully spiking and implementable on neuromorphic chips, the proposed gesture recognition system demonstrates the potential for near-sensor low-latency processing.
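The delta learning rule mentioned in this abstract is the classical Widrow-Hoff update, in which readout weights move in proportion to the prediction error. The sketch below trains a linear readout on toy feature vectors that stand in for reservoir states. It is a non-spiking illustration under assumptions: the function name `train_delta_rule`, the learning rate, the epoch count, and the data are hypothetical and not taken from the paper.

```python
import numpy as np

def train_delta_rule(states, targets, lr=0.1, epochs=200):
    """Train a linear readout with the delta rule: W += lr * outer(x, t - x @ W).

    states:  array of shape (num_samples, num_features), e.g. reservoir states.
    targets: array of shape (num_samples, num_classes), one-hot class labels.
    """
    states = np.asarray(states, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = np.zeros((states.shape[1], targets.shape[1]))
    for _ in range(epochs):
        for x, t in zip(states, targets):
            error = t - x @ weights             # per-class prediction error
            weights += lr * np.outer(x, error)  # move weights along the error
    return weights

# Toy data (made up): two features, two gesture classes.
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
labels = np.array([0, 1, 0, 1])
targets = np.eye(2)[labels]                     # one-hot encoding

weights = train_delta_rule(states, targets)
predictions = np.argmax(states @ weights, axis=1)
```

Each update uses only the current input and its error, so the rule is local and online, which is one reason it is often considered a good fit for on-chip learning.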
Robustness of Generalized Median Computation for Consensus Learning in Arbitrary Spaces
Nienkötter, Andreas, Vega-Pons, Sandro, Jiang, Xiaoyi
Robustness against outliers is an important topic and has been formally studied for a variety of problems in machine learning and computer vision. Generalized median computation is a special instance of consensus learning and a common approach to finding prototypes. Related research can be found in numerous problem domains with a broad range of applications. So far, however, the robustness of the generalized median has only been studied in a few specific spaces. To our knowledge, there is no robustness characterization in a general setting, i.e. for arbitrary spaces. We address this open issue in our work. A breakdown point of at least 0.5 is proved for the generalized median with metric distance functions in general. We also study the detailed behavior in the case of outliers from different perspectives. In addition, we present robustness results for weighted generalized median computation and non-metric distance functions. Given the importance of robustness, our work contributes to closing a gap in the literature. The presented results have general impact and applicability, e.g. providing a deeper understanding of generalized median computation and practical guidance to avoid non-robust computation.
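The robustness property discussed in this abstract can be seen in a small example. The generalized median minimizes the sum of distances to all input objects; the sketch below uses the set median, a common simplification that restricts the candidates to the input objects themselves. The function name `set_median` and the toy data are hypothetical, and the example illustrates the behavior rather than reproducing the paper's proof.

```python
def set_median(objects, distance):
    """Return the input object with the smallest summed distance to all objects.

    This restricts the generalized median search to the given objects, which
    keeps the example valid for any space with a distance function.
    """
    return min(objects, key=lambda c: sum(distance(c, o) for o in objects))

def absolute_difference(a, b):
    return abs(a - b)

# Five inliers and two extreme outliers (outliers are fewer than half the data).
inliers = [1.0, 2.0, 3.0, 4.0, 5.0]
outliers = [1e6, 1e6]
data = inliers + outliers

median = set_median(data, absolute_difference)
mean = sum(data) / len(data)
```

Here the median stays within the range of the inliers while the arithmetic mean is dragged far away by the two outliers, which is the kind of behavior a breakdown point of at least 0.5 describes.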
LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts
Le, Thanh-Phong, Phan, Trung Le Chi, Nguyen, Nghia Hieu, Van Nguyen, Kiet
Document Visual Question Answering (Document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing number of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the first large-scale document VQA dataset in Vietnamese dedicated to receipts, a document type with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage the embedding layers of language models to operate on layout embeddings, minimizing the use of additional neural modules. Experiments on ReceiptVQA show that our architecture yields promising performance, achieving competitive results compared with strong baselines. Furthermore, through analyzing the experimental results, we found evident patterns showing that encoder-only model architectures are at a considerable disadvantage compared with architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.