AITopics | Pattern Recognition

Collaborating Authors

Pattern Recognition

"... the research area that studies the operation and design of systems that recognize patterns in data." It includes statistical methods like discriminant analysis, feature extraction, error estimation, cluster analysis.
– Pattern Recognition Laboratory at Delft University of Technology

News Overviews Instructional Materials AI-Alerts Classics

DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition

Liu, Sheng, Yu, Yiheng, Feng, Yuan, Xu, Min, Jin, Zhelun, Jiang, Yining, Yuan, Tiantian

arXiv.org Artificial IntelligenceJul-8-2025

Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. T o address this, we propose DESign, a novel framework that incorporates Dynamic Context-A ware Convolution (DCAC) and Subnet Regularization Connectionist T emporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. T o address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.

machine learning, pattern recognition, recognition, (19 more...)

arXiv.org Artificial Intelligence

2507.03339

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.48)
Education > Curriculum > Subject-Specific Education (0.44)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.93)
(2 more...)

Add feedback

ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

Wang, Xiao, Jiang, Jingtao, Chen, Qiang, Chen, Lan, Zhu, Lin, Wang, Yaowei, Tian, Yonghong, Tang, Jin

arXiv.org Artificial IntelligenceJul-4-2025

Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.

large language model, machine learning, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2507.022

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Beijing > Beijing (0.04)
Asia > China > Anhui Province > Hefei (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Text-Aware Image Restoration with Diffusion Models

Min, Jaewon, Kim, Jin Hyeon, Cho, Paul Hyunbin, Lee, Jaeeun, Park, Jihye, Park, Minkyu, Kim, Sangpil, Park, Hyunhee, Kim, Seungryong

arXiv.org Artificial IntelligenceJul-4-2025

Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/

artificial intelligence, machine learning, pattern recognition, (14 more...)

arXiv.org Artificial Intelligence

2506.09993

Genre: Research Report (0.63)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.69)

Add feedback

MARVIS: Modality Adaptive Reasoning over VISualizations

Feuer, Benjamin, Purucker, Lennart, Elachqar, Oussama, Hegde, Chinmay

arXiv.org Artificial IntelligenceJul-3-2025

Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16\% on average and approach specialized methods, without exposing personally identifiable information (P.I.I.) or requiring any domain-specific training. We open source our code and datasets at https://github.com/penfever/marvis

large language model, machine learning, pattern recognition, (21 more...)

arXiv.org Artificial Intelligence

2507.01544

Country:

North America > United States > Wisconsin (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (0.92)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area (0.93)
Information Technology (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Add feedback

Hierarchical Modeling and Architecture Optimization: Review and Unified Framework

Saves, Paul, Hallé-Hannan, Edward, Bussemaker, Jasper, Diouane, Youssef, Bartoli, Nathalie

arXiv.org Machine LearningJul-1-2025

Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. The framework supports the use of surrogate models over such domains and integrates hierarchical kernels and distances for efficient modeling and optimization. The proposed methods are implemented in the open-source Surrogate Modeling Toolbox (SMT 2.0), and their capabilities are demonstrated through applications in Bayesian optimization for complex system design, including a case study in green aircraft architecture.

data mining, machine learning, pattern recognition, (23 more...)

arXiv.org Machine Learning

2506.22621

Country:

Europe > Austria > Vienna (0.14)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
North America > Canada > Quebec > Montreal (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Aerospace & Defense (1.00)
Transportation > Air (0.46)

Technology:

Information Technology > Software (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(6 more...)

Add feedback

Logios : An open source Greek Polytonic Optical Character Recognition system

Konstantinos, Perifanos, Dionisis, Goutsos

arXiv.org Artificial IntelligenceJun-27-2025

--In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use. I. Introduction Historical Greek polytonic scripts have a rather complex target vocabulary and various set of rules resulting in a large character set of more than 200 characters, including the acute accent, the grave accent, the circumflex, the rough breathing (dasi pneuma), the smooth breathing (psilon pneuma), the diaeresis and the iota subscript [1].

machine learning, pattern recognition, recognition, (15 more...)

arXiv.org Artificial Intelligence

2506.21474

Country: Europe > Greece > Attica > Athens (0.05)

Genre: Research Report (0.53)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

Käs, Stephanie, Burenko, Anton, Markert, Louis, Culha, Onur Alp, Mack, Dennis, Linder, Timm, Leibe, Bastian

arXiv.org Artificial IntelligenceJun-27-2025

How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? Abstract -- Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves best performance, but V-JEPA comes close with a simple, task-specific classification head--thus paving a possible way towards reducing system complexity, by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need of further research on suitable input representations for gestures.

large language model, machine learning, pattern recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.20795

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision > Gesture Recognition (1.00)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset

Patapati, Santosh, Srinivasan, Trisanth, Adiraju, Amith

arXiv.org Artificial IntelligenceJun-23-2025

Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.

machine learning, pattern recognition, recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.16385

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision > Gesture Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Non-planar Object Detection and Identification by Features Matching and Triangulation Growth

Leveni, Filippo

arXiv.org Artificial IntelligenceJun-18-2025

Object detection and identification is surely a fundamental topic in the computer vision field; it plays a crucial role in many applications such as object tracking, industrial robots control, image retrieval, etc. We propose a feature-based approach for detecting and identifying distorted occurrences of a given template in a scene image by incremental grouping of feature matches between the image and the template. For this purpose, we consider the Delaunay triangulation of template features as an useful tool through which to be guided in this iterative approach. The triangulation is treated as a graph and, starting from a single triangle, neighboring nodes are considered and the corresponding features are identified; then matches related to them are evaluated to determine if they are worthy to be grouped. This evaluation is based on local consistency criteria derived from geometric and photometric properties of local features. Our solution allows the identification of the object in situations where geometric models (e.g. homography) does not hold, thus enable the detection of objects such that the template is non planar or when it is planar but appears distorted in the image. We show that our approach performs just as well or better than application of homography-based RANSAC in scenarios in which distortion is nearly absent, while when the deformation becomes relevant our method shows better description performance.

key point, machine learning, pattern recognition, (15 more...)

arXiv.org Artificial Intelligence

2506.13769

Genre: Research Report (0.81)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.67)
(2 more...)

Add feedback

MetaEformer: Unveiling and Leveraging Meta-patterns for Complex and Dynamic Systems Load Forecasting

Huang, Shaoyuan, Zhang, Tiancheng, Zhang, Zhongtian, Wang, Xiaofei, Wang, Lanjun, Wang, Xin

arXiv.org Artificial IntelligenceJun-17-2025

Time series forecasting is a critical and practical problem in many real-world applications, especially for industrial scenarios, where load forecasting underpins the intelligent operation of modern systems like clouds, power grids and traffic networks.However, the inherent complexity and dynamics of these systems present significant challenges. Despite advances in methods such as pattern recognition and anti-non-stationarity have led to performance gains, current methods fail to consistently ensure effectiveness across various system scenarios due to the intertwined issues of complex patterns, concept-drift, and few-shot problems. To address these challenges simultaneously, we introduce a novel scheme centered on fundamental waveform, a.k.a., meta-pattern. Specifically, we develop a unique Meta-pattern Pooling mechanism to purify and maintain meta-patterns, capturing the nuanced nature of system loads. Complementing this, the proposed Echo mechanism adaptively leverages the meta-patterns, enabling a flexible and precise pattern reconstruction. Our Meta-pattern Echo transformer (MetaEformer) seamlessly incorporates these mechanisms with the transformer-based predictor, offering end-to-end efficiency and interpretability of core processes. Demonstrating superior performance across eight benchmarks under three system scenarios, MetaEformer marks a significant advantage in accuracy, with a 37% relative improvement on fifteen state-of-the-art baselines.

artificial intelligence, machine learning, pattern recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.128

Country:

North America > Canada (0.18)
Asia > China (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.54)

Add feedback