Pattern Recognition
DS-Span: Single-Phase Discriminative Subgraph Mining for Efficient Graph Embeddings
Kaiser, Yeamin, Anwar, Muhammed Tasnim Bin, Das, Bholanath
Graph representation learning seeks to transform complex, high-dimensional graph structures into compact vector spaces that preserve both topology and semantics. Among the various strategies, subgraph-based methods provide an interpretable bridge between symbolic pattern discovery and continuous embedding learning. Yet, existing frequent or discriminative subgraph mining approaches often suffer from redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance. We propose DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring within one traversal of the search space. DS-Span introduces a coverage-capped eligibility mechanism that dynamically limits exploration once a graph is sufficiently represented, and an information-gain-guided selection that promotes subgraphs with strong class-separating ability while minimizing redundancy. The resulting subgraph set serves as an efficient, interpretable basis for downstream graph embedding and classification. Extensive experiments across benchmarks demonstrate that DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime.
HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
Truc, Pham Thach Thanh, Nam, Dang Hoai, Khoa, Huynh Tong Dang, Duy, Vo Nguyen Le
Handwritten Text Recognition remains challenging due to the limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
Park, Simon, Kaur, Simran, Arora, Sanjeev
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a ``skill.'' Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability -- models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
Fast, Robust, Permutation-and-Sign Invariant SO(3) Pattern Alignment
Abstract--We address the correspondence-free alignment of two rotation sets on SO(3), a core task in calibration and registration that is often impeded by missing time alignment, outliers, and unknown axis conventions. T o handle axis relabels and sign flips, we introduce a Permutation-and-Sign Invariant (PASI) wrapper that enumerates the 24 proper signed permutations, scores them via summed correlations, and fuses the per-axis estimates into a single rotation by projection/Karcher mean. Experiments on EuRoC Machine Hall simulations (axis-consistent) and the ETH Hand-Eye benchmark (robot_arm_real) (axis-ambiguous) show that our methods are accurate, 6-60x faster than traditional methods, and robust under extreme outlier ratios (up to 90%), all without correspondence search. Estimating the 3D rotation that aligns one sensor or object frame to another is a fundamental problem in robotics and computer vision. Closed-form or least-squares solutions (e.g., Davenport/QUEST, SVD/Procrustes, and modern quaternion solvers) are mature [25], [26], [27], [28], [29], but they typically assume paired measurements (known correspondences) and degrade under heavy outliers or axis-convention mismatches.
Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
Sahoo, Subramanyam, Jain, Vinija, Vats, Saanidhya, Mohapatra, Siddharth, Min, Rui, Chadha, Aman, Chaudhary, Divya
Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity--while the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5
Nguyen, Minh Hoang, Thiet, Su Nguyen
Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5
Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
De, Anik, Penamakuri, Abhirama Subramanyam, Yadav, Rajeev, Rathore, Aditya, Shah, Harshiv, Sharma, Devesh, Agarwal, Sagar, Kumar, Pravin, Mishra, Anand
Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Chang, Hoyeon, Park, Jinho, Cho, Hanseul, Yang, Sohee, Ko, Miyoung, Hwang, Hyeonbin, Won, Seungpil, Lee, Dohaeng, Ahn, Youbin, Seo, Minjoon
Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.
Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories
Yang, Rushuai, Feng, Zhiyuan, Zhang, Tianxiang, Wang, Kaixin, Zhang, Chuheng, Zhao, Li, Su, Xiu, Chen, Yi, Bian, Jiang
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.
CORE -- A Cell-Level Coarse-to-Fine Image Registration Engine for Multi-stain Image Alignment
Nasir, Esha Sadia, Elhaminia, Behnaz, Eastwood, Mark, King, Catherine, Cain, Owen, Harper, Lorraine, Moss, Paul, Chanouzas, Dimitrios, Snead, David, Rajpoot, Nasir, Shephard, Adam, Raza, Shan E Ahmed
Accurate and efficient registration of whole slide images (WSIs) is essential for high-resolution, nuclei-level analysis in multi-stained tissue slides. We propose a novel coarse-to-fine framework CORE for accurate nuclei-level registration across diverse multimodal whole-slide image (WSI) datasets. The coarse registration stage leverages prompt-based tissue mask extraction to effectively filter out artefacts and non-tissue regions, followed by global alignment using tissue morphology and ac- celerated dense feature matching with a pre-trained feature extractor. From the coarsely aligned slides, nuclei centroids are detected and subjected to fine-grained rigid registration using a custom, shape-aware point-set registration model. Finally, non-rigid alignment at the cellular level is achieved by estimating a non-linear dis- placement field using Coherent Point Drift (CPD). Our approach benefits from automatically generated nuclei that enhance the accuracy of deformable registra- tion and ensure precise nuclei-level correspondence across modalities. The pro- posed model is evaluated on three publicly available WSI registration datasets, and two private datasets. We show that CORE outperforms current state-of-the-art methods in terms of generalisability, precision, and robustness in bright-field and immunofluorescence microscopy WSIs