Pattern Recognition
How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI
Wen, Bo, Wang, Chen, Bilal, Erhan
ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks -- text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing -- thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.
- Research Report > New Finding (1.00)
- Workflow (0.93)
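The abstract does not define its weighted set-disagreement metric precisely; a minimal sketch of one plausible form — a weighted, normalized symmetric difference over the sets of perceived grid cells, where the weight function is a placeholder for feature importance — is:

```python
def weighted_set_disagreement(pred, truth, weight=None):
    """Disagreement between two perceived cell sets.

    pred, truth: sets of (row, col, color) tuples describing grid cells.
    weight: optional importance function over cells; defaults to 1.0,
    which reduces the metric to a plain normalized symmetric difference.
    """
    weight = weight or (lambda cell: 1.0)
    sym_diff = (pred - truth) | (truth - pred)   # cells perceived by only one side
    union = pred | truth
    if not union:
        return 0.0                               # two empty perceptions agree
    return sum(weight(c) for c in sym_diff) / sum(weight(c) for c in union)
```

For example, if the model perceives `{(0,0,3), (0,1,3)}` but the ground truth is `{(0,0,3), (1,1,3)}`, two of the three distinct cells disagree, giving 2/3 under unit weights.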
A Simple Cache Model for Image Recognition
Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or fine-tune it with new data. Here, we show that, surprisingly, this is indeed possible. The key observation we make is that the layers of a deep network close to the output layer contain independent, easily extractable class-relevant information that is not contained in the output layer itself. We propose to extract this extra class-relevant information using a simple key-value cache memory to improve the classification performance of the model at test time.
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Image Matching (0.40)
- North America > United States > New York (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Image Matching (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Asia > South Korea > Daejeon > Daejeon (0.05)
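The cache idea above — storing late-layer features as keys and labels as values, then blending a similarity-weighted vote with the model's own prediction — can be sketched as follows. The exact formulation is not given in the abstract; the sharpening temperature `theta` and mixing weight `lam` here are hypothetical hyperparameters.

```python
import numpy as np

def cache_predict(feats, cache_keys, cache_labels, logits, theta=10.0, lam=0.5):
    """Blend a trained model's class scores with a key-value cache built
    from near-output-layer features of stored training items.

    feats:        (d,) query feature vector from a late layer
    cache_keys:   (n, d) stored feature vectors (the keys)
    cache_labels: (n, k) one-hot labels of the stored items (the values)
    logits:       (k,) the model's own class scores for the query
    """
    # cosine similarity of the query to every cached key, sharpened by theta
    keys = cache_keys / np.linalg.norm(cache_keys, axis=1, keepdims=True)
    q = feats / np.linalg.norm(feats)
    sim = np.exp(theta * (keys @ q))
    cache_probs = sim @ cache_labels / sim.sum()   # similarity-weighted label vote
    model_probs = np.exp(logits - logits.max())
    model_probs /= model_probs.sum()               # softmax of the model's logits
    # convex combination: no retraining, only test-time blending
    return lam * cache_probs + (1 - lam) * model_probs
```

Because the cache only reads features the network already computes, this matches the abstract's premise that the extra class-relevant information can be exploited without retraining or fine-tuning.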
edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer
Qian, Chen, Yu, Xinran, Huang, Zewen, Li, Danyang, Ma, Qiang, Dang, Fan, Ding, Xuan, Shang, Guangyong, Yang, Zheng
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLM inference. Based on this paradigm, we design edgeVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision-language reasoning tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems. Code will be publicly released before publication.
- Information Technology (0.90)
- Transportation > Ground > Road (0.35)
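The Context Transfer control flow — every frame answered immediately by the edge SVLM, while a delayed LVLM result on an earlier frame is carried forward as guiding context — can be sketched with stub models. Both model functions and the fixed two-frame delay below are stand-ins, not edgeVLM's actual components.

```python
import time

def slow_accurate_lvlm(frame):          # stand-in cloud LVLM: delayed but accurate
    time.sleep(0.05)                    # simulated network + inference latency
    return f"detailed description of {frame}"

def fast_small_svlm(frame, context):    # stand-in edge SVLM: fast, context-guided
    return f"answer for {frame} given context: {context}"

def context_transfer(frames):
    """Serve every frame in real time from the edge SVLM, while the LVLM's
    delayed output on an earlier frame becomes historical context."""
    context, pending = "", None         # pending = (frame offloaded, frame index)
    answers = []
    for t, frame in enumerate(frames):
        # collect the LVLM result once its (simulated) delay has elapsed
        if pending is not None and t - pending[1] >= 2:
            context = slow_accurate_lvlm(pending[0])   # delayed output -> context
            pending = None
        if pending is None:
            pending = (frame, t)        # offload the current frame to the cloud
        answers.append(fast_small_svlm(frame, context))  # real-time edge path
    return answers
```

The key property is that the latency-sensitive path never blocks on the cloud: the SVLM always answers immediately, and accuracy improves once stale-but-accurate LVLM context arrives.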
DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision-Language Transformers to Missing Modalities
Lu, Jueqing, Qi, Yuanyuan, Yang, Xiaohao, Niu, Shuaicheng, Ke, Fucai, Zhou, Shujie, Tan, Wei, Lin, Jionghao, Buntine, Wray, Rezatofighi, Hamid, Du, Lan
The performance of Vision-Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions using incomplete information. Existing missing-aware prompt methods help reduce this degradation, but they still rely on conventional prediction heads (e.g., a Fully-Connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality cases (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information actually present. This adaptive design allows DPL to handle inputs with missing modalities more effectively while remaining fully compatible with existing prompt-based frameworks. Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes demonstrate that DPL outperforms state-of-the-art approaches across all widely used multimodal image-text datasets and various missing cases.
- North America > Dominican Republic (0.04)
- Asia > China > Hong Kong (0.04)
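A minimal sketch of the DPL head's two ideas — per-case prototype selection and per-modality decomposition — is below. The abstract does not specify the scoring rule; dot-product similarity with a max over each class's prototypes is an assumption here.

```python
import numpy as np

def dpl_scores(img_feat, txt_feat, prototypes, case):
    """Decoupled-prototype scoring sketch.

    prototypes: dict mapping a missing-modality case
        ('image-missing' | 'text-missing' | 'mixed-missing' | 'complete')
        to per-class prototype banks, each decomposed into an
        image-specific and a text-specific component:
        prototypes[case]['img'], prototypes[case]['txt'] : (k, p, d) arrays
        for k classes, p prototypes per class, d feature dims.
    Only the components whose modality is actually present contribute.
    """
    protos = prototypes[case]                      # select prototypes for this case
    score = np.zeros(protos["img"].shape[0])
    if img_feat is not None:                       # image branch available
        score += (protos["img"] @ img_feat).max(axis=1)  # best prototype per class
    if txt_feat is not None:                       # text branch available
        score += (protos["txt"] @ txt_feat).max(axis=1)
    return score
```

A conventional fully-connected head would apply the same weights regardless of `case`; here both the selected prototype bank and the active branches change with the observed modalities.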
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Zhang, Jiarui, Liu, Yuliang, Wu, Zijun, Pang, Guosheng, Ye, Zhili, Zhong, Yupei, Ma, Junteng, Wei, Tao, Xu, Haiyang, Chen, Weikai, Wang, Zeen, Ji, Qiangjun, Zhou, Fanxi, Zhang, Qi, Hu, Yuanrui, Liu, Jiahao, Li, Zhang, Zhang, Ziyang, Liu, Qiang, Bai, Xiang
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage pipeline. The first stage employs a large multimodal model to jointly predict layout and reading order, leveraging visual information to ensure sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios. A trial link can be found at https://github.com/Yuliang-Liu/MonkeyOCR.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
- (2 more...)
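The render-and-compare reward above can be illustrated in miniature: render the predicted and reference structures and score their pixel overlap, so no manual structure annotations are needed. The real system renders predicted table markup to images; this toy version "renders" character rows onto a binary grid and uses pixel IoU.

```python
import numpy as np

def rasterize(rows, height=8, width=24):
    """Toy 'renderer': draw text rows onto a binary pixel grid.
    (A real pipeline would render predicted table HTML/Markdown to an image.)"""
    img = np.zeros((height, width), dtype=bool)
    for r, row in enumerate(rows[:height]):
        for c, ch in enumerate(row[:width]):
            img[r, c] = ch != " "
    return img

def visual_consistency_reward(pred_rows, gold_rows):
    """Render-and-compare reward: pixel IoU between the rendered prediction
    and the rendered reference. 1.0 = pixel-identical, 0.0 = disjoint."""
    a, b = rasterize(pred_rows), rasterize(gold_rows)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(inter / union)
```

Because the reward compares rendered appearance rather than token sequences, a prediction that reproduces the table's visual structure scores highly even if its markup differs superficially from the reference.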
Limitations of Quantum Advantage in Unsupervised Machine Learning
Machine learning models are used for pattern recognition analysis of big data, without direct human intervention. The task of unsupervised learning is to find the probability distribution that would best describe the available data, and then use it to make predictions for observables of interest. Classical models generally fit the data to Boltzmann distribution of Hamiltonians with a large number of tunable parameters. Quantum extensions of these models replace classical probability distributions with quantum density matrices. An advantage can be obtained only when features of density matrices that are absent in classical probability distributions are exploited. Such situations depend on the input data as well as the targeted observables. Explicit examples are discussed that bring out the constraints limiting possible quantum advantage. The problem-dependent extent of quantum advantage has implications for both data analysis and sensing applications.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- Asia > India > Karnataka > Bengaluru (0.05)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
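The abstract's central constraint — that an advantage requires exploiting density-matrix features absent from classical probability distributions, and depends on the targeted observable — can be made concrete with one qubit. With identical populations, a purely off-diagonal observable is invisible to the classical distribution but not to a state with coherences:

```python
import numpy as np

# One qubit: same populations, with and without quantum coherences.
X = np.array([[0.0, 1.0], [1.0, 0.0]])   # observable with no diagonal part
p = np.array([0.5, 0.5])                  # classical probability distribution

rho_classical = np.diag(p)                # density matrix with zero coherences
rho_quantum = np.full((2, 2), 0.5)        # pure state (|0> + |1>)/sqrt(2)

# <X> = Tr(rho X): only the off-diagonal (coherence) terms contribute.
exp_classical = np.trace(rho_classical @ X)   # 0.0 -- X is invisible classically
exp_quantum = np.trace(rho_quantum @ X)       # 1.0 -- requires coherences
```

For a diagonal observable, both states give identical expectation values, which is exactly the problem-dependence the abstract describes: whether quantum structure helps is set jointly by the data and the observable of interest.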
Masking criteria for selecting an imputation model
Yang, Yanjiao, Suen, Daniel, Chen, Yen-Chi
Missing data is a common problem across various scientific disciplines, including medical research (Bell et al., 2014), social sciences (Molenberghs et al., 2014), and astronomy (Ivezić et al., 2020). To handle missing entries in the dataset, imputation (Grzesiak et al., 2025; Kim and Shao, 2021; Little and Rubin, 2019) is a popular approach that is widely accepted in practice. An imputation model generates plausible values for each missing entry, transforming an incomplete dataset into a complete one. The critical importance of this task has led to the development of a wide array of imputation models, grounded in various modeling assumptions. These range from traditional approaches like hot-deck imputation (Little and Rubin, 2019) to more sophisticated methods such as Multiple Imputation via Chained Equations (MICE; Van Buuren and Groothuis-Oudshoorn 2011), random forest imputation (Stekhoven and Bühlmann, 2012), techniques based on Markov assumptions on graphs (Yang and Chen, 2025), and even generative adversarial networks (Yoon et al., 2018). Despite the proliferation of imputation models, the selection of an optimal imputation model for a given dataset remains a significant challenge, largely due to the unsupervised nature of the problem. Among the many proposed strategies for evaluating and selecting imputation models, masking has emerged as a particularly popular procedure (Gelman et al., 1998; Honaker et al., 2011; Leek et al., 2012; Qian et al., 2024; Troyanskaya et al., 2001; Wang et al., 2024). Masking involves intentionally creating missing values in observed entries to create a setting where imputation accuracy can be measured against a known ground truth.
This approach has demonstrated remarkable success and power in other domains, notably in language modeling (Devlin et al., 2019; Yang et al., 2019), image recognition (Hondru et al., 2025; Vincent et al., 2010; Xie et al., 2022), and prediction-powered inference (Angelopoulos et al., 2023; Wang et al., 2020).
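The masking procedure described above has a simple generic form: hide a random subset of observed entries, impute them, and score the model against the held-out truth. The sketch below uses RMSE and a fixed masking fraction; the specific criteria the paper proposes are not reproduced here.

```python
import numpy as np

def masking_score(X, impute, mask_frac=0.1, seed=0):
    """Score an imputation model by masking.

    X: 2-D float array with np.nan marking originally missing entries.
    impute: function mapping an incomplete array to a completed one.
    Returns RMSE on the artificially masked entries (lower is better).
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))              # candidate entries to mask
    n_mask = max(1, int(mask_frac * len(observed)))
    picked = observed[rng.choice(len(observed), size=n_mask, replace=False)]
    truth = X[picked[:, 0], picked[:, 1]]             # held-out ground truth
    X_masked = X.copy()
    X_masked[picked[:, 0], picked[:, 1]] = np.nan     # artificial missingness
    X_hat = impute(X_masked)                          # candidate model fills gaps
    preds = X_hat[picked[:, 0], picked[:, 1]]
    return float(np.sqrt(np.mean((preds - truth) ** 2)))
```

Running this for each candidate imputation model and choosing the lowest score turns the unsupervised selection problem into a supervised comparison, which is exactly what makes masking attractive in practice.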
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.47)