

Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks

arXiv.org Artificial Intelligence

Advancements in AI-driven technologies have significantly enhanced modern education through personalized tutoring and adaptive learning strategies on online platforms [1], [2]. Intelligent Tutoring Systems (ITSs) exemplify this progress by leveraging advanced machine learning and natural language processing models to create interactive learning environments that improve outcomes across domains like literacy [3], mathematics [4], language learning [5], biology [6] and other STEM fields [7]. As human learners interact with ITSs, often through question-and-answer scenarios with immediate responses, their performance data becomes crucial for learner modeling, enabling systems to track progress, predict future performance, and adapt instruction accordingly [8]. Learner models like Bayesian Knowledge Tracing (BKT) and other knowledge tracing variants use the learner performance data to uncover learning characteristics and to estimate knowledge states and acquisition [9]. However, in real-world scenarios, missing learner performance data is prevalent due to factors such as learner dropout or disengagement [10], technical issues or incomplete data logging [11], biased sampling within experimental groups [12], and more. These challenges often lead to sparse data, where items (i.e., questions or problems) remain unattempted (e.g., learners may bypass the question, leave it unanswered due to a lack of response initiation, or make no attempt to engage with it), alongside limited learner interactions [13], [14]. As shown in Figure 1, missing performance records can occur along both the attempt and question dimensions during learner-ITS interactions. In the right portion of the figure's two matrices, entries marked with "?


Why is constrained neural language generation particularly challenging?

arXiv.org Artificial Intelligence

Recent advances in deep neural language models combined with the capacity of large scale datasets have accelerated the development of natural language generation systems that produce fluent and coherent texts (to various degrees of success) in a multitude of tasks and application contexts. However, controlling the output of these models for specific user and task needs is still an open challenge. This is crucial not only for customizing the content and style of the generated language, but also for their safe and reliable deployment in the real world. We present an extensive survey on the emerging topic of constrained neural language generation in which we formally define and categorize the problems of natural language generation by distinguishing between conditions and constraints (the latter being testable conditions on the output text instead of the input), present constrained text generation tasks, and review existing methods and evaluation metrics for constrained text generation. Our aim is to highlight recent progress and trends in this emerging field, informing on the most promising directions and limitations towards advancing the state-of-the-art of constrained neural language generation research.


VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

arXiv.org Artificial Intelligence

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion (e.g., image-to-text), and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model's capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. These results highlight VLMT's strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems. The exponential growth of information in today's digital ecosystem has led to the proliferation of multimodal data--comprising text, tables, and images--across a wide range of platforms.
Qi Zhi Lim is with the Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia (e-mail: 1181103589@student.mmu.edu.my). Chin Poo Lee is with the School of Computer Science, University of Nottingham Ningbo China, 199 Taikang East Road, Yinzhou District, Ningbo, Zhejiang Province, 315100, China (e-mail: leechinpoo@outlook.com). Kian Ming Lim is with the School of Computer Science, University of Nottingham Ningbo China, 199 Taikang East Road, Yinzhou District, Ningbo, Zhejiang Province, 315100, China (e-mail: Kian-Ming.Lim@nottingham.edu.cn). Multimodal Multi-hop Question Answering (MMQA) [1], [2] has emerged as a representative task in this domain, reflecting real-world information-seeking behavior where relevant evidence is scattered across multiple sources and modalities. MMQA requires models to perform two interdependent operations: retrieving relevant multimodal context and reasoning over the retrieved information to produce accurate and coherent answers. Early solutions to MMQA have largely followed modular paradigms.
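
The abstract mentions a reranker that selects context with "a relative threshold with top-k strategy" but does not give the rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: a document is kept if its score is at least a fixed fraction of the best score, and at most k documents are returned. The function name `select_context` and the parameters `ratio` and `k` are hypothetical.

```python
def select_context(scores, ratio=0.5, k=3):
    """Pick context documents from reranker scores.

    scores: dict mapping a document id to a non-negative relevance score.
    ratio:  assumed relative threshold; keep documents scoring at least
            ratio * (best score). This rule is an assumption, not the paper's.
    k:      upper bound on the number of documents returned.
    """
    if not scores:
        return []
    best = max(scores.values())
    # Rank documents from most to least relevant.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Apply the relative threshold, then cap the result at k documents.
    kept = [doc for doc, score in ranked if score >= ratio * best]
    return kept[:k]


example_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.4, "d4": 0.88}
selected = select_context(example_scores, ratio=0.5, k=3)
```

A relative threshold adapts to the score scale of each question, while the cap bounds the context length passed to the answering model.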


L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

arXiv.org Artificial Intelligence

Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.
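
To make the idea of grading step-by-step execution traces concrete, here is a minimal sketch under stated assumptions. The program format (a list of `(operation, operand)` pairs applied to one integer), the trace line format, and the helper names `reference_trace` and `first_error` are all hypothetical; the abstract does not describe L0-Bench's actual program or trace format.

```python
from typing import List, Tuple

# Hypothetical program format: each step is (operation name, integer operand).
Step = Tuple[str, int]


def reference_trace(start: int, program: List[Step]) -> List[str]:
    """Execute the program step by step and record one trace line per step."""
    ops = {
        "add": lambda x, k: x + k,
        "sub": lambda x, k: x - k,
        "mul": lambda x, k: x * k,
    }
    value = start
    trace = []
    for index, (name, operand) in enumerate(program):
        value = ops[name](value, operand)
        trace.append(f"step {index}: {name} {operand} -> {value}")
    return trace


def first_error(model_trace: List[str], expected: List[str]) -> int:
    """Return the index of the first wrong or missing step, or -1 if the traces match."""
    for index, (got, want) in enumerate(zip(model_trace, expected)):
        if got.strip() != want:
            return index
    if len(model_trace) != len(expected):
        # One trace is a strict prefix of the other.
        return min(len(model_trace), len(expected))
    return -1


program = [("add", 3), ("mul", 2), ("sub", 5)]
expected = reference_trace(1, program)
```

Grading on the first error, rather than only the final value, is what separates procedural correctness from outcome correctness: a trace can end on the right number after a wrong intermediate step.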


Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments

arXiv.org Artificial Intelligence

Small and mid-sized generative language models have gained increasing attention. Their size and availability make them amenable to being analyzed at a behavioral as well as a representational level, allowing investigations of how these levels interact. We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments on a word triplet task. This provides a novel evaluation setting to probe semantic associations in language beyond common pairwise comparisons. We find that (1) even the representations of small language models can achieve human-level alignment, (2) instruction-tuned model variants can exhibit substantially increased agreement, (3) the pattern of alignment across layers is highly model dependent, and (4) alignment based on models' behavioral responses is highly dependent on model size, matching their representational alignment only for the largest evaluated models.
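
A representational judgment on a word triplet can be sketched as follows, assuming the task is to pick the two most similar words out of three and that similarity is cosine similarity between embedding vectors. Both are assumptions, since the abstract does not spell out the protocol. The toy vectors are made up for illustration and do not come from any model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Given a dict of three words -> vectors, return the most similar pair."""
    pairs = combinations(sorted(embeddings), 2)
    return max(pairs, key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]))


def agreement(model_choices, human_choices):
    """Fraction of triplets where the model's pair equals the human-chosen pair."""
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)


# Toy, hand-written vectors (not real model embeddings).
toy = {
    "cat": [1.0, 0.9, 0.1],
    "rat": [0.9, 1.0, 0.0],
    "meow": [0.2, 0.1, 1.0],
}
pair = most_similar_pair(toy)
```

Behavioral alignment would instead prompt the model for its choice and feed the answers to the same `agreement` function.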


VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

arXiv.org Artificial Intelligence

The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than on reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.


Better Decisions through the Right Causal World Model

arXiv.org Artificial Intelligence

Reinforcement learning (RL) agents have shown remarkable performance in various environments, where they can discover effective policies directly from sensory inputs. However, these agents often exploit spurious correlations in the training data, resulting in brittle behaviours that fail to generalize to new or slightly modified environments. To address this, we introduce the Causal Object-centric Model Extraction Tool (COMET), a novel algorithm designed to learn exact, interpretable causal world models (CWMs). COMET first extracts object-centric state descriptions from observations and identifies the environment's internal states related to the depicted objects' properties. Using symbolic regression, it models object-centric transitions and derives causal relationships governing object dynamics. COMET further incorporates large language models (LLMs) for semantic inference, annotating causal variables to enhance interpretability. By leveraging these capabilities, COMET constructs CWMs that align with the true causal structure of the environment, enabling agents to focus on task-relevant features. The extracted CWMs mitigate the danger of shortcuts, permitting the development of RL systems capable of better planning and decision-making across dynamic scenarios. Our results, validated in Atari environments such as Pong and Freeway, demonstrate the accuracy and robustness of COMET, highlighting its potential to bridge the gap between object-centric reasoning and causal inference in reinforcement learning.


ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning

arXiv.org Artificial Intelligence

Pre-trained large language models (LLMs) have been demonstrated to possess intrinsic reasoning capabilities that can emerge naturally when expanding the response space. However, the neural representation mechanisms underlying these intrinsic capabilities and approaches for their optimal utilization remain inadequately understood. In this work, we make the key discovery that a simple linear classifier can effectively detect intrinsic reasoning capabilities in LLMs' activation space, particularly within specific representation types and network layers. Based on this finding, we propose a classifier-guided search framework that strategically explores a tree-structured response space. In each node expansion, the classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by identifying and prioritizing more thoughtful reasoning directions for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We propose a branch-aggregation selection method that marginalizes over all supporting branches by aggregating their thoughtfulness scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework's comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
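
The branch-aggregation step described above can be sketched as follows. This is a minimal illustration, assuming that "aggregating" means summing the classifier's thoughtfulness scores over all branches that reach the same answer; the abstract does not state the exact aggregation function, and the name `select_answer` is hypothetical.

```python
from collections import defaultdict


def select_answer(branches):
    """Pick an answer from a pool of (answer, thoughtfulness_score) branches.

    Scores of branches that end in the same answer are summed (an assumed
    aggregation rule), and the answer with the largest total is returned
    together with the per-answer totals.
    """
    totals = defaultdict(float)
    for answer, score in branches:
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Two moderately scored branches agree on "42"; one high-scoring branch says "41".
pool = [("42", 0.6), ("41", 0.9), ("42", 0.5)]
best_answer, totals = select_answer(pool)
```

Summation rewards answers supported by several branches, so a single high-scoring branch does not automatically win.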


Polygon: Symbolic Reasoning for SQL using Conflict-Driven Under-Approximation Search

arXiv.org Artificial Intelligence

We present a novel symbolic reasoning engine for SQL which can efficiently generate an input $I$ for $n$ queries $P_1, \cdots, P_n$, such that their outputs on $I$ satisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of each $P_i$ -- that is, a subset of $P_i$'s input-output behaviors. While it makes our approach both semantics-aware and lightweight, this idea alone is incomplete (as a fixed under-approximation might miss some behaviors of interest). Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques.
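
To illustrate the equivalence-refutation task only, here is a naive brute-force sketch: it enumerates small input tables and checks them with SQLite. This is explicitly not Polygon's symbolic, SMT-based under-approximation search. The single-column table `t(x)`, the value domain, and the function name `find_counterexample` are assumptions chosen for the example.

```python
import itertools
import sqlite3


def find_counterexample(query_a, query_b, max_rows=2, domain=(0, 1, 2)):
    """Search small instances of table t(x) for one where the queries differ.

    Outputs are compared as bags (sorted row lists). Returns the list of
    inserted values for the first differing input, or None if none is found
    within the bounds.
    """
    for n in range(max_rows + 1):
        for rows in itertools.product(domain, repeat=n):
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in rows])
            out_a = sorted(conn.execute(query_a).fetchall())
            out_b = sorted(conn.execute(query_b).fetchall())
            conn.close()
            if out_a != out_b:
                return list(rows)
    return None


q1 = "SELECT x FROM t WHERE x > 0"
q2 = "SELECT x FROM t WHERE x >= 0"
witness = find_counterexample(q1, q2)
```

Enumeration like this grows exponentially with the table size and domain, which is the kind of cost a symbolic engine aims to avoid.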


Digital Gene: Learning about the Physical World through Analytic Concepts

arXiv.org Artificial Intelligence

Reviewing the progress in artificial intelligence over the past decade, various significant advances (e.g. object detection, image generation, large language models) have enabled AI systems to produce more semantically meaningful outputs and achieve widespread adoption in internet scenarios. Nevertheless, AI systems still struggle when it comes to understanding and interacting with the physical world. This reveals an important issue: relying solely on semantic-level concepts learned from internet data (e.g. texts, images) to understand the physical world is far from sufficient -- machine intelligence currently lacks an effective way to learn about the physical world. This research introduces the idea of analytic concepts -- representing concepts related to the physical world through programs of mathematical procedures, providing machine intelligence a portal to perceive, reason about, and interact with the physical world. Besides detailing the design philosophy and providing guidelines for the application of analytic concepts, this research also introduces the infrastructure that has been built around them. I aim for my research to contribute to addressing these questions: What is a proper abstraction of general concepts in the physical world for machine intelligence? How can structured priors be systematically integrated with neural networks to constrain AI systems to comply with physical laws?