Goto

Collaborating Authors

 Education


Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs

arXiv.org Artificial Intelligence

This paper identifies a critical yet underexplored challenge in distilling from multi-modal large language models (MLLMs): the reasoning trajectories generated by multiple drifting teachers exhibit concept drift, whereby their reasoning distributions evolve unpredictably and transmit biases to the student model, ultimately compromising its performance. To tackle this issue, we pioneer a theoretical connection between concept drift and knowledge distillation, casting the non-stationary reasoning dynamics from multiple MLLM teachers as next-token prediction of multi-stream reasoning trajectories. Guided by concept drift, we introduce the "learn-compare-critique" paradigm, culminating in autonomous preference optimization (APO). Under the active guidance of the teachers, the student model first learns and self-distils preferred thinking by comparing multiple teachers. It then engages in critical reflection over the drifting inference from teachers, performing concept alignment through APO, ultimately yielding a robust, consistent, and generalizable model. Extensive experiments demonstrate our superior performance of consistency, robustness and generalization within knowledge distillation. Besides, we also contributed a large-scale dataset CXR-MAX (Multi-teachers Alignment X-rays), comprising 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR. Our code and data are public at: https://anonymous.4open.science/r/


Learning-Based Hashing for ANN Search: Foundations and Early Advances

arXiv.org Artificial Intelligence

Approximate Nearest Neighbour (ANN) search is a fundamental problem in information retrieval, underpinning large-scale applications in computer vision, natural language processing, and cross-modal search. Hashing-based methods provide an efficient solution by mapping high-dimensional data into compact binary codes that enable fast similarity computations in Hamming space. Over the past two decades, a substantial body of work has explored learning to hash, where projection and quantisation functions are optimised from data rather than chosen at random. This article offers a foundational survey of early learning-based hashing methods, with an emphasis on the core ideas that shaped the field. We review supervised, unsupervised, and semi-supervised approaches, highlighting how projection functions are designed to generate meaningful embeddings and how quantisation strategies convert these embeddings into binary codes. We also examine extensions to multi-bit and multi-threshold models, as well as early advances in cross-modal retrieval. Rather than providing an exhaustive account of the most recent methods, our goal is to introduce the conceptual foundations of learning-based hashing for ANN search. By situating these early models in their historical context, we aim to equip readers with a structured understanding of the principles, trade-offs, and open challenges that continue to inform current research in this area.


Wasserstein projection distance for fairness testing of regression models

arXiv.org Artificial Intelligence

Fairness in machine learning is a critical concern, yet most research has focused on classification tasks, leaving regression models underexplored. This paper introduces a Wasserstein projection-based framework for fairness testing in regression models, focusing on expectation-based criteria. We propose a hypothesis-testing approach and an optimal data perturbation method to improve fairness while balancing accuracy. Theoretical results include a detailed categorization of fairness criteria for regression, a dual reformulation of the Wasserstein projection test statistic, and the derivation of asymptotic bounds and limiting distributions. Experiments on synthetic and real-world datasets demonstrate that the proposed method offers higher specificity compared to permutation-based tests, and effectively detects and mitigates biases in real applications such as student performance and housing price prediction.


Efficient Training of Spiking Neural Networks by Spike-aware Data Pruning

arXiv.org Artificial Intelligence

Spiking neural networks (SNNs), recognized as an energy-efficient alternative to traditional artificial neural networks (ANNs), have advanced rapidly through the scaling of models and datasets. However, such scaling incurs considerable training overhead, posing challenges for researchers with limited computational resources and hindering the sustained development of SNNs. Data pruning is a promising strategy for accelerating training by retaining the most informative examples and discarding redundant ones, but it remains largely unexplored in SNNs. Directly applying ANN-based data pruning methods to SNNs fails to capture the intrinsic importance of examples and suffers from high gradient variance. To address these challenges, we propose a novel spike-aware data pruning (SADP) method. SADP reduces gradient variance by determining each example's selection probability to be proportional to its gradient norm, while avoiding the high cost of direct gradient computation through an efficient upper bound, termed spike-aware importance score. This score accounts for the influence of all-or-nothing spikes on the gradient norm and can be computed with negligible overhead. Extensive experiments across diverse datasets and architectures demonstrate that SADP consistently outperforms data pruning baselines and achieves training speedups close to the theoretical maxima at different pruning ratios. Notably, SADP reduces training time by 35% on ImageNet while maintaining accuracy comparable to that of full-data training. This work, therefore, establishes a data-centric paradigm for efficient SNN training and paves the way for scaling SNNs to larger models and datasets. The source code will be released publicly after the review process.


Using predefined vector systems as latent space configuration for neural network supervised training on data with arbitrarily large number of classes

arXiv.org Artificial Intelligence

Supervised learning (SL) methods are indispensable for neural network (NN) training used to perform classification tasks. While resulting in very high accuracy, SL training often requires making NN parameter number dependent on the number of classes, limiting their applicability when the number of classes is extremely large or unknown in advance. In this paper we propose a methodology that allows one to train the same NN architecture regardless of the number of classes. This is achieved by using predefined vector systems as the target latent space configuration (LSC) during NN training. We discuss the desired properties of target configurations and choose randomly perturbed vectors of An root system for our experiments. These vectors are used to successfully train encoders and visual transformers (ViT) on Cinic-10 and ImageNet-1K in low- and high-dimensional cases by matching NN predictions with the predefined vectors. Finally, ViT is trained on a dataset with 1.28 million classes illustrating the applicability of the method to training on datasets with extremely large number of classes. In addition, potential applications of LSC in lifelong learning and NN distillation are discussed illustrating versatility of the proposed methodology.


FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

arXiv.org Artificial Intelligence

Large language models (LLMs) increasingly rely on Chain-of-Thought (CoT) prompting to improve problem-solving and provide seemingly transparent explanations. However, growing evidence shows that CoT often fail to faithfully represent the underlying reasoning process, raising concerns about their reliability in high-risk applications. Although prior studies have focused on mechanism-level analyses showing that CoTs can be unfaithful, they leave open the practical challenge of deciding whether a specific trajectory is faithful to the internal reasoning of the model. To address this gap, we introduce FaithCoT-Bench, a unified benchmark for instance-level CoT unfaithfulness detection. Our framework establishes a rigorous task formulation that formulates unfaithfulness detection as a discriminative decision problem, and provides FINE-CoT (Faithfulness instance evaluation for Chain-of-Thought), an expert-annotated collection of over 1,000 trajectories generated by four representative LLMs across four domains, including more than 300 unfaithful instances with fine-grained causes and step-level evidence. We further conduct a systematic evaluation of eleven representative detection methods spanning counterfactual, logit-based, and LLM-as-judge paradigms, deriving empirical insights that clarify the strengths and weaknesses of existing approaches and reveal the increased challenges of detection in knowledge-intensive domains and with more advanced models. To the best of our knowledge, FaithCoT-Bench establishes the first comprehensive benchmark for instance-level CoT faithfulness, setting a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.


LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and visuals. This survey presents the first comprehensive, lifecycle-aligned taxonomy of data science agents, systematically analyzing and mapping forty-five systems onto the six stages of the end-to-end data science process: business understanding and data acquisition, exploratory analysis and visualization, feature engineering, model building and selection, interpretation and explanation, and deployment and monitoring. In addition to lifecycle coverage, we annotate each agent along five cross-cutting design dimensions: reasoning and planning style, modality integration, tool orchestration depth, learning and alignment methods, and trust, safety, and governance mechanisms. Beyond classification, we provide a critical synthesis of agent capabilities, highlight strengths and limitations at each stage, and review emerging benchmarks and evaluation practices. Our analysis identifies three key trends: most systems emphasize exploratory analysis, visualization, and modeling while neglecting business understanding, deployment, and monitoring; multimodal reasoning and tool orchestration remain unresolved challenges; and over 90% lack explicit trust and safety mechanisms. We conclude by outlining open challenges in alignment stability, explainability, governance, and robust evaluation frameworks, and propose future research directions to guide the development of robust, trustworthy, low-latency, transparent, and broadly accessible data science agents.


Distilling Reasoning into Student LLMs: Local Naturalness for Selecting Teacher Data

arXiv.org Artificial Intelligence

Distilling long reasoning traces (10K+ tokens) from stronger teacher models into smaller student LLMs via SFT has emerged as a standard paradigm. This approach is practical and efficient: it leverages the ease of generating abundant reasoning data from stronger models and provides a direct, data-driven way to teach less capable models better reasoning. While previous work has largely focused on prompt selection with responses from a single teacher, the equally important problem of choosing the best response when multiple teacher outputs are available for a single prompt remains underexplored. This challenge becomes important in a multi-teacher setting, where different students may benefit from the outputs of different teachers. This paper fills that gap with a systematic study of response selection for reasoning distillation. We first show that the current method, which picks responses the student assigns the highest global log-probability (global naturalness), fails when responses come from multiple teachers, i.e., global naturalness no longer correlates with downstream performance, especially as the reasoning traces from strong teachers become longer. To overcome this problem, we introduce Local Naturalness, which measures the student's log-probabilities over short, sequential reasoning steps conditioned only on a small local window. Local Naturalness enables two applications: 1) Teacher Selection: Aggregating local scores across prompts reliably identifies the most helpful teacher. 2) Response Selection from a Multiple Teachers: When mixing answers from many teachers, Local Naturalness boosts a 32B student's accuracy on math benchmarks by 9.4pp over global selection, also surpassing the performance achieved by training on data from the single best teacher. These results highlight the power of localized data quality evaluation and data mixing for more effective reasoning distillation.


Transductive and Learning-Augmented Online Regression

arXiv.org Artificial Intelligence

Motivated by the predictable nature of real-life in data streams, we study online regression when the learner has access to predictions about future examples. In the extreme case, called transductive online learning, the sequence of examples is revealed to the learner before the game begins. For this setting, we fully characterize the minimax expected regret in terms of the fat-shattering dimension, establishing a separation between transductive online regression and (adversarial) online regression. Then, we generalize this setting by allowing for noisy or \emph{imperfect} predictions about future examples. Using our results for the transductive online setting, we develop an online learner whose minimax expected regret matches the worst-case regret, improves smoothly with prediction quality, and significantly outperforms the worst-case regret when future example predictions are precise, achieving performance similar to the transductive online learner. This enables learnability for previously unlearnable classes under predictable examples, aligning with the broader learning-augmented model paradigm.


PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian

arXiv.org Artificial Intelligence

This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi-turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on-device deployment, enabling use without external servers. Development followed a three-stage process: (i) assessing SLMs psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning-oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets-PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions)-to benchmark against multiple baselines. Experiments compared simple prompting, multi-agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM-as-a-judge evaluation and was ranked highest by human evaluators in a single-turn preference study. In multi-turn tests with PsychoLexDialogue, the long-term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.