Goto

Collaborating Authors

 Education


An energy-efficient spiking neural network with continuous learning for self-adaptive brain-machine interface

arXiv.org Artificial Intelligence

The number of simultaneously recorded neurons follows an exponentially increasing trend in implantable brain-machine interfaces (iBMIs). Integrating the neural decoder in the implant is an effective data compression method for future wireless iBMIs. However, the non-stationarity of the system makes the performance of the decoder unreliable. To avoid frequent retraining of the decoder and to ensure the safety and comfort of the iBMI user, continuous learning is essential for real-life applications. Since Deep Spiking Neural Networks (DSNNs) are being recognized as a promising approach for developing resource-efficient neural decoder, we propose continuous learning approaches with Reinforcement Learning (RL) algorithms adapted for DSNNs. Banditron and AGREL are chosen as the two candidate RL algorithms since they can be trained with limited computational resources, effectively addressing the non-stationary problem and fitting the energy constraints of implantable devices. To assess the effectiveness of the proposed methods, we conducted both open-loop and closed-loop experiments. The accuracy of open-loop experiments conducted with DSNN Banditron and DSNN AGREL remains stable over extended periods. Meanwhile, the time-to-target in the closed-loop experiment with perturbations, DSNN Banditron performed comparably to that of DSNN AGREL while achieving reductions of 98% in memory access usage and 99% in the requirements for multiply- and-accumulate (MAC) operations during training. Compared to previous continuous learning SNN decoders, DSNN Banditron requires 98% less computes making it a prime candidate for future wireless iBMI systems.


Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

arXiv.org Artificial Intelligence

Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/ฯ„$ rate, where $ฯ„$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.


Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

arXiv.org Artificial Intelligence

In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, improving the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric explores the response consistency of the LLMs, taking advantage of synthetically-generated questions with altered answer choices. With two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations in different benchmarks using diverse LLMs, and not only demonstrate that LLMs can present low response consistency even when they present high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.


Reducing research bureaucracy in UK higher education: Can generative AI assist with the internal evaluation of quality?

arXiv.org Artificial Intelligence

This paper examines the potential for generative artificial intelligence (GenAI) to assist with internal review processes for research quality evaluations in UK higher education and particularly in preparation for the Research Excellence Framework (REF). Using the lens of function substitution in the Viable Systems Model, we present an experimental methodology using ChatGPT to score and rank business and management papers from REF 2021 submissions, "reverse engineering" the assessment by comparing AI-generated scores with known institutional results. Through rigourous testing of 822 papers across 11 institutions, we established scoring boundaries that aligned with reported REF outcomes: 49% between 1* and 2*, 59% between 2* and 3*, and 69% between 3* and 4*. The results demonstrate that AI can provide consistent evaluations that help identify borderline evaluation cases requiring additional human scrutiny while reducing the substantial resource burden of traditional internal review processes. We argue for application through a nuanced hybrid approach that maintains academic integrity while addressing the multi-million pound costs associated with research evaluation bureaucracy. While acknowledging these limitations including potential AI biases, the research presents a promising framework for more efficient, consistent evaluations that could transform current approaches to research assessment.


EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants

arXiv.org Artificial Intelligence

With the growing use of Large Language Model (LLM)- based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment.


Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation

arXiv.org Artificial Intelligence

Large Language models (LLMs), such as ChatGPT, have gained popularity in recent years with the advancement of Natural Language Processing (NLP), with use cases spanning many disciplines and daily lives as well. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to comprehensively examine such shortcomings by identifying the existence and extent of such biases, recognizing the origin, and attempting to mitigate such biased outputs to ensure fair outputs to reduce harmful stereotypes and misinformation. This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrated that despite some cases of success, LLMs often over-rely on keywords in prompts and its outputs. This demonstrates the incapability of LLMs to attempt to truly understand the accuracy and authenticity of its outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning, models using different prompting techniques, and data augmentation of the bias benchmarks. We found fine-tuned models to exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.


Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach

arXiv.org Artificial Intelligence

Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model's predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalize across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias, (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), to significantly reduce computational costs while preserving bias mitigation effectiveness, and (3) an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy based on our proposed metric and the BaQCKV that mitigates selection bias, providing a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.


A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks

arXiv.org Artificial Intelligence

In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, whose effectiveness is heavily dependent on human experience, or integrate policy networks and pre-trained policy models, which are usually difficult to adapt to new dialogue scenarios and costly to train. Therefore, in this paper, we present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method that completely avoids specific model training by utilizing a Large Language Model (LLM) to simulate behaviors of user and system at the same time. Specifically, NRPA-GD constructs a complete evaluation mechanism for dialogue trajectories and employs an optimization framework of nested Monte Carlo simulation and policy self-adaptation to dynamically adjust policies during the dialogue process. The experimental results on four typical goal-oriented dialogue datasets show that NRPA-GD outperforms both existing prompt engineering and specifically pre-trained model-based methods. Impressively, NRPA-GD surpasses ChatGPT and pre-trained policy models with only a 0.6-billion-parameter LLM. The proposed approach further demonstrates the advantages and novelty of employing planning methods on LLMs to solve practical planning tasks.


Qwen3-VL Technical Report

arXiv.org Artificial Intelligence

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.


Human Experts' Evaluation of Generative AI for Contextualizing STEAM Education in the Global South

arXiv.org Artificial Intelligence

STEAM education in many parts of the Global South remains abstract and weakly connected to learners sociocultural realities. This study examines how human experts evaluate the capacity of Generative AI (GenAI) to contextualize STEAM instruction in these settings. Using a convergent mixed-methods design grounded in human-centered and culturally responsive pedagogy, four STEAM education experts reviewed standardized Ghana NaCCA lesson plans and GenAI-generated lessons created with a customized Culturally Responsive Lesson Planner (CRLP). Quantitative data were collected with a validated 25-item Culturally Responsive Pedagogy Rubric assessing bias awareness, cultural representation, contextual relevance, linguistic responsiveness, and teacher agency. Qualitative reflections provided additional insight into the pedagogical and cultural dynamics of each lesson. Findings show that GenAI, especially through the CRLP, improved connections between abstract standards and learners lived experiences. Teacher Agency was the strongest domain, while Cultural Representation was the weakest. CRLP-generated lessons were rated as more culturally grounded and pedagogically engaging. However, GenAI struggled to represent Ghana's cultural diversity, often producing surface-level references, especially in Mathematics and Computing. Experts stressed the need for teacher mediation, community input, and culturally informed refinement of AI outputs. Future work should involve classroom trials, broader expert participation, and fine-tuning with Indigenous corpora.