Education
Towards Unsupervised Speech Recognition at the Syllable-Level
Wang, Liming, Ni, Junrui, Chang, Kai-Wei, Bhati, Saurabhchand, Harwath, David, Hasegawa-Johnson, Mark, Glass, James R.
Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40\% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
MECKD: Deep Learning-Based Fall Detection in Multilayer Mobile Edge Computing With Knowledge Distillation
Mao, Wei-Lung, Wang, Chun-Chi, Chou, Po-Heng, Liu, Kai-Chun, Tsao, Yu
The rising aging population has increased the importance of fall detection (FD) systems as an assistive technology, where deep learning techniques are widely applied to enhance accuracy. FD systems typically use edge devices (EDs) worn by individuals to collect real-time data, which are transmitted to a cloud center (CC) or processed locally. However, this architecture faces challenges such as a limited ED model size and data transmission latency to the CC. Mobile edge computing (MEC), which allows computations at MEC servers deployed between EDs and CC, has been explored to address these challenges. We propose a multilayer MEC (MLMEC) framework to balance accuracy and latency. The MLMEC splits the architecture into stations, each with a neural network model. If front-end equipment cannot detect falls reliably, data are transmitted to a station with more robust back-end computing. The knowledge distillation (KD) approach was employed to improve front-end detection accuracy by allowing high-power back-end stations to provide additional learning experiences, enhancing precision while reducing latency and processing loads. Simulation results demonstrate that the KD approach improved accuracy by 11.65% on the SisFall dataset and 2.78% on the FallAllD dataset. The MLMEC with KD also reduced the data latency rate by 54.15% on the FallAllD dataset and 46.67% on the SisFall dataset compared to the MLMEC without KD. In summary, the MLMEC FD system exhibits improved accuracy and reduced latency.
Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study
Lee, Chu-Hsuan, Chang, Chen-Chi, Lee, Hung-Shin, Hsu, Yun-Hsiang, Chen, Ching-Yuan
With many endangered languages at risk of disappearing, efforts to preserve them now rely more than ever on using technology alongside culturally informed teaching strategies. This study examines user behaviors in TALKA, a generative AI-powered chatbot designed for Hakka language engagement, by employing a dual-layered analytical framework grounded in Bloom's Taxonomy of cognitive processes and dialogue act categorization. We analyzed 7,077 user utterances, each carefully annotated according to six cognitive levels and eleven dialogue act types. These included a variety of functions, such as asking for information, requesting translations, making cultural inquiries, and using language creatively. Pragmatic classifications further highlight how different types of dialogue acts--such as feedback, control commands, and social greetings--align with specific cognitive intentions. The results suggest that generative AI chatbots can support language learning in meaningful ways--especially when they are designed with an understanding of how users think and communicate. They may also help learners express themselves more confidently and connect with their cultural identity. The TALKA case provides empirical insights into how AI-mediated dialogue facilitates cognitive development in low-resource language learners, as well as pragmatic negotiation and socio-cultural affiliation. By focusing on AI-assisted language learning, this study offers new insights into how technology can support language preservation and educational practice.
ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Ho, Matthew, Si, Chen, Feng, Zhaoxiang, Yu, Fangxu, Yang, Yichi, Liu, Zhijian, Hu, Zhiting, Qin, Lianhui
While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation
Tang, Xiaqiang, Wang, Yi, Hu, Keyu, Xu, Rui, Li, Chuang, Sun, Weigao, Li, Jian, Xie, Sihong
Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of \emph{likelihood displacement}, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: https://github.com/chkwy/SSFO
OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
Merx, Raphaรซl, Suominen, Hanna, Cohn, Trevor, Vylomova, Ekaterina
In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Aggarwal, Pranjal, Kim, Seungone, Lanchantin, Jack, Welleck, Sean, Weston, Jason, Kulikov, Ilia, Saha, Swarnadeep
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
Han, Jinyi, Wang, Xinyi, Zhao, Haiquan, li, Tingyun, Jiang, Zishang, Jiang, Sihang, Liang, Jiaqing, Lin, Xin, Zhou, Weikang, Sun, Zeye, Yu, Fei, Xiao, Yanghua
Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and baselines used in the paper are available on GitHub.
Cross-Modal Distillation For Widely Differing Modalities
Zhao, Cairong, Jin, Yufeng, Song, Zifan, Chen, Haonan, Miao, Duoqian, Hu, Guosheng
Abstract--Deep learning achieved great progress recently, however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. T o solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find hard constrained loss, e.g. T o address this, we propose two soft constrained knowledge distillation strategies at the feature level and classifier level respectively . In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech. The rapid advancement of deep learning has revolutionized numerous fields by enabling the development of increasingly complex and powerful models. However, as model sizes continue to grow, the marginal benefits of scaling up models diminish, prompting researchers to explore alternative strategies for improving performance. One such strategy is multi-modal learning, which leverages the complementary strengths of multiple data modalities--such as images, speech, and text--to enhance task performance. While multi-modal learning has shown promise in various applications, its practical adoption is often hindered by the high cost and complexity of acquiring and processing multi-modal data. This limitation raises a critical question: how can we effectively utilize multi-modal data during training when only uni-modal data is available during deployment? T o address this challenge, we propose a novel framework for cross-modal knowledge distillation, which enables the transfer of knowledge from a strong modality (e.g., images) to a weak modality (e.g., speech) during training, even when only the weak modality is available during inference.
Refactoring Codebases through Library Design
Kovacic, Ziga, Chiu, Justin T., Lee, Celine, Zhao, Wenting, Ellis, Kevin
Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become used to solve isolated one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and study it on real-world code bases.