Education
Effect of Performance Feedback Timing on Motor Learning for a Surgical Training Task
Gale, Mary Kate, Baker-Matsuoka, Kailana, Nisky, Ilana, Okamura, Allison
Objective: Robot-assisted minimally invasive surgery (RMIS) has become the gold standard for a variety of surgical procedures, but the optimal method of training surgeons for RMIS is unknown. We hypothesized that real-time, rather than post-task, error feedback would better increase learning speed and reduce errors. Methods: Forty-two surgical novices learned a virtual version of the ring-on-wire task, a canonical task in RMIS training. We investigated the impact of feedback timing with multi-sensory (haptic and visual) cues in three groups: (1) real-time error feedback, (2) trial replay with error feedback, and (3) no error feedback. Results: Participant performance was evaluated based on the accuracy of ring position and orientation during the task. Participants who received real-time feedback outperformed other groups in ring orientation. Additionally, participants who received feedback in replay outperformed participants who did not receive any error feedback on ring orientation during long, straight path sections. There were no significant differences between groups for ring position overall, but participants who received real-time feedback outperformed the other groups in positional accuracy on tightly curved path sections. Conclusion: The addition of real-time haptic and visual error feedback improves learning outcomes in a virtual surgical task over error feedback in replay or no error feedback at all. Significance: This work demonstrates that multi-sensory error feedback delivered in real time leads to better training outcomes as compared to the same feedback delivered after task completion. This novel method of training may enable surgical trainees to develop skills with greater speed and accuracy.
Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling
Liao, Ning, Wang, Xiaoxing, Lin, Zehao, Guo, Weiyang, Hong, Feng, Song, Shixiang, Yu, Geng, Zhao, Zihua, Xie, Sitao, Wei, Longxuan, Jin, Xiangqi, Qin, Xiaohan, Ma, Jiale, Chen, Kai, Yao, Jiangchao, Lin, Zhouhan, Yan, Junchi, Li, Zhiyu, Xiong, Feiyu, Wang, Yanfeng, Zhang, Linfeng
A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.
Cascading Adversarial Bias from Injection to Distillation in Language Models
Chaudhari, Harsh, Hayes, Jamie, Jagielski, Matthew, Shumailov, Ilia, Nasr, Milad, Oprea, Alina
Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Noh, Dongwon, Koh, Donghyeok, Yuk, Junghun, Kim, Gyuwan, Lee, Jaeyong, Lim, Kyungtae, Park, Cheoneum
Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce \texttt{ScholarBench}, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. \texttt{ScholarBench} targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, \texttt{ScholarBench} evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Wu, Yuanchen, Verma, Saurabh, Lee, Justin, Xiong, Fangzhou, Zhang, Poppy, Awadelkarim, Amel, Chen, Xu, Yuan, Yubai, Hill, Shawndra
Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
The race to make the perfect baby is creating an ethical mess
A new field of science claims to be able to predict aesthetic traits, intelligence, and even moral character in embryos. Is this the next step in human evolution or something more dangerous? Consider, if you will, the translucent blob in the eye of a microscope: a human blastocyst, the biological specimen that emerges just five days or so after a fateful encounter between egg and sperm. This bundle of cells, about the size of a grain of sand pulled from a powdery white Caribbean beach, contains the coiled potential of a future life: 46 chromosomes, thousands of genes, and roughly six billion base pairs of DNA--an instruction manual to assemble a one-of-a-kind human. Now imagine a laser pulse snipping a hole in the blastocyst's outermost shell so a handful of cells can be suctioned up by a microscopic pipette. This is the moment, thanks to advances in genetic sequencing technology, when it becomes possible to read virtually that entire instruction manual. An emerging field of science seeks to use the analysis pulled from that procedure to predict what kind of a person that embryo might become. Some parents turn to these tests to avoid passing on devastating genetic disorders that run in their families. A much smaller group, driven by dreams of Ivy League diplomas or attractive, well-behaved offspring, are willing to pay tens of thousands of dollars to optimize for intelligence, appearance, and personality. Some of the most eager early boosters of this technology are members of the Silicon Valley elite, including tech billionaires like Elon Musk, Peter Thiel, and Coinbase CEO Brian Armstrong. Embryo selection is less like a build-a-baby workshop and more akin to a store where parents can shop for their future children from several available models--complete with stat cards. But customers of the companies emerging to provide it to the public may not be getting what they're paying for. Genetics experts have been highlighting the potential deficiencies of this testing for years.
Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale
Barajeeh, Basad, Yankouskaya, Ala, AlShakhsi, Sameha, Ho, Chun Sing Maxwell, Xu, Guandong, Ali, Raian
As the use of large language models (LLMs) becomes increasingly global, understanding public attitudes toward these systems requires tools that are adapted to local contexts and languages. In the Arab world, LLM adoption has grown rapidly with both globally dominant platforms and regional ones like Fanar and Jais offering Arabic-specific solutions. This highlights the need for culturally and linguistically relevant scales to accurately measure attitudes toward LLMs in the region. Tools assessing attitudes toward artificial intelligence (AI) can provide a base for measuring attitudes specific to LLMs. The 5-item Attitudes Toward Artificial Intelligence (ATAI) scale, which measures two dimensions, the AI Fear and the AI Acceptance, has been recently adopted and adapted to develop new instruments in English using a sample from the UK: the Attitudes Toward General LLMs (AT-GLLM) and Attitudes Toward Primary LLM (AT-PLLM) scales. In this paper, we translate the two scales, AT-GLLM and AT-PLLM, and validate them using a sample of 249 Arabic-speaking adults. The results show that the scale, translated into Arabic, is a reliable and valid tool that can be used for the Arab population and language. Psychometric analyses confirmed a two-factor structure, strong measurement invariance across genders, and good internal reliability. The scales also demonstrated strong convergent and discriminant validity. Our scales will support research in a non-Western context, a much-needed effort to help draw a global picture of LLM perceptions, and will also facilitate localized research and policy-making in the Arab region.
Personalized Learning Path Planning with Goal-Driven Learner State Modeling
Lim, Joy Jia Yin, He, Ye, Yu, Jifan, Cong, Xin, Zhang-Li, Daniel, Liu, Zhiyuan, Liu, Huiqin, Hou, Lei, Li, Juanzi, Xu, Bin
Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore's effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.
Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
Bhattacharya, Antara Raaghavi, Papadimitriou, Isabel, Davidson, Kathryn, Alvarez-Melis, David
Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc., as in "twenty + three"). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.
The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology
Al-Khalifa, Shahad, Durrani, Nadir, Al-Khalifa, Hend, Alam, Firoj
The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the languages spoken most widely around the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.