Instructional Material
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Wen, Xueru, Lou, Jie, Li, Zichao, Lu, Yaojie, Yu, Xing, Ji, Yuqiu, Xu, Guohai, Lin, Hongyu, He, Ben, Han, Xianpei, Sun, Le, Zhang, Debing
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
MedSimAI: Simulation and Formative Feedback Generation to Enhance Deliberate Practice in Medical Education
Hicke, Yann, Geathers, Jadon, Rajashekar, Niroop, Chan, Colleen, Jack, Anyanate Gwendolyne, Sewell, Justin, Preston, Mackenzi, Cornes, Susannah, Shung, Dennis, Kizilcec, Rene
Medical education faces challenges in scalability, accessibility, and consistency, particularly in clinical skills training for physician-patient communication. Traditional simulation-based learning, while effective, is resource-intensive, difficult to schedule, and often highly variable in feedback quality. Through a collaboration between AI, learning science, and medical education experts, we co-developed MedSimAI, an AI-powered simulation platform that enables deliberate practice, self-regulated learning (SRL), and automated assessment through interactive patient encounters. Leveraging large language models (LLMs), MedSimAI generates realistic clinical interactions and provides immediate, structured feedback using established medical evaluation frameworks such as the Master Interview Rating Scale (MIRS). In a pilot study with 104 first-year medical students, we examined engagement, conversation patterns, and user perceptions. Students found MedSimAI beneficial for repeated, realistic patient-history practice. Conversation analysis revealed that certain higher-order skills were often overlooked, though students generally performed systematic histories and empathic listening. By integrating unlimited practice opportunities, real-time AI assessment, and SRL principles, MedSimAI addresses key limitations of traditional simulation-based training, making high-quality clinical education more accessible and scalable.
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos
Wei, Kangda, Zhou, Zhengyu, Wang, Bingqing, Araki, Jun, Lange, Lukas, Huang, Ruihong, Feng, Zhe
In recent years, online lecture videos have become an increasingly popular resource for acquiring new knowledge. Systems capable of effectively understanding/indexing lecture videos are thus highly desirable, enabling downstream tasks like question answering to help users efficiently locate specific information within videos. This work proposes PreMind, a novel multi-agent multimodal framework that leverages various large models for advanced understanding/indexing of presentation-style videos. PreMind first segments videos into slide-presentation segments using a Vision-Language Model (VLM) to enhance modern shot-detection techniques. Each segment is then analyzed to generate multimodal indexes through three key steps: (1) extracting slide visual content, (2) transcribing speech narratives, and (3) consolidating these visual and speech contents into an integrated understanding. Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting/correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self-reflection in vision analysis. Compared to traditional video indexing methods, PreMind captures rich, reliable multimodal information, allowing users to search for details like abbreviations shown only on slides. Systematic evaluations on the public LPM dataset and an internal enterprise dataset are conducted to validate PreMind's effectiveness, supported by detailed analyses.
Experiences with Content Development and Assessment Design in the Era of GenAI
Sharma, Aakanksha, Shailendra, Samar, Kadel, Rajan
Generative Artificial Intelligence (GenAI) has the potential to transform higher education by generating human-like content. The advancement in GenAI has revolutionised several aspects of education, especially subject and assessment design. In this era, it is crucial to design assessments that challenge students and cannot be solved using GenAI tools. This makes it necessary to update the educational content with rapidly evolving technology. The assessment plays a significant role in ensuring the students learning, as it encourages students to engage actively, leading to the achievement of learning outcomes. The paper intends to determine how effectively GenAI can design a subject, including lectures, labs and assessments, using prompts and custom-based training. This paper aims to elucidate the direction to educators so they can leverage GenAI to create subject content. Additionally, we provided our experiential learning for educators to develop content, highlighting the importance of prompts and fine-tuning to ensure output quality. It has also been observed that expert evaluation is essential for assessing the quality of GenAI-generated materials throughout the content generation process.
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Kumar, Komal, Ashraf, Tajamul, Thawakar, Omkar, Anwer, Rao Muhammad, Cholakkal, Hisham, Shah, Mubarak, Yang, Ming-Hsuan, Torr, Phillip H. S., Khan, Salman, Khan, Fahad Shahbaz
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
XAIxArts Manifesto: Explainable AI for the Arts
Bryan-Kinns, Nick, Zheng, Shuoyang Jasper, Castro, Francisco, Lewis, Makayla, Chang, Jia-Rey, Vigliensoni, Gabriel, Broad, Terence, Clemens, Michael, Wilson, Elizabeth
Explainable AI (XAI) is concerned with how to make AI models more understandable to people. To date these explanations have predominantly been technocentric - mechanistic or productivity oriented. This paper introduces the Explainable AI for the Arts (XAIxArts) manifesto to provoke new ways of thinking about explainability and AI beyond technocentric discourses. Manifestos offer a means to communicate ideas, amplify unheard voices, and foster reflection on practice. To supports the co-creation and revision of the XAIxArts manifesto we combine a World Caf\'e style discussion format with a living manifesto to question four core themes: 1) Empowerment, Inclusion, and Fairness; 2) Valuing Artistic Practice; 3) Hacking and Glitches; and 4) Openness. Through our interactive living manifesto experience we invite participants to actively engage in shaping this XIAxArts vision within the CHI community and beyond.
MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Drechsel, Jonathan, Reusch, Anja, Herbold, Steffen
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation, which can be used to train language models with enhanced mathematical embeddings.
Applications of Statistical Field Theory in Deep Learning
Ringel, Zohar, Rubin, Noa, Mor, Edo, Helias, Moritz, Seroussi, Inbar
Deep learning algorithms have made incredible strides in the past decade yet due to the complexity of these algorithms, the science of deep learning remains in its early stages. Being an experimentally driven field, it is natural to seek a theory of deep learning within the physics paradigm. As deep learning is largely about learning functions and distributions over functions, statistical field theory, a rich and versatile toolbox for tackling complex distributions over functions (fields) is an obvious choice of formalism. Research efforts carried out in the past few years have demonstrated the ability of field theory to provide useful insights on generalization, implicit bias, and feature learning effects. Here we provide a pedagogical review of this emerging line of research.
Enhancing Collaborative Filtering-Based Course Recommendations by Exploiting Time-to-Event Information with Survival Analysis
Gharahighehi, Alireza, Ghinis, Achilleas, Venturini, Michela, Cornillie, Frederik, Vens, Celine
These authors contributed equally to this work. Abstract Massive Open Online Courses (MOOCs) are emerging as a popular alternative to traditional education, offering learners the flexibility to access a wide range of courses from various disciplines, anytime and anywhere. To enhance learner engagement, it is crucial to recommend courses that align with their preferences and needs. Course Recommender Systems (RSs) can play an important role in this by modeling learners' preferences based on their previous interactions within the MOOC platform. Time-to-dropout and time-to-completion in MOOCs, like other time-to-event prediction tasks, can be effectively modeled using survival analysis (SA) methods. In this study, we apply SA methods to improve collaborative filtering recommendation performance by considering time-to-event in the context of MOOCs. The findings underscore the potential of integrating SA methods with RSs to enhance personalization in MOOCs. Keywords: recommendation systems, survival analysis, massive open online course, personalized learning, dropout 1 Introduction Massive Open Online Courses (MOOCs) platforms offer a diverse range of online courses to learners around the globe, promoting equitable education by breaking down barriers related to geography and time.
Artificial Intelligence in Sports: Insights from a Quantitative Survey among Sports Students in Germany about their Perceptions, Expectations, and Concerns regarding the Use of AI Tools
Krämer, Dennis, Bosold, Anja, Minarik, Martin, Schyvinck, Cleo, Hajek, Andre
Generative Artificial Intelligence (AI) tools such as ChatGPT, Copilot, or Gemini have a crucial impact on academic research and teaching. Empirical data on how students perceive the increasing influence of AI, which different types of tools they use, what they expect from them in their daily academic tasks, and their concerns regarding the use of AI in their studies are still limited. The manuscript presents findings from a quantitative survey conducted among sports students of all semesters in Germany using an online questionnaire. It explores aspects such as students' usage behavior, motivational factors, and uncertainties regarding the impact of AI tools on academia in the future. Furthermore, the social climate in sports studies is being investigated to provide a general overview of the current situation of the students in Germany. Data collection took place between August and November 2023, addressing all sports departments at German universities, with a total of 262 students participating. Our Findings indicate that students have a strong interest in using AI tools in their studies, expecting them to improve their overall academic performance, understand the complexity of scientific approaches, and save time. They express confidence that the proliferation of AI will not compromise their critical thinking skills. Moreover, students are positive about integrating more AI-related topics into the curriculum and about lecturers adopting more AI-based teaching methods. However, our findings also show that students have concerns about plagiarism, lecturer preparedness and their own skills and future skill development.