Instructional Material
AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search
Bi, Shuzhen, Song, Chang, Song, Siyu, Lv, Jinze, Chen, Jian, Wang, Xinyun, Zhou, Aimin, Hao, Hao
Four-Period Detailed Design Period 1: Topic Selection and Initial Exploration Period 2: Principle Analysis and Model Design Period 3: Model Construction and Refinement Period 4: "Historical Technology Expo" with presentations [Includes detailed student reflection prompts, extension activities, and troubleshooting guidance...] Base Model: Generic Outline Interdisciplinary Lesson Plan Design Learning Objectives: Help students understand how physics influences historical progress... Cultivate ability to analyze social factors behind technological development... Class Schedule: Four periods covering physics review, historical technologies, case study, and modern applications. Assessment: Class participation, group reports, reflection journals [Subsequent periods contain only high-level bullet points without actionable details...] 12 Qualitative Analysis This comparison reveals dramatic capability differences for complex generation tasks. The Base Model produces only a generic outline with vague bullet points--entirely insufficient for classroom use. Both AutoSynth and Expert-Designed models generate outstanding, comprehensive lesson plans with detailed objectives, granular activities, and sophisticated assessment schemes. The subtle differences reflect their optimization processes: AutoSynth emphasizes systematic difficulty coverage (likely from iterative refinement), while Expert-Designed showcases deep assessment design expertise. Both represent quantum leaps over baseline, validating that specialized workflows-- automated or manual--are essential for professional-grade content. This supports our quantitative findings (Table 1): while Au-toSynth achieves lower human preference (51 percent vs 96 percent), it produces genuinely high-quality outputs far superior to baseline capabilities.
Unsupervised Feature Selection Through Group Discovery
Lifshitz, Shira, Lindenbaum, Ofir, Mishne, Gal, Meir, Ron, Benisty, Hadas
Unsupervised feature selection (FS) is essential for high-dimensional learning tasks where labels are not available. It helps reduce noise, improve generalization, and enhance in-terpretability. However, most existing unsupervised FS methods evaluate features in isolation, even though informative signals often emerge from groups of related features. For example, adjacent pixels, functionally connected brain regions, or correlated financial indicators tend to act together, making independent evaluation suboptimal. Although some methods attempt to capture group structure, they typically rely on predefined partitions or label supervision, limiting their applicability. We propose GroupFS, an end-to-end, fully differentiable framework that jointly discovers latent feature groups and selects the most informative groups among them, without relying on fixed a priori groups or label supervision. GroupFS enforces Laplacian smoothness on both feature and sample graphs and applies a group sparsity regu-larizer to learn a compact, structured representation. Across nine benchmarks spanning images, tabular data, and biological datasets, GroupFS consistently outperforms state-of-the-art unsupervised FS in clustering and selects groups of features that align with meaningful patterns.
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Tan, Weihao, Li, Xiangyang, Fang, Yunhao, Yao, Heyuan, Yan, Shi, Luo, Hao, Ao, Tenglong, Li, Huihui, Ren, Hongbin, Yi, Bairen, Qin, Yujia, An, Bo, Liu, Libin, Shi, Guang
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLM's Audio Overviews
This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.
Hope, Aspirations, and the Impact of LLMs on Female Programming Learners in Afghanistan
Behmanush, Hamayoon, Akhtari, Freshta, Nooripour, Roghieh, Weber, Ingmar, Cannanure, Vikram Kamath
Designing impactful educational technologies in contexts of socio-political instability requires a nuanced understanding of educational aspirations. Currently, scalable metrics for measuring aspirations are limited. This study adapts, translates, and evaluates Snyder's Hope Scale as a metric for measuring aspirations among 136 women learning programming online during a period of systemic educational restrictions in Afghanistan. The adapted scale demonstrated good reliability (Cronbach's ฮฑ = 0.78) and participants rated it as understandable and relevant. While overall aspiration-related scores did not differ significantly by access to Large Language Models (LLMs), those with access reported marginally higher scores on the Avenues subscale (p = .056), suggesting broader perceived pathways to achieving educational aspirations. These findings support the use of the adapted scale as a metric for aspirations in contexts of socio-political instability. More broadly, the adapted scale can be used to evaluate the impact of aspiration-driven design of educational technologies.
Infinite-Dimensional Operator/Block Kaczmarz Algorithms: Regret Bounds and $ฮป$-Effectiveness
Jeong, Halyun, Jorgensen, Palle E. T., Kwon, Hyun-Kyoung, Song, Myung-Sin
We present a variety of projection-based linear regression algorithms with a focus on modern machine-learning models and their algorithmic performance. We study the role of the relaxation parameter in generalized Kaczmarz algorithms and establish a priori regret bounds with explicit $ฮป$-dependence to quantify how much an algorithm's performance deviates from its optimal performance. A detailed analysis of relaxation parameter is also provided. Applications include: explicit regret bounds for the framework of Kaczmarz algorithm models, non-orthogonal Fourier expansions, and the use of regret estimates in modern machine learning models, including for noisy data, i.e., regret bounds for the noisy Kaczmarz algorithms. Motivated by machine-learning practice, our wider framework treats bounded operators (on infinite-dimensional Hilbert spaces), with updates realized as (block) Kaczmarz algorithms, leading to new and versatile results.
Computational Blueprints: Generating Isomorphic Mathematics Problems with Large Language Models
Kim, Jeong-Hoon, Nam, Jinwoo, Jo, Geunsik
Personalized mathematics education is growing rapidly, creating a strong demand for large sets of similar practice problems. Yet existing studies on mathematics problem generation have focused on data augmentation for training neural language models rather than on direct educational deployment. To bridge this gap, we define a new task, Isomorphic Math Problem Generation (IMPG), designed to produce structurally consistent variants of source problems. Subsequently, we explored LLM-based frameworks for automatic IMPG through successive refinements, and established Computational Blueprints for Isomorphic Twins (CBIT). With meta-level generation and template-based selective variation, CBIT achieves high mathematical correctness and structural consistency while reducing the cost of generation. Empirical results across refinements demonstrate that CBIT is superior on generation accuracy and cost-effectiveness at scale. Most importantly, CBIT-generated problems exhibited an error rate 17.8% lower than expert-authored items, with deployment to 6,732 learners on a commercial education platform yielding 186,870 interactions.
Re:Member: Emotional Question Generation from Personal Memories
Rackauckas, Zackary, Minematsu, Nobuaki, Hirschberg, Julia
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
Efficient Online Continual Learning in Sensor-Based Human Activity Recognition
Zhang, Yao, Clayton, Souza Leite, Xiao, Yu
Abstract--Machine learning models for sensor-based human activity recognition (HAR) are expected to adapt post-deployment to recognize new activities and different ways of performing existing ones. T o address this need, Online Continual Learning (OCL) mechanisms have been proposed, allowing models to update their knowledge incrementally as new data become available while preserving previously acquired information. However, existing OCL approaches for sensor-based HAR are computationally intensive and require extensive labeled samples to represent new changes. Recently, pre-trained model-based (PTM-based) OCL approaches have shown significant improvements in performance and efficiency for computer vision applications. These methods achieve strong generalization capabilities by pre-training complex models on large datasets, followed by fine-tuning on downstream tasks for continual learning. However, applying PTM-based OCL approaches to sensor-based HAR poses significant challenges due to the inherent heterogeneity of HAR datasets and the scarcity of labeled data in post-deployment scenarios. This paper introduces PTRN-HAR, the first successful application of PTM-based OCL to sensor-based HAR. This extractor is then frozen during the streaming stage. Furthermore, it replaces the conventional dense classification layer with a relation module network. Our design not only significantly reduces the resource consumption required for model training while maintaining high performance, but also improves data efficiency by reducing the amount of labeled data needed for effective continual learning, as demonstrated through experiments on three public datasets, outperforming the state-of-the-art. The code can be found here: https://anonymous.4open.science/r/PTRN-HAR-AF60/ HE recognition of human activities using wearable sensors such as Inertial Measurement Unit (IMU) encompasses many practical applications in smart homes [1], healthcare [2], and manufacturing [3]. HAR is typically achieved using machine learning models trained on sensor data from a predefined set of activity classes collected from a selected group of subjects. After deployment, these models often need to evolve to recognize new activities or adapt to the distribution shift caused by changes in users' activity patterns due to aging, disease, or simply a distinct manner of performing the same activity [4].
Towards Ecologically Valid LLM Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Li, Charlotte, Hagar, Nick, Nishal, Sachita, Gilbert, Jeremy, Diakopoulos, Nick
Benchmarks play a significant role in how researchers and the public understand generative AI systems. However, the widespread use of benchmark scores to communicate about model capabilities has led to criticisms of validity, especially whether benchmarks test what they claim to test (i.e. construct validity) and whether benchmark evaluations are representative of how models are used in the wild (i.e. ecological validity). In this work we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach. We focus on designing a domain-oriented benchmark for journalism practitioners, drawing on insights from a workshop of 23 journalism professionals. Our workshop findings surface specific challenges that inform benchmark design opportunities, which we instantiate in a case study that addresses underlying criticisms and specific domain concerns. Through our findings and design case study, this work provides design guidance for developing benchmarks that are better tuned to specific domains.