Education
Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
Kamath, Anusha, Singla, Kanishk, Paul, Rakesh, Joshi, Raviraj, Vaidya, Utkarsh, Chauhan, Sanjay Singh, Wartikar, Niranjan
Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use
Department of Physics, University of Central Florida, 4111 Libra Drive, Orlando, Florida, USA 32816
We present a method for generating large numbers of isomorphic physics problems using generative AI services such as ChatGPT, through prompt chaining and tool use. This approach enables precise control over structural variations (such as numeric values and spatial relations) while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against two simpler prompt-based approaches. Results show that prompt-chaining produces significantly higher quality and more consistent outputs than simpler, non-chaining prompts. We also show that GenAI services can be used to validate the quality of the generated isomorphic problems. This work demonstrates a promising method for efficient and scalable problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.
I. INTRODUCTION
There has been significant progress in developing Automated Question Generation (AQG) and Automated Item Generation (AIG) technologies in education over the past decade. These technologies aim to reduce the time and cost of item creation while increasing the availability of questions for both assessment and practice [1]. Early AQG/AIG approaches primarily relied on hard-coded, template-based methods, which were often time-consuming to develop and required domain-specific programming [2]. More recent research has shifted toward leveraging large language models (LLMs).
Near-Optimality of Contrastive Divergence Algorithms
Glaser, Pierre, Huang, Kevin Han, Gretton, Arthur
We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1 / 3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1 / 2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.
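To make the fully online scheme concrete, here is a minimal sketch of one-step CD on a toy exponential family. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the model $p_\theta(x) \propto \exp(\theta x - x^2/2)$ (a unit-variance Gaussian with unknown mean), the random-walk Metropolis kernel, the step-size schedule `lr0 / (t + t0)`, and the function names. For an exponential family with sufficient statistic $T(x) = x$, the log-likelihood gradient is $T(x) - E_\theta[T(X)]$; CD replaces the intractable expectation with $T$ evaluated at a sample obtained by running one Markov chain step from the data point.

```python
import math
import random

def metropolis_step(x, theta, rng, step=1.0):
    """One random-walk Metropolis step targeting p_theta(x) proportional to exp(theta*x - x^2/2)."""
    proposal = x + rng.gauss(0.0, step)
    log_ratio = (theta * proposal - 0.5 * proposal ** 2) - (theta * x - 0.5 * x ** 2)
    accept_prob = math.exp(min(0.0, log_ratio))
    return proposal if rng.random() < accept_prob else x

def online_cd(samples, lr0=5.0, t0=10, seed=0):
    """Fully online CD-1: each data point is used once, with one MCMC step per update.

    lr0 and t0 are illustrative step-size constants, not values from the paper.
    """
    rng = random.Random(seed)
    theta = 0.0
    for t, x in enumerate(samples, start=1):
        # "Negative" sample: one Markov chain step started at the data point.
        x_neg = metropolis_step(x, theta, rng)
        # CD update with T(x) = x: data statistic minus negative-sample statistic.
        theta += lr0 / (t + t0) * (x - x_neg)
    return theta

if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(2.0, 1.0) for _ in range(20000)]
    print(online_cd(data))  # estimate of the true mean 2.0
```

When `theta` equals the true parameter, the Metropolis kernel leaves the data distribution invariant, so the expected update is zero; away from it, the chain step pulls the negative sample toward the model mean and the update moves `theta` toward the data.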
Learning with Incomplete Context: Linear Contextual Bandits with Pretrained Imputation
Yan, Hao, Zhang, Heyan, Guo, Yongyi
The rise of large-scale pretrained models has made it feasible to generate predictive or synthetic features at low cost, raising the question of how to incorporate such surrogate predictions into downstream decision-making. We study this problem in the setting of online linear contextual bandits, where contexts may be complex, nonstationary, and only partially observed. In addition to bandit data, we assume access to an auxiliary dataset containing fully observed contexts--common in practice since such data are collected without adaptive interventions. We propose PULSE-UCB, an algorithm that leverages pretrained models trained on the auxiliary data to impute missing features during online decision-making. We establish regret guarantees that decompose into a standard bandit term plus an additional component reflecting pretrained model quality. In the i.i.d. context case with Hölder-smooth missing features, PULSE-UCB achieves near-optimal performance, supported by matching lower bounds. Our results quantify how uncertainty in predicted contexts affects decision quality and how much historical data is needed to improve downstream learning.
If you love AI, you'll love Ken Liu's new cyberpunk thriller
In Ken Liu's All That We See or Seem, a once-famous hacker must find a missing dream-weaver. The latest novel by Ken Liu, All That We See or Seem, is the near-future story of the mysterious disappearance of a professional dream-weaver called Elli. It is being marketed as a cyberpunk thriller. Full disclosure: I don't generally seek out thrillers or cyberpunk books, so I may not be the target audience for this. But I was keen to read it because Liu has not one but two claims to fame: as well as being the author of a celebrated fantasy series called The Dandelion Dynasty, he is also the translator of the sensationally good Remembrance of Earth's Past trilogy by Cixin Liu.
Winners and Losers of the AI Revolution: Artificial Intelligence Is Radically Changing the Employment Landscape
Artificial intelligence is becoming a permanent element in the world of work, with Silicon Valley calling it the dawning of a new age. Many people are afraid of losing their job, but Germany is well-prepared. In the northern part of the U.S. state of Louisiana, right next to the prison on the outskirts of Shreveport, looms a gigantic building of concrete and steel. "Welcome to the future," reads a colorful greeting painted on the wall at the entrance, right next to the obligatory American flag. It is 9:30 a.m., a busy time of day. Yet the halls and corridors of SHV1, as the building is referred to internally, are completely empty of people. "A blueprint for the future," as the site manager calls it. The Seattle-based company operates the largest fleet of industrial robots in the world, more than a million of them, and many are outfitted with artificial intelligence, helping them to lift, sort, search, weigh and scan. Guided and directed completely by AI. "Without the massive use of this technology," says Aaron Parness, a former NASA aerospace engineer who now heads up the retail giant's AI robotic department, "we would be a different company." The article you are reading originally appeared in German in issue 41/2025 (October 2nd, 2025) of DER SPIEGEL. Amazon, though, also employs people. But their role is changing rapidly.
The quest to find out how our bodies react to extreme temperatures
Scientists hope to prevent deaths from climate change, but heat and cold are more complicated than we thought. Libby Cowgill, an anthropologist in a furry parka, has wheeled me and my cot into a metal-walled room set to 40 F. A loud fan pummels me from above and siphons the dregs of my body heat through the cot's mesh from below. A large respirator fits snug over my nose and mouth. The device tracks carbon dioxide in my exhales--a proxy for how my metabolism speeds up or slows down throughout the experiment. Eventually Cowgill will remove my respirator to slip a wire-thin metal temperature probe several pointy inches into my nose. Cowgill and a graduate student quietly observe me from the corner of their so-called "climate chamber." Just a few hours earlier I'd sat beside them to observe as another volunteer, a 24-year-old personal trainer, endured the cold. Every few minutes, they measured his skin temperature with a thermal camera, his core temperature with a wireless pill, and his blood pressure and other metrics that hinted at how his body handles extreme cold. He lasted almost an hour without shivering; when my turn comes, I shiver aggressively on the cot for nearly an hour straight. I'm visiting Texas to learn about this experiment on how different bodies respond to extreme climates. I jokingly ask Cowgill as she tapes biosensing devices to my chest and legs. After I exit the cold, she surprises me: "You, believe it or not, were not the worst person we've ever seen." Climate change forces us to reckon with the knotty science of how our bodies interact with the environment. Cowgill is a 40-something anthropologist at the University of Missouri who powerlifts and teaches CrossFit in her spare time. She's small and strong, with dark bangs and geometric tattoos.
Since 2022, she's spent the summers at the University of North Texas Health Science Center tending to these uncomfortable experiments. Her team hopes to revamp the science of thermoregulation. While we know in broad strokes how people thermoregulate, the science of keeping warm or cool is mottled with blind spots. "We have the general picture.
What's coming up at #IROS2025?
The 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) will be held from 19-25 October in Hangzhou, China. The programme includes plenary and keynote talks, workshops, tutorials, forums, competitions, and a debate. There are three plenary talks on the programme this year, with one per day on Tuesday 21, Wednesday 22, and Thursday 23 October. On Wednesday, a debate will be held on the following topic: The participants will be: XingXing Wang (Unitree Robotics), Jun-Ho Oh (Samsung and Rainbow Robotics), Hong Qiao (Chinese Academy of Sciences), Andra Keay (Silicon Valley Robotics), Yu Sun (EiC, IEEE Trans on Automation Science and Engineering), Tamim Asfour (Professor of Humanoid Robotics, Karlsruhe Institute of Technology), Ken Goldberg (UC Berkeley, Moderator). There are three tutorials planned, taking place on Monday 20 and Friday 24 October.
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent
Li, Jiaqi, Lou, Zhipeng, Schmidt-Hieber, Johannes, Wu, Wei Biao
Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings remain poorly understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q\ge2$ in general $\ell^s$-norms, and, in particular, the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
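As a small illustration of the two estimators discussed in this abstract, the sketch below runs constant learning-rate SGD on a streaming least-squares problem and keeps the Ruppert-Polyak running average of the iterates, then reports the error in the $\ell^{\infty}$-norm. The linear model, the dimension, the learning rate, and all function names are assumptions made for this example; it is not the paper's setting or analysis, only a low-dimensional toy showing how the last iterate keeps fluctuating around the target while the average settles.

```python
import random

def make_stream(n, theta_star, noise=0.5, seed=1):
    """Yield n pairs (x, y) with y = <theta_star, x> + Gaussian noise (toy data)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in theta_star]
        y = sum(t * xi for t, xi in zip(theta_star, x)) + rng.gauss(0.0, noise)
        yield x, y

def sgd_and_asgd(stream, dim, lr=0.02):
    """Constant-step SGD on squared loss; also returns the running average (ASGD)."""
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for k, (x, y) in enumerate(stream, start=1):
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        # Gradient of 0.5 * residual^2 with respect to theta is residual * x.
        theta = [t - lr * residual * xi for t, xi in zip(theta, x)]
        # Incremental mean of the iterates seen so far.
        theta_bar = [b + (t - b) / k for b, t in zip(theta_bar, theta)]
    return theta, theta_bar

def sup_norm_error(estimate, target):
    """Largest coordinate-wise absolute error (the l-infinity distance)."""
    return max(abs(a - b) for a, b in zip(estimate, target))

if __name__ == "__main__":
    theta_star = [1.0, -2.0, 0.5, 0.0, 3.0]
    theta, theta_bar = sgd_and_asgd(make_stream(20000, theta_star), dim=5)
    print("last iterate:", sup_norm_error(theta, theta_star))
    print("averaged    :", sup_norm_error(theta_bar, theta_star))
```

With a constant step the last iterate does not converge to a point but to a stationary distribution around the minimizer, which is why the averaged iterate is typically the more accurate of the two in runs like this one.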
LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop
Zhao, Runcong, Bobrov, Artem, Li, Jiazheng, Aloisi, Cesare, He, Yulan
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.