Personal
Congratulations to the #AAMAS2025 best paper, best demo, and distinguished dissertation award winners
The AAMAS 2025 best paper and demo awards were presented at the 24th International Conference on Autonomous Agents and Multiagent Systems, which took place from 19-23 May 2025 in Detroit. The Distinguished Dissertation Award was also recently announced. The Victor Lesser Distinguished Dissertation Award is given for dissertations in the field of autonomous agents and multiagent systems that show originality, depth, impact, as well as quality of writing, supported by high-quality publications.
A Human-Centric Approach to Explainable AI for Personalized Education
Deep neural networks form the backbone of artificial intelligence research, with potential to transform the human experience in areas ranging from autonomous driving to personal assistants, healthcare to education. However, their integration into the daily routines of real-world classrooms remains limited. It is not yet common for a teacher to assign students individualized homework targeting their specific weaknesses, provide students with instant feedback, or simulate student responses to a new exam question. While these models excel in predictive performance, this lack of adoption can be attributed to a significant weakness: the lack of explainability of model decisions, leading to a lack of trust from students, parents, and teachers. This thesis aims to bring human needs to the forefront of eXplainable AI (XAI) research, grounded in the concrete use case of personalized learning and teaching. We frame the contributions along two verticals: technical advances in XAI and their aligned human studies. We investigate explainability in AI for education, revealing systematic disagreements between post-hoc explainers and identifying a need for inherently interpretable model architectures. We propose four novel technical contributions in interpretability with a multimodal modular architecture (MultiModN), an interpretable mixture-of-experts model (InterpretCC), adversarial training for explainer stability, and a theory-driven LLM-XAI framework to present explanations to students (iLLuMinaTE), which we evaluate in diverse settings with professors, teachers, learning scientists, and university students. By combining empirical evaluations of existing explainers with novel architectural designs and human studies, our work lays a foundation for human-centric AI systems that balance state-of-the-art performance with built-in transparency and trust.
Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data
Zhang, Jihong, Liang, Xinya, Deng, Anqi, Bonge, Nicole, Tan, Lin, Zhang, Ling, Zarrett, Nicole
Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.
Sam Altman and Jony Ive Will Force A.I. Into Your Life
Ive led the designs of the original iMac, the iPad, and the Apple Watch, among other era-defining products. Then, in 2019, he left Apple to start his own design firm called LoveFrom. The news of his move to OpenAI felt something like learning that LeBron James was joining the Miami Heat: Ive had become synonymous with Apple's success, perhaps second only to Jobs. Now, after a period of independence, he was choosing a new team. The announcement of the deal with OpenAI--for a reported 6.5 billion in OpenAI equity--came via a press release, featuring a rather cuddly portrait of Ive with OpenAI's C.E.O. and co-founder, Sam Altman (shot by the British fashion photographer Craig McDean) and a faux-casual videotaped interview session between the two at San Francisco's Cafe Zoetrope. In it, Altman describes "a family of devices that would let people use A.I. to create all sorts of wonderful things," enabled by "magic intelligence in the cloud."
Federated Instrumental Variable Analysis via Federated Generalized Method of Moments
Geetika, null, Tyagi, Somya, Chatterjee, Bapi
Instrumental variables (IV) analysis is an important applied tool for areas such as healthcare and consumer economics. For IV analysis in high-dimensional settings, the Generalized Method of Moments (GMM) using deep neural networks offers an efficient approach. With non-i.i.d. data sourced from scattered decentralized clients, federated learning is a popular paradigm for training the models while promising data privacy. However, to our knowledge, no federated algorithm for either GMM or IV analysis exists to date. In this work, we introduce federated instrumental variables analysis (FedIV) via federated generalized method of moments (FedGMM). We formulate FedGMM as a federated zero-sum game defined by a federated non-convex non-concave minimax optimization problem, which is solved using federated gradient descent ascent (FedGDA) algorithm. One key challenge arises in theoretically characterizing the federated local optimality. To address this, we present properties and existence results of clients' local equilibria via FedGDA limit points. Thereby, we show that the federated solution consistently estimates the local moment conditions of every participating client. The proposed algorithm is backed by extensive experiments to demonstrate the efficacy of our approach.
Frictional Agent Alignment Framework: Slow Down and Don't Break Things
Nath, Abhijnan, Graff, Carine, Bachinin, Andrei, Krishnaswamy, Nikhil
AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware "friction" that prompts for deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.
Benchmarking and Rethinking Knowledge Editing for Large Language Models
He, Guoxiu, Song, Xin, Wang, Futing, Sun, Aixin
Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model
de Haan, Tijmen, Ting, Yuan-Sen, Ghosal, Tirthankar, Nguyen, Tuan Dung, Accomazzi, Alberto, Herron, Emily, Lama, Vanessa, Pan, Rui, Wells, Azton, Ramachandra, Nesar
General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning and model merging. Beyond its 70-billion parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately, or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark -- comprising 4,425 questions from literature withheld during training -- AstroSage-70B achieves state-of-the-art performance. It surpasses all other tested open-weight and proprietary models, including leading systems like o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, Deepseek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Patel, Hitesh Laxmichand, Agarwal, Amit, Das, Arion, Kumar, Bhargava, Panda, Srikant, Pattnayak, Priyaranjan, Rafi, Taki Hasan, Kumar, Tejaswini, Chae, Dong-Kyu
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
Let's Talk About ChatGPT and Cheating in the Classroom
There's been a lot of talk about how AI tools like ChatGPT are changing education. Students are using AI to do research, write papers, and get better grades. So today on the show, we debate whether using AI in school is actually cheating. Plus, we dive into how students and teachers are using these tools, and we ask what place AI should have in the future of learning. Write to us at uncannyvalley@wired.com.