Goto

Collaborating Authors

 Industry


Bandit and Delayed Feedback in Online Structured Prediction

Neural Information Processing Systems

Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the *surrogate regret*, *i.e.,* the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, *bandit* and *delayed* feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound.


OpenAI says fake accounts from China tried to turn Americans against data centers

Engadget

The company has published a report about China-linked influence campaigns that used ChatGPT. OpenAI has published a report about ChatGPT users, who it says were likely based in China, that used the chatbot to plan a campaign designed to sway Americans' opinions about AI data centers. It divided the users into two clusters, the first of which it had designated the Data Center Bandwagon group. Accounts categorized in the group allegedly asked ChatGPT to generate English-language talking points and images, such as comic strips, which focus on how AI data centers drive up demand in electricity and how that leads to higher bills for consumers. The company says these users posed as Americans from a variety of backgrounds on social media, where they had posted the text and image output they got from ChatGPT.


PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Neural Information Processing Systems

Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations.We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.


HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

Neural Information Processing Systems

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present $\textbf{HARDMath2}$, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.


Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings

Neural Information Processing Systems

Generating diverse, all atom conformational ensembles of dynamic proteins such as G protein coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all atom protein structures, including every side chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral angle losses, maps back to Cartesian coordinates. Using D2R-MD, a $2\mu\text{s}$ MD trajectory (12 000 frames) of the human dopamine D$2$ receptor in a membrane environment, the sequential and residue-based pooling strategies reproduce the reference ensemble with high structural fidelity (all atom lDDT \~ $0.7$; $C\alpha$-lDDT \~ $0.8$) and recovers backbone and side chain dihedral angle distributions with a Jensen-Shannon divergence $


CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization

Neural Information Processing Systems

Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. As the adoption of machine learning models becomes increasingly prevalent, the associated lifecycle carbon footprint is expected to increase, including both from training and inference and from AI hardware manufacturing. We introduce CATransformers, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, CATransformers enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency-or energy-centric approaches. Evaluated across a range of Transformer models, CATransformers consistently demonstrates the potential to reduce total carbon emissions --by up to 30\% -- while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance.


Can Large Language Models Master Complex Card Games?

Neural Information Processing Systems

Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.


Generative Distribution Embeddings: Lifting autoencoders to the space of distributions for multiscale representation learning

Neural Information Processing Systems

Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions.


In aging South Korea, AI dolls are caring for the elderly

The Japan Times

Bang Chun-ja, a 78-year-old South Korean woman living alone, holding Hyodol, an artificial intelligence-powered healthcare doll designed for the elderly, during an interview at her home in Yongin in April. Yongin, South Korea - In her tiny apartment in South Korea where she lives alone, 78-year-old Bang Chun-ja spends her days with a childlike artificial intelligence-powered doll she says she prefers to people. The doll greets Bang when she returns home, sings to her when she feels bored, reminds her not to skip meals or medication -- helping her maintain a routine -- and tells her it loves her. Bang has limited contact with her grown-up daughter, and fell into severe depression after major back surgery, spending hours alone staring at the ceiling in pain. In a time of both misinformation and too much information, quality journalism is more crucial than ever.


Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Neural Information Processing Systems

Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. We make all of our experimental data and code available.