Large Language Model
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Ananthram, Amith, Stengel-Eskin, Elias, Bradford, Lorena A., Demarest, Julia, Purvis, Adam, Krut, Keith, Stein, Robert, Pantalony, Rina Elster, Bansal, Mohit, McKeown, Kathleen
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
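The correlation figure quoted above is a Spearman rank correlation between metric scores and human judgments. As a rough illustration only, the sketch below computes Spearman's $ρ$ in plain Python by ranking both lists (ties share an average rank) and taking the Pearson correlation of the ranks. The function names and the example scores are hypothetical and are not taken from PoSh or DOCENT.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one metric score and one human rating per description.
metric_scores = [0.62, 0.35, 0.80, 0.51, 0.47]
human_ratings = [4, 2, 5, 3, 3]
rho = spearman(metric_scores, human_ratings)
```

Because only ranks matter, a metric can score on any scale and still be compared against ordinal human ratings.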
General Exploratory Bonus for Optimistic Exploration in RLHF
Li, Wendi, Oh, Changdae, Li, Sharon
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $α$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $α$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment
Cao, Xiaoyang, Xu, Zelai, Guang, Mo, Long, Kaiwen, Bakker, Michiel A., Wang, Yu, Yu, Chao
Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that preference data is clean and that all labels are equally reliable. In practice, large-scale preference datasets contain substantial noise due to annotator mistakes, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This mismatch between recorded labels and ground-truth preferences can misguide training and degrade model performance. To address this issue, we introduce Robust Enhanced Policy Optimization (RE-PO), which uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweight data points in the training loss to mitigate label noise. We further generalize this idea by establishing a theoretical link between arbitrary preference losses and their underlying probabilistic models, enabling a systematic transformation of existing alignment algorithms into robust counterparts and elevating RE-PO from a single method to a general framework for robust preference alignment. Theoretically, we prove that, under a perfectly calibrated model, RE-PO recovers the true noise level of the dataset. Empirically, we show that RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO); when applied to Mistral and Llama 3 models, the RE-PO-enhanced variants increase AlpacaEval 2 win rates by up to 7.0 percent over their respective baselines.
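The expectation-maximization idea described above can be sketched generically. The code below is not the RE-PO algorithm itself; it is a minimal illustration that assumes a single global flip rate `eps`, a model that assigns probability `p` to each recorded preference, and a logistic (DPO-style) loss on a reward margin. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def posterior_correct(p_label, eps):
    """E-step: probability that a recorded label was not flipped.

    p_label is the model's probability of the recorded preference and eps
    is the assumed flip rate; both are expected to lie strictly in (0, 1).
    """
    clean = (1.0 - eps) * p_label
    flipped = eps * (1.0 - p_label)
    return clean / (clean + flipped)


def em_noise_rate(p_labels, eps=0.2, iters=50):
    """Alternate E-steps and M-steps with the model probabilities held fixed."""
    for _ in range(iters):
        weights = [posterior_correct(p, eps) for p in p_labels]
        # M-step: the flip rate is the mean posterior probability of a flip.
        eps = 1.0 - sum(weights) / len(weights)
    weights = [posterior_correct(p, eps) for p in p_labels]
    return eps, weights


def reweighted_loss(margins, weights):
    """Logistic preference loss where each pair counts with weight w for the
    recorded label and 1 - w for the flipped label."""
    total = 0.0
    for m, w in zip(margins, weights):
        total -= w * math.log(sigmoid(m)) + (1.0 - w) * math.log(sigmoid(-m))
    return total / len(margins)


# Hypothetical reward margins for four recorded preference pairs.
margins = [2.0, 1.5, -1.0, 0.3]
p_labels = [sigmoid(m) for m in margins]
eps_hat, weights = em_noise_rate(p_labels)
loss = reweighted_loss(margins, weights)
```

In a real training loop the model probabilities would be refreshed as the policy updates, so the weights and the policy would be estimated jointly rather than from fixed margins as here.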
Large Language Models Miss the Multi-Agent Mark
La Malfa, Emanuele, La Malfa, Gabriele, Marro, Samuele, Zhang, Jie M., Black, Elizabeth, Luck, Michael, Torr, Philip, Wooldridge, Michael
Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLM implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.
OpenAI Should Stop Naming Its Creations After Products That Already Exist
From "cameo" to "io," OpenAI keeps trying to call its new and upcoming releases by names that resemble existing trademarks. In September, OpenAI launched a way for users to generate a digital likeness of themselves they could use to create personalized deepfake videos. This is one of the core features in Sora, OpenAI's app for sharing AI videos inside a TikTok-style feed. The self-deepfaking feature was called "cameo," and with that standout feature, Sora quickly rose to the top of Apple's iOS download charts. This feature name led to a trademark lawsuit with Cameo, the app where fans can pay celebrities to record personalized videos.
OpenAI turns off ads on ChatGPT as AI falls short
Expect them to be turned back on eventually, however. OpenAI has turned off ads appearing on ChatGPT while it works out how best to improve the model's precision, its top researchers said. In early December, a user complained about the nonsensical way in which ChatGPT was showing ads for Target below a conversation the user was having about Windows' BitLocker. In response, Mark Chen, the chief research officer at OpenAI, said that the company would look into the situation.
The State of AI: A vision of the world in 2030
Senior AI editor Will Douglas Heaven talks with Tim Bradshaw, FT global tech correspondent, about what our world will look like in the next five years. Welcome back to The State of AI, a new collaboration between the Financial Times and MIT Technology Review. Every time I'm asked what's coming next, I get a Luke Haines song stuck in my head: "Please don't ask me about the future / I am not a fortune teller." What will things be like in 2030?
The Download: four (still) big breakthroughs, and how our bodies fare in extreme heat
Plus: A CDC panel voted to recommend delaying the hepatitis B vaccine for babies. If you're a longtime reader, you probably know that our newsroom selects 10 breakthroughs every year that we think will define the future. This group exercise is mostly fun and always engrossing, with plenty of lively discussion along the way, but at times it can also be quite difficult. The 2026 list will come out on January 12, so stay tuned. In the meantime, we wanted to share some of the technologies from this year's reject pile, as a window into our decision-making process. These four technologies won't be on our 2026 list of breakthroughs, but all were closely considered, and we think they're worth knowing about.
Knowing Your Uncertainty -- On the application of LLM in social sciences
Zhang, Bolun, Li, Linzhuo, Chen, Yunqi, Zhao, Qinlin, Zhu, Zihan, Yi, Xiaoyuan, Xie, Xing
Large language models (LLMs) are rapidly being integrated into computational social science research, yet their black-boxed training and designed stochastic elements in inference pose unique challenges for scientific inquiry. This article argues that applying LLMs to social scientific tasks requires explicit assessment of uncertainty, an expectation long established in both quantitative methodology in the social sciences and machine learning. We introduce a unified framework for evaluating LLM uncertainty along two dimensions: the task type (T), which distinguishes between classification, short-form, and long-form generation, and the validation type (V), which captures the availability of reference data or evaluative criteria. Drawing from both computer science and social science literature, we map existing uncertainty quantification (UQ) methods to this T-V typology and offer practical recommendations for researchers. Our framework provides both a methodological safeguard and a practical guide for integrating LLMs into rigorous social science research.
How to Tame Your LLM: Semantic Collapse in Continuous Systems
We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P: L^2(M,μ) \to L^2(M,μ)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this setting, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.