Education
Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
Stradi, Francesco Emanuele, Chiefari, Eleonora Fidelia, Castiglioni, Matteo, Marchesi, Alberto, Gatti, Nicola
We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, which does not allow large violations in early episodes to be offset by later playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $\alpha$-regret with respect to the \emph{unconstrained} optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
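To illustrate why positive constraint violation is the stronger notion, the following minimal sketch compares it with a signed cumulative sum. It is not the paper's algorithm. The function names, the per-episode margins, and the sign convention (a margin above zero means the constraint was violated in that episode) are all assumptions made for this example.

```python
# Hypothetical per-episode constraint margins g_t; the assumed convention is
# that g_t > 0 means the constraint was violated in episode t, and g_t < 0
# means the policy played was strictly safe.

def cumulative_violation(margins):
    """Signed sum: safe episodes (negative margins) cancel earlier violations."""
    return sum(margins)

def positive_violation(margins):
    """Sum of positive parts only: safe episodes cannot offset past violations."""
    return sum(max(0.0, g) for g in margins)

# Two early violations followed by two strictly safe episodes.
margins = [2.0, 1.0, -1.5, -1.5]

signed = cumulative_violation(margins)   # violations are fully cancelled
positive = positive_violation(margins)   # early violations still count
print(signed, positive)
```

Under these assumed margins the signed sum is zero while the positive violation remains 3.0, so a bound on the latter rules out the "violate early, compensate later" behaviour.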
What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages
Mishra, Debangan, Rastogi, Arihant, Negi, Agyeya, Goel, Shashwat, Kumaraguru, Ponnurangam
How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
Raman, Narun, Lundy, Taylor, Leyton-Brown, Kevin
When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K, MATH, STEER-ME) and 27 different LLMs (including small models such as Qwen-2.5 7B Instruct, mid-sized models such as Llama-3.3 70B Instruct, and large state-of-the-art models such as OpenAI's o3). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance by exploiting the information in the options. We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines for analyzing MCQA results so that they better reflect LLMs' genuine reasoning capabilities.
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Liu, Gabrielle Kaili-May, Yona, Gal, Caciularu, Avi, Szpektor, Idan, Rudner, Tim G. J., Cohan, Arman
A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of $\textit{faithful confidence calibration}$ of LLMs, benchmarking models' ability to use linguistic expressions of uncertainty that $\textit{faithfully reflect}$ their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
What happens when generative AI models train recursively on each others' outputs?
Vu, Hung Anh, Reeves, Galen, Wenger, Emily
The internet serves as a common source of training data for generative AI (genAI) models but is increasingly populated with AI-generated content. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding such data-mediated model interactions is critical. This work provides empirical evidence for how data-mediated interactions might unfold in practice, develops a theoretical model for this interactive training process, and experimentally validates the theory. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.
Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification
The Information Bottleneck (IB) objective uses information theory to formulate a task-performance versus robustness trade-off. It has been successfully applied in the standard discriminative classification setting. We ask whether the IB can also be used to train generative likelihood models such as normalizing flows. Since normalizing flows use invertible network architectures (INNs), they are information-preserving by construction. This seems to contradict the idea of a bottleneck.