trueskill
Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game?
Licato, John, Steinle, Stephen, Hollis, Brayden
Although persona prompting in large language models appears to trigger different styles of generated text, it is unclear whether these translate into measurable behavioral differences, much less whether they affect decision-making in an adversarial strategic environment, which we provide as open source. We investigate the impact of persona prompting on strategic performance in PERIL, a world-domination board game. Specifically, we compare the effectiveness of persona-derived heuristic strategies to those chosen manually. Our findings reveal that certain personas associated with strategic thinking improve game performance, but only when a mediator is used to translate personas into heuristic values. We introduce this mediator as a structured translation process, inspired by exploratory factor analysis, that maps LLM-generated inventory responses into heuristics. Results indicate our method enhances heuristic reliability and face validity compared to directly inferred heuristics, allowing us to better study the effect of persona types on decision-making. These insights advance our understanding of how persona prompting influences LLM-based decision-making, and the heuristic generation method we propose applies psychometric principles to LLMs.
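The mediator is only summarized above; the following is a minimal, hypothetical sketch (not the authors' implementation; inventory items, personas, and heuristic names are invented) of how Likert-style inventory responses could be mapped to heuristic weights with a factor-analysis step:

```python
# Hypothetical sketch of a factor-analysis-style "mediator" (illustrative only).
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Rows: persona-prompted LLM respondents; columns: Likert-scale inventory items (1-5).
responses = np.array([
    [5, 4, 2, 1, 5, 4],
    [2, 1, 5, 4, 2, 2],
    [4, 5, 1, 2, 4, 5],
    [1, 2, 4, 5, 1, 1],
], dtype=float)

fa = FactorAnalysis(n_components=2, random_state=0)
factor_scores = fa.fit_transform(responses)  # one row of factor scores per persona

# Rescale factor scores to [0, 1] so each persona yields a vector of heuristic
# weights (e.g., "aggression", "consolidation") usable by a game-playing agent.
lo, hi = factor_scores.min(axis=0), factor_scores.max(axis=0)
heuristic_weights = (factor_scores - lo) / (hi - lo + 1e-9)
print(heuristic_weights)
```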
- North America > United States > Maine > York County > Biddeford (0.04)
- Oceania > Guam (0.04)
- Europe > United Kingdom > Scotland (0.04)
- Leisure & Entertainment > Games (1.00)
- Information Technology (1.00)
- Government > Military (1.00)
Learning from Group Comparisons: Exploiting Higher Order Interactions
Yao Li, Minhao Cheng, Kevin Fujii, Fushing Hsieh, Cho-Jui Hsieh
We study the problem of learning from group comparisons, with applications in predicting outcomes of sports and online games. Most of the previous works in this area focus on learning individual effects--they assume each player has an underlying score, and the "ability" of the team is modeled by the sum of its members' scores.
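A minimal sketch of the individual-effects baseline described above, assuming a logistic (Bradley-Terry-style) link and illustrative player scores:

```python
# Individual-effects baseline: each player has a scalar score, a team's strength
# is the sum of its members' scores, and the win probability is logistic in the
# strength difference. Scores and rosters below are illustrative only.
import math

player_score = {"a": 1.2, "b": -0.3, "c": 0.5, "d": 0.1, "e": -0.8, "f": 0.9}

def team_strength(team):
    return sum(player_score[p] for p in team)

def win_probability(team_a, team_b):
    """P(team_a beats team_b) under the additive individual-effects model."""
    return 1.0 / (1.0 + math.exp(-(team_strength(team_a) - team_strength(team_b))))

print(win_probability({"a", "b", "c"}, {"d", "e", "f"}))  # ~0.77
```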
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > Yolo County > Davis (0.05)
- Asia > Middle East > Jordan (0.04)
- North America > Canada > Quebec > Montreal (0.04)
dataset release, tournament evaluation, architectural design, input representation, and other insights
We want to thank the reviewers for their helpful comments. The dataset will be made available to any interested researchers. We agree with R3 that there are a lot of non-trivial modeling choices in our architecture. We call the former unit-based and the latter token-based. We apologize for writing some of the claims without referring to the evidence, such as "orders from the last movement". Our input representation is a result of both empirical findings and domain knowledge.
Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
Tang, Raphael, Zhang, Crystina, Li, Wenyan, Lai, Carmen, Stenetorp, Pontus, Lu, Yao
In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and for those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
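A minimal sketch of the draw handling suggested above, assuming a conventional Elo update (the K-factor and scale are not taken from the paper): wins and losses update ratings as usual, while draws leave both ratings untouched.

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Elo update for one battle. outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw."""
    if outcome == 0.5:
        return r_a, r_b  # proposed variant: ignore rating updates for draws
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1500, 1400, 1.0))  # A beats B: A gains ~11.5 points, B loses ~11.5
print(elo_update(1500, 1400, 0.5))  # draw: ratings unchanged
```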
SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
Gould, Dewi S. W., Mlodozeniec, Bruno, Brown, Samuel F.
Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
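The ranking step can be sketched with the open-source `trueskill` package; the models and match outcomes below are made up, and SKATE's task-generation and scoring pipeline is not reproduced here.

```python
# TrueSkill-style leaderboard over pairwise challenge outcomes (illustrative only).
import trueskill

ratings = {name: trueskill.Rating() for name in ("model_a", "model_b", "model_c")}

# Each entry: (winner, loser) of one verifiable challenge "match" (made-up data).
matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

for winner, loser in matches:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rank by the conservative skill estimate (mu - 3*sigma).
for name, r in sorted(ratings.items(), key=lambda kv: trueskill.expose(kv[1]), reverse=True):
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")
```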
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Canada > Newfoundland and Labrador > Labrador (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
Beyond Winning: Margin of Victory Relative to Expectation Unlocks Accurate Skill Ratings
Shorewala, Shivam, Yang, Zihao
Knowledge of accurate relative skills in any competitive system is essential, but foundational approaches such as Elo discard extremely relevant performance data by concentrating exclusively on binary outcomes. While margin of victory (MOV) extensions exist, they often lack a definitive method for incorporating this information. We introduce Margin of Victory Differential Analysis (MOVDA), a framework that enhances traditional rating systems by using the deviation between the true MOV and a $\textit{modeled expectation}$. MOVDA learns a domain-specific, non-linear function (a scaled hyperbolic tangent that captures saturation effects and home advantage) to predict expected MOV based on rating differentials. Crucially, the $\textit{difference}$ between the true and expected MOV provides a subtle and weighted signal for rating updates, highlighting informative deviations in all levels of contests. Extensive experiments on professional NBA basketball data (from 2013 to 2023, with 13,619 games) show that MOVDA significantly outperforms standard Elo and Bayesian baselines. MOVDA reduces Brier score prediction error by $1.54\%$ compared to TrueSkill, increases outcome accuracy by $0.58\%$, and most importantly accelerates rating convergence by $13.5\%$, while maintaining the computational efficiency of the original Elo updates. MOVDA offers a theoretically motivated, empirically superior, and computationally lean approach to integrating performance magnitude into skill rating for competitive environments like the NBA.
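A minimal sketch of the idea, assuming placeholder constants rather than the fitted values from the paper: predict the expected margin with a scaled tanh of the rating differential plus a home-advantage term, then shift ratings by the deviation between observed and expected margin.

```python
import math

def expected_mov(r_home, r_away, scale=12.0, slope=0.004, home_adv=60.0):
    """Expected margin of victory from the rating differential (placeholder constants)."""
    return scale * math.tanh(slope * (r_home - r_away + home_adv))

def movda_update(r_home, r_away, actual_mov, k=0.5):
    """Update ratings using the deviation between observed and expected margin."""
    deviation = actual_mov - expected_mov(r_home, r_away)
    return r_home + k * deviation, r_away - k * deviation

# Home team rated 1520 beats the 1480 away team by 7 points.
print(movda_update(1520, 1480, actual_mov=7))
```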
- Leisure & Entertainment > Sports > Football (0.83)
- Leisure & Entertainment > Sports > Basketball (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
Goto, Takumi, Sakai, Yusuke, Watanabe, Taro
One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that the resulting ranking matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, $n$-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform GPT-4-based metrics. We publish our unified implementation of the metrics and meta-evaluations.
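The aggregation described above can be sketched as follows, assuming made-up sentence-level scores and a simple Bradley-Terry fit as the rating algorithm (the paper's exact choice may differ): convert per-sentence scores into pairwise wins between systems, then rank by fitted strength instead of by averaged corpus-level scores.

```python
# Rank GEC systems from sentence-level pairwise comparisons (illustrative data).
import numpy as np

# sentence_scores[system] = per-sentence metric scores (higher is better).
sentence_scores = {
    "sys_a": np.array([0.9, 0.6, 0.8, 0.7]),
    "sys_b": np.array([0.7, 0.8, 0.5, 0.6]),
    "sys_c": np.array([0.5, 0.7, 0.6, 0.9]),
}

systems = list(sentence_scores)
wins = {s: {t: int(np.sum(sentence_scores[s] > sentence_scores[t])) for t in systems}
        for s in systems}

# Bradley-Terry strengths via a simple minorization-maximization (MM) loop.
strength = {s: 1.0 for s in systems}
for _ in range(200):
    for s in systems:
        total_wins = sum(wins[s][t] for t in systems if t != s)
        denom = sum((wins[s][t] + wins[t][s]) / (strength[s] + strength[t])
                    for t in systems if t != s)
        strength[s] = total_wins / denom
    total = sum(strength.values())
    for s in systems:
        strength[s] /= total  # normalize so strengths sum to 1

print(sorted(strength.items(), key=lambda kv: kv[1], reverse=True))  # corpus-level ranking
```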
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Data Science > Data Quality > Data Cleaning (0.64)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.63)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.37)
Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
Shrivastava, Vaishnavi, Kumar, Ananya, Liang, Percy
Language models (LMs) should provide reliable confidence estimates to help users detect mistakes in their outputs and defer to human experts when necessary. Asking a language model to assess its confidence ("Score your confidence from 0-1.") is a natural way of evaluating its uncertainty. However, models struggle to provide absolute assessments of confidence (i.e. judging confidence in answering a question independent of other questions) and the coarse-grained scores they produce are not useful for evaluating the correctness of their answers. We propose relative confidence estimation, where we match up questions against each other and ask the model to make relative judgments of confidence ("Which question are you more confident in answering correctly?"). Treating each question as a "player" in a series of matchups against other questions and the model's preferences as match outcomes, we can use rank aggregation methods like Elo rating and Bradley-Terry to translate the model's confidence preferences into confidence scores. We evaluate relative confidence estimation against absolute confidence estimation and self-consistency confidence methods on five state-of-the-art LMs -- GPT-4, GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B -- across 14 challenging STEM, social science, and commonsense reasoning question answering tasks. Our results demonstrate that relative confidence estimation consistently provides more reliable confidence scores than absolute confidence estimation, with average gains of 3.5% in selective classification AUC over direct absolute confidence estimation methods and 1.7% over self-consistency approaches across all models and datasets.
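A minimal sketch of the rank-aggregation step, assuming made-up questions and preferences and a standard sequential Elo loop (Bradley-Terry aggregation would work similarly):

```python
# Questions as "players", pairwise confidence preferences as match outcomes.
questions = ["q1", "q2", "q3"]
# (preferred, other): the model reported higher confidence in the first question.
preferences = [("q1", "q2"), ("q1", "q3"), ("q3", "q2"), ("q1", "q2")]

rating = {q: 1000.0 for q in questions}
K = 24
for winner, loser in preferences:
    expected_win = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
    rating[winner] += K * (1.0 - expected_win)
    rating[loser] -= K * (1.0 - expected_win)

# Higher rating -> higher relative confidence, usable for selective prediction.
print(sorted(rating.items(), key=lambda kv: kv[1], reverse=True))
```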
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Education (0.68)
- Leisure & Entertainment > Games > Chess (0.51)