
The Real Demon Inside ChatGPT

WIRED

Language is meaningless without context. The sentence "I'm going to war" is ominous when said by the president of the United States but reassuring when coming from a bedbug exterminator. The problem with AI chatbots is that they often strip away historical and cultural context, leading users to be confused, alarmed, or, in the worst cases, misled in harmful ways. Last week, an editor at The Atlantic reported that OpenAI's ChatGPT had praised Satan while guiding her and several colleagues through a series of ceremonies encouraging "various forms of self-mutilation." There was a bloodletting ritual called "THE RITE OF THE EDGE" as well as a days-long "deep magic" experience called "The Gate of the Devourer."


Large Language Model Strategic Reasoning Evaluation through Behavioral Game Theory

Jia, Jingru, Yuan, Zehua, Pan, Junhao, McNamara, Paul E., Chen, Deming

arXiv.org Artificial Intelligence

Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory that disentangles reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games, yet the results also show that model scale alone does not determine performance. As for prompting, Chain-of-Thought (CoT) prompting is not universally effective: it increases strategic reasoning only for models at certain reasoning levels and provides limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features, observing that certain assignments alter decision-making patterns. For instance, GPT-4o shows stronger strategic reasoning with female traits than with male traits, while Gemma assigns higher reasoning levels to heterosexual identities than to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
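The abstract does not spell out which games or scoring rules the framework uses, but a classic behavioral game theory probe is the p-beauty contest, where a player's numeric choice reveals the depth of iterated best response ("level-k" reasoning). The sketch below is a minimal illustration of that idea under assumed settings; `query_llm`, the prompt wording, and the scoring rule are placeholders, not the paper's protocol.

```python
# Minimal level-k scoring sketch in the spirit of behavioral game theory
# evaluations (illustrative only; the paper's actual games and prompts are
# not reproduced here). In a p-beauty contest, players pick a number in
# [0, 100] and the one closest to 2/3 of the average wins: a level-0 player
# guesses 50, level-1 best-responds with ~33, level-2 with ~22, and so on,
# so the chosen number hints at the depth of iterated reasoning.

def level_k_from_guess(guess: float, anchor: float = 50.0,
                       factor: float = 2 / 3, max_level: int = 10) -> int:
    """Return the reasoning level whose best response is closest to the guess."""
    best_level, best_dist = 0, abs(guess - anchor)
    prediction = anchor
    for level in range(1, max_level + 1):
        prediction *= factor                      # iterate the best response once more
        dist = abs(guess - prediction)
        if dist < best_dist:
            best_level, best_dist = level, dist
    return best_level


def query_llm(prompt: str) -> str:
    """Stand-in for a real model call (replace with your own API client)."""
    return "33"  # placeholder response


if __name__ == "__main__":
    prompt = ("You are playing a game with 100 other players. Everyone picks a "
              "number between 0 and 100. The player closest to two-thirds of "
              "the average wins. Reply with a single number.")
    guess = float(query_llm(prompt))
    print(f"guess={guess}, inferred reasoning level ~ {level_k_from_guess(guess)}")
```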


Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation

Wang, Yu, Zhang, Jiaxin, Gao, Xiang, Cui, Wendi, Li, Peng, Das, Kamalika

arXiv.org Artificial Intelligence

In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter "contextual hallucination", where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called "Guided Attention Map Editing" (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed "edit directions", strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% on the XSum summarization task while achieving a 7x speed-up in computational efficiency compared to state-of-the-art baselines.
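As a rough illustration of the re-weighting mechanic, the sketch below boosts the attention mass on context tokens in a single attention row and renormalizes it. In the actual method, a trained classifier selects which maps to edit and gradient-informed edit directions decide how; here both are replaced by a fixed boost on assumed context positions.

```python
import numpy as np

def edit_attention_row(attn_row: np.ndarray, context_mask: np.ndarray,
                       boost: float = 0.25) -> np.ndarray:
    """Push extra attention mass onto context tokens, then renormalize.

    attn_row     -- attention weights of one head for the current query token
    context_mask -- boolean mask marking positions belonging to the input context
    boost        -- crude stand-in for a gradient-informed edit direction
    """
    edited = attn_row.copy()
    edited[context_mask] *= (1.0 + boost)   # strengthen attention to the source
    return edited / edited.sum()            # keep the row a valid distribution


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    row = rng.dirichlet(np.ones(8))               # toy attention row over 8 tokens
    mask = np.array([True] * 5 + [False] * 3)     # assume the first 5 tokens are context
    print("context mass before:", row[mask].sum())
    print("context mass after: ", edit_attention_row(row, mask)[mask].sum())
```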


Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

Li, Xingzuo, Chen, Kehai, Long, Yunfei, Bai, Xuefeng, Xu, Yong, Zhang, Min

arXiv.org Artificial Intelligence

Large language model (LLM) agents typically adopt a step-by-step reasoning framework in which they interleave thinking and acting to accomplish a given task. However, this paradigm suffers from a deep-rooted one-pass issue: each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address this issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Specifically, GA-Rollback uses a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detecting an incorrect action. Moreover, we introduce two additional strategies tailored to the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
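The following toy loop sketches the generator-assistant interaction described above: proposals the assistant rejects are rolled back (never appended to the trajectory) and regenerated. The scripted proposals, the validity check, and the retry budget are stand-ins for the two LLMs and environment feedback used in the paper.

```python
from typing import List

# Hypothetical one-step-ahead proposals per trajectory length; the second
# step deliberately starts with a bad action to show a rollback.
PROPOSALS = {
    0: ["open drawer"],
    1: ["eat key", "take key"],   # first proposal is wrong, retry yields the fix
    2: ["unlock door"],
    3: ["finish"],
}
VALID_ACTIONS = {"open drawer", "take key", "unlock door", "finish"}


def generator(trajectory: List[str], attempt: int) -> str:
    """Stand-in generator LLM: propose the next action for the current state."""
    options = PROPOSALS[len(trajectory)]
    return options[min(attempt, len(options) - 1)]


def assistant(trajectory: List[str], action: str) -> bool:
    """Stand-in assistant LLM: accept only actions the environment recognizes."""
    return action in VALID_ACTIONS


def run_episode(max_steps: int = 10, max_retries: int = 3) -> List[str]:
    trajectory: List[str] = []
    for _ in range(max_steps):
        action = None
        for attempt in range(max_retries):
            candidate = generator(trajectory, attempt)
            if assistant(trajectory, candidate):
                action = candidate        # accepted: commit to the trajectory
                break                     # rejected candidates are rolled back
        if action is None:
            break                         # give up after repeated rejections
        trajectory.append(action)
        if action == "finish":
            break
    return trajectory


if __name__ == "__main__":
    print(run_episode())  # ['open drawer', 'take key', 'unlock door', 'finish']
```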


Slate Crossword: 1994 Action-Comedy Starring Eddie Murphy … and a Droid Fluent in Ewokese? (16 Letters)

Slate

Read about it in Slate: Move aside, Wall-E. The next great environmentalist children's movie is here. The Wild Robot is surprisingly mature about nature, nurture, and the wonders of life.


How Far are LLMs from Real Search? A Comprehensive Study on Efficiency, Completeness, and Inherent Capabilities

Lin, Minhua, Liu, Hui, Tang, Xianfeng, Zeng, Jingying, Dai, Zhenwei, Luo, Chen, Li, Zheng, Zhang, Xiang, He, Qi, Wang, Suhang

arXiv.org Artificial Intelligence

Search plays a fundamental role in problem-solving across various domains, with most real-world decision-making problems being solvable through systematic search. Drawing inspiration from recent discussions on search and learning, we systematically explore the complementary relationship between search and Large Language Models (LLMs) from three perspectives. First, we analyze how learning can enhance search efficiency and propose Search via Learning (SeaL), a framework that leverages LLMs for effective and efficient search. Second, we further extend SeaL to SeaL-C to ensure rigorous completeness during search. Our evaluation across three real-world planning tasks demonstrates that SeaL achieves near-perfect accuracy while reducing search spaces by up to 99.1% compared to traditional approaches. Finally, we explore how far LLMs are from real search by investigating whether they can develop search capabilities independently. Our analysis reveals that while current LLMs struggle with efficient search in complex problems, incorporating systematic search strategies significantly enhances their problem-solving capabilities. These findings not only validate the effectiveness of our approach but also highlight the need for improving LLMs' search abilities for real-world applications.
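One common way to let a learned model guide search, which may or may not match SeaL's actual design, is best-first search with a model-derived priority over frontier states. The sketch below uses a toy numeric planning domain and a hand-written heuristic as a stand-in for an LLM scorer; the state bound keeps the space finite so every state is eventually examined, loosely mirroring the completeness concern SeaL-C addresses.

```python
import heapq
from typing import Callable, List, Optional, Tuple

def successors(state: int) -> List[int]:
    """Toy planning domain: reach a target number via +1 or *2 moves."""
    return [state + 1, state * 2]


def guided_search(start: int, goal: int,
                  score: Callable[[int, int], float],
                  limit: int = 10_000) -> Optional[List[int]]:
    """Best-first search where `score` plays the role of an LLM-derived priority."""
    frontier: List[Tuple[float, int, List[int]]] = [(score(start, goal), start, [start])]
    seen = {start}
    expanded = 0
    while frontier and expanded < limit:
        _, state, path = heapq.heappop(frontier)
        expanded += 1
        if state == goal:
            print(f"expanded {expanded} states")
            return path
        for nxt in successors(state):
            if nxt not in seen and nxt <= 4 * goal:    # bound keeps the toy space finite
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt, goal), nxt, path + [nxt]))
    return None


def llm_like_score(state: int, goal: int) -> float:
    """Stand-in heuristic: prefer states close to the goal (an LLM would rank these)."""
    return abs(goal - state)


if __name__ == "__main__":
    print(guided_search(1, 37, llm_like_score))
```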


TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

Hudi, Frederikus, Winata, Genta Indra, Zhang, Ruochen, Aji, Alham Fikri

arXiv.org Artificial Intelligence

Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, as well as their ability to leverage feedback and correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.
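A minimal version of the multi-turn self-reflection loop such a benchmark measures might look like the sketch below: a solver proposes an answer, a rule checker returns feedback, and the next attempt is conditioned on that feedback. The puzzle rule, the checker, and the scripted "model" responses are illustrative assumptions, not TextGames content.

```python
from typing import List, Tuple

def check(answer: str) -> Tuple[bool, str]:
    """Toy puzzle: the answer must be a palindrome and contain the letter 'b'."""
    if answer != answer[::-1]:
        return False, "the answer is not a palindrome"
    if "b" not in answer:
        return False, "the answer does not contain the letter 'b'"
    return True, "correct"


def scripted_model(feedback_history: List[str]) -> str:
    """Stand-in for an LLM: the first try fails, later attempts use the feedback."""
    attempts = ["abc", "aba"]
    return attempts[min(len(feedback_history), len(attempts) - 1)]


def self_play(max_turns: int = 4) -> str:
    feedback_history: List[str] = []
    answer = ""
    for turn in range(1, max_turns + 1):
        answer = scripted_model(feedback_history)
        ok, feedback = check(answer)
        print(f"turn {turn}: {answer!r} -> {feedback}")
        if ok:
            return answer
        feedback_history.append(feedback)   # feed the verifier's message back in
    return answer


if __name__ == "__main__":
    self_play()
```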


Director of the Game 'Avowed' Says AI Can't Replace Human Creativity

WIRED

As the video games industry continues to face massive layoffs, narrative jobs are taking the biggest hit. The industry's job cuts over the past couple of years (more than 30,000 roles were eliminated in 2023 and 2024) disproportionately affected narrative designers, the creative professionals who craft the story elements of a game and give a title its emotional punch. Even the director of the game Avowed, Carrie Patel, a successful author and narrative developer with over a decade of experience at the game studio Obsidian Entertainment, feels lucky she was able to start her career years ago. She can't imagine trying to break into the industry under today's conditions. "It just seems to be harder and harder to find a path in," Patel says. "I've heard colleagues hired within the last three or five years say essentially the same thing."


Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data

Zhou, Doudou, Tong, Han, Wang, Linshanshan, Liu, Suqi, Xiong, Xin, Gan, Ziming, Griffier, Romain, Hejblum, Boris, Liu, Yun-Chung, Hong, Chuan, Bonzel, Clara-Lea, Cai, Tianrun, Pan, Kevin, Ho, Yuk-Lam, Costa, Lauren, Panickan, Vidul A., Gaziano, J. Michael, Mandl, Kenneth, Jouhet, Vianney, Thiebaut, Rodolphe, Xia, Zongqi, Cho, Kelly, Liao, Katherine, Cai, Tianxi

arXiv.org Artificial Intelligence

The adoption of electronic health records (EHRs) has expanded opportunities to leverage data-driven algorithms in clinical care and research. A major bottleneck in conducting effective multi-institutional EHR studies is data heterogeneity across systems, with numerous codes that either do not exist or represent different clinical concepts across institutions. The need for data privacy further limits the feasibility of pooling the multi-institutional patient-level data required to study similarities and differences across patient subgroups. To address these challenges, we developed the GAME algorithm. Tested and validated across 7 institutions and 2 languages, GAME integrates data at several levels: (1) at the institutional level, with knowledge graphs that establish relationships between codes and existing knowledge sources, providing the medical context for standard codes and their relationships to each other; (2) between institutions, leveraging language models to determine the relationships between institution-specific codes and established standard codes; and (3) quantifying the strength of the relationships between codes using a graph attention network. Jointly trained embeddings are created with transfer and federated learning to preserve data privacy. In this study, we demonstrate the applicability of GAME in selecting relevant features as inputs for AI-driven algorithms in a range of conditions, e.g., heart failure and rheumatoid arthritis. We then highlight the application of GAME-harmonized multi-institutional EHR data in a study of Alzheimer's disease outcomes and suicide risk among patients with mental health disorders, without sharing patient-level data outside individual institutions.
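As a toy illustration of step (2) only (linking institution-specific codes to standard codes), the sketch below matches a local code description to the closest standard code description. The word-overlap similarity is a placeholder for a clinical language model, and the example code strings are invented; the knowledge graphs, graph attention network, and federated training in GAME are not shown.

```python
def similarity(a: str, b: str) -> float:
    """Word-overlap stand-in for language model embedding similarity."""
    wa = set(a.lower().replace(",", "").split())
    wb = set(b.lower().replace(",", "").split())
    return len(wa & wb) / len(wa | wb)


def best_match(local_desc: str, standard: dict) -> str:
    """Map a local code description to the closest standard code."""
    return max(standard, key=lambda code: similarity(local_desc, standard[code]))


if __name__ == "__main__":
    standard_codes = {
        "I50.9": "heart failure, unspecified",
        "M06.9": "rheumatoid arthritis, unspecified",
    }
    # Hypothetical institution-specific description mapped to a standard code.
    print(best_match("chronic heart failure", standard_codes))  # -> I50.9
```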