 pokémon


VideoGameBench: Can Vision-Language Models complete popular video games?

Zhang, Alex L., Griffiths, Thomas L., Narasimhan, Karthik R., Press, Ofir

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.


Turn-based Multi-Agent Reinforcement Learning Model Checking

Gross, Dennis

arXiv.org Artificial Intelligence

In this paper, we propose a novel approach for verifying the compliance of turn-based multi-agent reinforcement learning (TMARL) agents with complex requirements in stochastic multiplayer games. Our method overcomes the limitations of existing verification approaches, which are inadequate for dealing with TMARL agents and not scalable to large games with multiple agents. Our approach relies on tight integration of TMARL and a verification technique referred to as model checking. We demonstrate the effectiveness and scalability of our technique through experiments in different types of environments. Our experiments show that our method is suited to verify TMARL agents and scales better than naive monolithic model checking.


Evolving Virtual World with Delta-Engine

Wu, Hongqiu, Xu, Zekai, Xu, Tianyang, Wei, Shize, Wang, Yan, Hong, Jiale, Wu, Weiqi, Zhao, Hai, Zhang, Min, He, Zhezhi

arXiv.org Artificial Intelligence

In this paper, we focus on the virtual world, a cyberspace in which people can live. An ideal virtual world shares great similarity with our real world. One of the crucial aspects is its evolving nature, reflected by individuals' capability to grow and thereby influence the objective world. Such dynamics are unpredictable and beyond the reach of existing systems. For this, we propose a special engine called Delta-Engine to drive this virtual world. Δ associates the world's evolution with the engine's scalability. It consists of a base engine and a neural proxy. The base engine programs the prototype of the virtual world; given a trigger, the neural proxy generates new snippets on the base engine through incremental prediction. This paper presents a full-stack introduction to the delta-engine. The key feature of the delta-engine is its scalability to unknown elements within the world. Technically, it derives from the perfect co-work of the neural proxy and the base engine, and the alignment with high-quality data. We introduce an engine-oriented fine-tuning method that embeds the base engine into the proxy. We then discuss the human-LLM collaborative design to produce novel and interesting data efficiently. Eventually, we propose three evaluation principles to comprehensively assess the performance of a delta engine: naive evaluation, incremental evaluation, and adversarial evaluation.


Covariate Assisted Entity Ranking with Sparse Intrinsic Scores

Fan, Jianqing, Hou, Jikai, Yu, Mengxin

arXiv.org Machine Learning

This paper addresses the item ranking problem with associated covariates, focusing on scenarios where the preference scores cannot be fully explained by covariates and the remaining intrinsic scores are sparse. Specifically, we extend the pioneering Bradley-Terry-Luce (BTL) model by incorporating covariate information and considering sparse individual intrinsic scores. Our work introduces novel model identification conditions and examines the statistical rates of the penalized Maximum Likelihood Estimator (MLE). We then construct a debiased estimator for the penalized MLE and analyze its distributional properties. Additionally, we apply our method to the goodness-of-fit test for models with no latent intrinsic scores, namely, models in which the covariates fully explain the preference scores of individual items. We also offer confidence intervals for ranks. Our numerical studies lend further support to our theoretical findings, validating our proposed method.
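As a rough illustration of the kind of model this abstract describes, the sketch below computes BTL win probabilities when each item's score is a covariate term plus an intrinsic term. The linear form of the covariate term, the function names, and all numeric values are assumptions made for illustration; they are not taken from the paper.

```python
import math


def covariate_score(covariates, beta, intrinsic):
    """Score = (covariates . beta) + intrinsic score (assumed linear form)."""
    return sum(x * b for x, b in zip(covariates, beta)) + intrinsic


def btl_win_probability(score_i, score_j):
    """Standard BTL comparison: P(i beats j) = e^si / (e^si + e^sj)."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


# Hypothetical coefficients and items: (covariate vector, intrinsic score).
beta = [0.5, -0.2]
items = {
    "A": ([1.0, 0.0], 0.0),  # fully explained by covariates
    "B": ([0.0, 1.0], 0.0),  # fully explained by covariates
    "C": ([0.0, 1.0], 0.7),  # same covariates as B, nonzero intrinsic score
}

scores = {name: covariate_score(x, beta, a) for name, (x, a) in items.items()}
p_ab = btl_win_probability(scores["A"], scores["B"])
p_ba = btl_win_probability(scores["B"], scores["A"])
p_cb = btl_win_probability(scores["C"], scores["B"])
```

Only item C carries a nonzero intrinsic score here, which is the sparse setting the abstract refers to: most items are ranked by their covariates alone, and a few deviate.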


'Pokemon with guns' Palworld sold an insane 5 million copies this weekend

PCWorld

If you keep an eye on the gaming news at all, you might have heard about a strange little indie game in development over the last couple of years. Often stylized as "Pokemon with guns," Palworld is an odd Psyduck, a fusion of modern crafting games like Ark and Valheim with the familiar monster fighting of the world's most lucrative media franchise. Its launch this weekend was a shocking success, selling over five million copies, shooting it to the #3 concurrent player spot on Steam. No, Palworld gets the #1 spot with a bullet among recent releases. Only PUBG and Counter-Strike 2 have reached higher numbers. And that's apparently true despite the game being exclusive to Xbox and PC (missing the large PlayStation and Switch portions of the market) and being offered to millions of players for free via a day-one release on the Xbox Game Pass.


The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline

Wang, Haonan, Shen, Qianli, Tong, Yao, Zhang, Yang, Kawaguchi, Kenji

arXiv.org Artificial Intelligence

The commercialization of diffusion models, renowned for their ability to generate high-quality images that are often indistinguishable from real ones, brings forth potential copyright concerns. Although attempts have been made to impede unauthorized access to copyrighted material during training and to subsequently prevent DMs from generating copyrighted images, the effectiveness of these solutions remains unverified. This study explores the vulnerabilities associated with copyright protection in DMs by introducing a backdoor data poisoning attack (SilentBadDiffusion) against text-to-image diffusion models. Our attack method operates without requiring access to or control over the diffusion model's training or fine-tuning processes; it merely involves the insertion of poisoning data into the clean training dataset. This data, comprising poisoning images equipped with prompts, is generated by leveraging the powerful capabilities of multimodal large language models and text-guided image inpainting techniques. Our experimental results and analysis confirm the method's effectiveness. By integrating a minor portion of non-copyright-infringing stealthy poisoning data into the clean dataset (rendering it free from suspicion), we can prompt the finetuned diffusion models to produce copyrighted content when activated by specific trigger prompts. These findings underline potential pitfalls in the prevailing copyright protection strategies and underscore the necessity for increased scrutiny and preventative measures against the misuse of DMs.


Principal Trade-off Analysis

Strang, Alexander, SeWell, David, Kim, Alexander, Alcedo, Kevin, Rosenbluth, David

arXiv.org Artificial Intelligence

How are the advantage relations between a set of agents playing a game organized and how do they reflect the structure of the game? In this paper, we illustrate "Principal Trade-off Analysis" (PTA), a decomposition method that embeds games into a low-dimensional feature space. We argue that the embeddings are more revealing than previously demonstrated by developing an analogy to Principal Component Analysis (PCA). PTA represents an arbitrary two-player zero-sum game as the weighted sum of pairs of orthogonal 2D feature planes. We show that the feature planes represent unique strategic trade-offs and truncation of the sequence provides insightful model reduction. We demonstrate the validity of PTA on a quartet of games (Kuhn poker, RPS+2, Blotto, and Pokemon). In Kuhn poker, PTA clearly identifies the trade-off between bluffing and calling. In Blotto, PTA identifies game symmetries, and specifies strategic trade-offs associated with distinct win conditions. These symmetries reveal limitations of PTA unaddressed in previous work. For Pokemon, PTA recovers clusters that naturally correspond to Pokemon types, correctly identifies the designed trade-off between those types, and discovers a rock-paper-scissors (RPS) cycle in the Pokemon generation type, all absent any specific information except game outcomes.
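To make the idea of a 2D feature plane concrete, the sketch below embeds plain rock-paper-scissors in a single plane and checks that a weighted 2D cross product of the embedded points reproduces the payoff matrix. The coordinates and the weight are picked by hand for this example; they are assumptions for illustration and not the output of the paper's decomposition procedure.

```python
import math

# Row player's payoff; strategies ordered rock, paper, scissors.
payoff = [
    [0, -1, 1],   # rock: loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper
]

# Hand-picked embedding: one point per strategy on the unit circle.
angles_deg = [0.0, -120.0, 120.0]
points = [
    (math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg
]

# Weight of the single plane, chosen so that advantages come out as +/-1.
weight = 2.0 / math.sqrt(3.0)


def plane_advantage(p, q):
    """2D cross product: positive when p has the advantage over q."""
    return p[0] * q[1] - p[1] * q[0]


reconstructed = [
    [weight * plane_advantage(points[i], points[j]) for j in range(3)]
    for i in range(3)
]
```

A game with more structure would need several such planes added together, each with its own weight; dropping the planes with the smallest weights is the kind of truncation the abstract calls model reduction.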


PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge

Cabello, Laura, Li, Jiaang, Chalkidis, Ilias

arXiv.org Artificial Intelligence

The recently released ChatGPT model demonstrates unprecedented capabilities in zero-shot question-answering. In this work, we probe ChatGPT for its conversational understanding and introduce a conversational framework (protocol) that can be adopted in future studies. The Pokémon universe serves as an ideal testing ground for auditing ChatGPT's reasoning capabilities due to its closed world assumption. After bringing ChatGPT's background knowledge (on the Pokémon universe) to light, we test its reasoning process when using these concepts in battle scenarios. We then evaluate its ability to acquire new knowledge and include it in its reasoning process. Our ultimate goal is to assess ChatGPT's ability to generalize, combine features, and to acquire and reason over newly introduced knowledge from human feedback. We find that ChatGPT has prior knowledge of the Pokémon universe, which it can reason upon in battle scenarios to a great extent, even when new information is introduced. The model performs better with collaborative feedback and if there is an initial phase of information retrieval, but also hallucinates occasionally and is susceptible to adversarial attacks.


Hitting the Books: How Pokemon took over the world

Engadget

The impact of Japanese RPGs on pop and gaming culture cannot be overstated. From Final Fantasy and Phantasy Star to Chrono Trigger, NieR, and Fire Emblem -- JRPGs have spanned console generations, bridged the Japanese and North American markets, spawned entire universes of IP and delivered critical commercial hits for nearly four decades. Modern gaming simply wouldn't exist as it does today if not for the influence of JRPGs. In his newest book, Fight, Magic, Items: The History of Final Fantasy, Dragon Quest, and the Rise of Japanese RPGs, Aidan Moher takes a wondrous in-depth look at the history of Japanese role-playing games, their initial rise in the East, the long road to acceptance in the West and their ultimate cultural impact the world over. In the excerpt below, Moher explores how Pokemon grew from Game Boy screens to become a multi-billion dollar entertainment juggernaut.


This Pokemon Generator Is Going Viral Over Its Crazy Creations

#artificialintelligence

Throughout the decades, Pokemon has taken the opportunity to introduce fans to some strange creatures throughout both the franchise's anime and the countless video games that place players into the role of trainers seeking to catch powerful pocket monsters. While Ghost Type Pokemon are definitely some of the scariest creatures within the popular franchise, a new AI program is helping fans to create terrifying Pokemon of their own, introducing new pocket monsters to the various generations of the series that would otherwise never have seen the light of day. If you haven't been following along with the adventures of Ash Ketchum in Pokemon Journeys, fans have been given some of the biggest adventures in the trainer's life, as the eternally young trainer has been on a world tour to help in celebrating his first-ever Pokemon Tournament victory. Joined by his new friend Goh, who is seeking to capture the Pokemon known as Mew to add to his ever-expanding roster, Ash has a new challenge ahead of him in defeating the trainers of Galar, with a future episode of the anime hinting that a road trip with none other than Galar's Champion Leon is in the cards for the anime's protagonist. This AI pokemon website is the funniest shit.