Supplementary material

Neural Information Processing Systems

This section contains supplementary material to support the main paper text. The contents include: A. videos illustrating our approach; B. additional walkthrough collection details to supplement Sec. 4 (simulators); C. alternate local state task formulations compared to ours in Sec.; D. additional attention visualization results to supplement Figure 1.


VideoGameBench: Can Vision-Language Models complete popular video games?

Zhang, Alex L., Griffiths, Thomas L., Narasimhan, Karthik R., Press, Ofir

arXiv.org Artificial Intelligence

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans, such as perception, spatial navigation, and memory management, remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
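The latency effect the abstract describes can be illustrated with a toy simulation (this is not code from the paper; frame counts, the fixed per-action cost, and the function name are invented for illustration). In real time the game keeps advancing while the model is "thinking", so a slow model acts on only a fraction of the elapsed frames; in a Lite-style setting the world pauses during inference:

```python
def run_episode(inference_latency_frames, pause_on_think, total_frames=60):
    """Toy model of inference latency in a real-time game.

    If pause_on_think is False, the world advances by the model's
    inference latency before each action lands; if True, the world
    waits for the model (the "Lite" idea). Returns how many actions
    the agent gets to take over total_frames frames of game time.
    """
    frame, actions_taken = 0, 0
    while frame < total_frames:
        if not pause_on_think:
            frame += inference_latency_frames  # world moves on while we think
        if frame < total_frames:
            actions_taken += 1
            frame += 1  # the chosen action itself consumes one frame
    return actions_taken
```

With a 5-frame inference latency, the real-time agent acts on only one frame in six, while the paused agent acts on every frame, which is one way to see why pausing the game isolates reasoning quality from speed.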


MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models

Ding, Peng, Fang, Jiading, Li, Peng, Wang, Kangrui, Zhou, Xiaochen, Yu, Mo, Li, Jing, Walter, Matthew R., Mei, Hongyuan

arXiv.org Artificial Intelligence

Large language models such as ChatGPT and GPT-4 have recently achieved astonishing performance on a variety of natural language processing tasks. In this paper, we propose MANGO, a benchmark to evaluate their capabilities to perform text-based mapping and navigation. Our benchmark includes 53 mazes taken from a suite of text games: each maze is paired with a walkthrough that visits every location but does not cover all possible paths. The task is question-answering: for each maze, a large language model reads the walkthrough and answers hundreds of mapping and navigation questions such as "How should you go to Attic from West of House?" and "Where are we if we go north and east from Cellar?". Although these questions are easy for humans, it turns out that even GPT-4, the best language model to date, performs poorly at answering them. Further, our experiments suggest that a strong mapping and navigation ability would benefit large language models in performing relevant downstream tasks, such as playing text games. Our MANGO benchmark will facilitate future research on methods that improve the mapping and navigation capabilities of language models. We host our leaderboard, data, code, and evaluation program at https://mango.ttic.edu and https://github.com/oaklight/mango/.
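Both question types in the abstract reduce to operations on a directed graph built from the walkthrough. The sketch below is not MANGO's evaluation code; the room connections are invented for illustration (only the room names echo the abstract's examples), but it shows the two query shapes: following a direction sequence, and finding a route with breadth-first search:

```python
from collections import deque

# Hypothetical miniature walkthrough as (from_room, direction, to_room)
# triples; the layout is made up for this sketch.
MOVES = [
    ("West of House", "north", "North of House"),
    ("North of House", "east", "Behind House"),
    ("Behind House", "down", "Cellar"),
    ("Cellar", "north", "Troll Room"),
    ("Troll Room", "east", "Attic"),
]

def build_map(moves):
    """Collect the walkthrough's moves into room -> {direction: room}."""
    graph = {}
    for src, direction, dst in moves:
        graph.setdefault(src, {})[direction] = dst
    return graph

def follow(graph, start, directions):
    """'Where are we if we go north and east from Cellar?'-style queries."""
    here = start
    for d in directions:
        here = graph[here][d]
    return here

def route(graph, start, goal):
    """'How should you go to Attic from West of House?'-style queries (BFS)."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        room, path = queue.popleft()
        if room == goal:
            return path
        for d, nxt in graph.get(room, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [d]))
    return None  # unreachable with the observed moves
```

The benchmark's difficulty for LLMs is precisely that they must maintain this graph implicitly from prose, rather than being handed the explicit structure.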


Will GPT-4 Run DOOM?

de Wynter, Adrian

arXiv.org Artificial Intelligence

We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. This large language model (LLM) is able to run and play the game with only a few instructions, plus a textual description--generated by the model itself from screenshots--about the state of the game being observed. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing. More complex prompting strategies involving multiple model calls provide better results. While further work is required to enable the LLM to play the game as well as its classical, reinforcement learning-based counterparts, we note that GPT-4 required no training, leaning instead on its own reasoning and observational capabilities. We hope our work pushes the boundaries on intelligent, LLM-based agents in video games. We conclude by discussing the ethical implications of our work.


JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions

Yu, Mo, Gu, Yi, Guo, Xiaoxiao, Feng, Yufei, Zhu, Xiaodan, Greenspan, Michael, Campbell, Murray, Gan, Chuang

arXiv.org Artificial Intelligence

Commonsense reasoning simulates the human ability to make presumptions about our physical world, and it is an essential cornerstone in building general AI systems. We propose a new commonsense reasoning dataset based on human Interactive Fiction (IF) gameplay walkthroughs, as human players demonstrate plentiful and diverse commonsense reasoning. The new dataset provides a natural mixture of various reasoning types and requires multi-hop reasoning. Moreover, the IF game-based construction procedure requires much less human intervention than previous ones. Different from existing benchmarks, our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge. Hence, in order to achieve higher performance on our tasks, models need to effectively utilize such functional knowledge to infer the outcomes of actions, rather than relying solely on memorizing facts. Experiments show that the introduced dataset is challenging for previous machine reading models as well as new large language models, with a significant 20% performance gap compared to human experts.


Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions

Tsai, Chen Feng, Zhou, Xiaochen, Liu, Sierra S., Li, Jing, Yu, Mo, Mei, Hongyuan

arXiv.org Artificial Intelligence

Large language models (LLMs) such as ChatGPT and GPT-4 have recently demonstrated remarkable abilities in communicating with human users. In this technical report, we investigate their capacity to play text games, in which a player has to understand the environment and respond to situations by having dialogues with the game world. Our experiments show that ChatGPT performs competitively compared to all the existing systems but still exhibits a low level of intelligence. Specifically, ChatGPT cannot construct a world model by playing the game or even reading the game manual; it may fail to leverage the world knowledge that it already has; and it cannot infer the goal of each step as the game progresses.


Practical MLOps: Operationalizing Machine Learning Models (ISBN 9781098103019)

Gift, Noah, Deza, Alfredo

#artificialintelligence

The first few chapters cover the theory and practice of both DevOps and MLOps. One of the items covered is how to set up continuous integration and continuous delivery. Another critical topic is Kaizen, i.e., the idea of continuous improvement in everything. There are three chapters on cloud computing that cover AWS, Azure, and GCP. Alfredo, a developer advocate for Microsoft, is an ideal source of knowledge for MLOps on the Azure platform. Likewise, Noah has spent years getting students trained on cloud computing and working with the education arms of Google, AWS, and Azure.


11 Ways to Learn More Data Science

#artificialintelligence

I've been a teacher at many grade levels, and I own a tutoring center that serves kids from age 4 to 18. I've tutored hundreds of students myself over 10 years. I've spent a lot of time trying to teach concepts, to students, peers, friends, direct reports, you name it. I say this because there is one thing that I beg you to listen to, and it's the number one issue I've seen in students at all levels: We just don't know what we don't know. People aren't great at seeing where their own understanding has small gaps. For any topic, we have a few lines of knowledge that we can spout, but we just aren't aware of the edge cases that exist until we see them. We don't have all the knowledge of how every topic intersects with every related one, and many times, those answers are not easy to figure out. Therein lies why experience is valuable. There is so much about even the basic Data Science topics that we haven't yet come across.


5 Practical Data Science Projects That Will Help You Solve Real Business Problems for 2022 - KDnuggets

#artificialintelligence

Recommendation systems are algorithms with an objective to suggest the most relevant information to users, whether that be similar products on Amazon, similar TV shows on Netflix, or similar songs on Spotify. There are two main types of recommendation systems: collaborative filtering and content-based filtering. Recommendation systems are one of the most widely used and most practical data science applications. Not only that, but they also have one of the highest ROIs when it comes to data products. It's estimated that Amazon increased its sales by 29% in 2019, specifically due to its recommendation system. Likewise, Netflix claimed that its recommendation system was worth a staggering $1 billion in 2016! But what makes it so profitable? As I alluded to earlier, it's about one thing: relevancy. By providing users with more relevant products, shows, or songs, you're ultimately increasing their likelihood to purchase more and/or stay engaged longer.
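To make the collaborative-filtering idea concrete, here is a minimal sketch of user-based collaborative filtering: score items a user hasn't rated by the similarity-weighted ratings of other users. All the names and ratings below are invented toy data, and real systems use far more sophisticated models (matrix factorization, implicit feedback, and so on):

```python
from math import sqrt

# Toy user -> {item: rating} matrix; every value here is made up.
RATINGS = {
    "ana":  {"matrix": 5, "inception": 4, "up": 1},
    "ben":  {"matrix": 4, "inception": 5, "up": 2},
    "cara": {"matrix": 1, "up": 5, "frozen": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user, ratings, k=1):
    """Rank items the user hasn't seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Content-based filtering, the other family mentioned above, would instead compare item feature vectors (genre, tags, text) against a profile of what the user already liked, with no need for other users' ratings.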