codename
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
Hakimov, Sherzod, Pfennigschmidt, Lara, Schlangen, David
This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to identify factors affecting their performance. The evaluation reveals details about the models' strategies, challenging cases, and limitations.
- Europe > Austria > Vienna (0.14)
- North America > Canada (0.04)
- Europe > Germany > Brandenburg > Potsdam (0.04)
- (8 more...)
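The clue-giver/guesser loop described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' framework: `give_clue` and `guess` are hypothetical stubs standing in for prompting an LLM with the board state.

```python
# Minimal sketch of the two-role Codenames loop: one side produces a
# (clue, count) pair covering several targets, the other guesses words.

def give_clue(targets):
    # A real clue giver would prompt an LLM with the team's target words
    # and the rest of the board; this stub returns a fixed clue.
    return ("fruit", len(targets))

def guess(board, clue, count):
    # A real guesser would rank board words by relatedness to the clue;
    # this stub just returns the first `count` board words.
    return board[:count]

def play_turn(board, targets):
    clue, count = give_clue(targets)
    guesses = guess(board, clue, count)
    # Score: how many guesses hit the intended targets.
    hits = [g for g in guesses if g in targets]
    return clue, guesses, len(hits)

board = ["apple", "banana", "car", "river"]
targets = ["apple", "banana"]
clue, guesses, hits = play_turn(board, targets)
print(clue, hits)  # -> fruit 2
```

Replacing the two stubs with actual model calls is what turns this skeleton into the self-play evaluation setup the abstract describes.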
Improving Cooperation in Language Games with Bayesian Inference and the Cognitive Hierarchy
Bills, Joseph, Archibald, Christopher, Blaylock, Diego
In two-player cooperative games, agents can play together effectively when they have accurate assumptions about how their teammate will behave, but may perform poorly when these assumptions are inaccurate. In language games, failure may be due to disagreement in the understanding of either the semantics or pragmatics of an utterance. We model coarse uncertainty in semantics using a prior distribution over language models and uncertainty in pragmatics using the cognitive hierarchy, combining the two aspects into a single prior distribution over possible partner types. Fine-grained uncertainty in semantics is modeled using noise added to the embeddings of words in the language. To handle all forms of uncertainty, we construct agents that learn the behavior of their partner using Bayesian inference and use this information to maximize the expected value of a heuristic function. We test this approach by constructing Bayesian agents for the game of Codenames, and show that they perform better in experiments where semantics is uncertain.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
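The partner-type inference described in the abstract above can be sketched with a simple Bayes update. This is a toy illustration under assumed numbers: the two partner types, their action likelihoods, and the action names are invented for the example, not taken from the paper.

```python
# Maintain a prior over partner types and update it with Bayes' rule
# after each observed teammate action: P(type | action) is proportional
# to P(action | type) * P(type).

def posterior(prior, likelihoods, observed_action):
    unnorm = {t: prior[t] * likelihoods[t][observed_action] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

# Two hypothetical partner types with different action distributions.
prior = {"literal": 0.5, "lateral": 0.5}
likelihoods = {
    "literal": {"safe_guess": 0.9, "risky_guess": 0.1},
    "lateral": {"safe_guess": 0.3, "risky_guess": 0.7},
}

post = posterior(prior, likelihoods, "risky_guess")
print(post)  # the "lateral" type becomes much more probable
```

In the paper's fuller setting, the agent would then pick the action maximizing the expected heuristic value under this posterior, rather than just reporting it.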
Codenames as a Benchmark for Large Language Models
Stephenson, Matthew, Sidji, Matthew, Ronval, Benoît
In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still suffer in lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.
Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method
Xia, Tian, He, Zhiwei, Ren, Tong, Miao, Yibo, Zhang, Zhuosheng, Yang, Yang, Wang, Rui
Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents' bargaining abilities remains an open problem. For the first time, we formally describe the Bargaining task as an asymmetric incomplete-information game, defining the gains of the Buyer and Seller over multiple bargaining processes. This allows us to quantitatively assess an agent's performance in the Bargaining task. We collected a real product price dataset, AmazonHistoryPrice, and conducted evaluations of various LLM agents' bargaining abilities. We find that playing a Buyer is much harder than playing a Seller, and that increasing model size does not effectively improve the Buyer's performance. To address this challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of the Buyer's offers, and an LLM Narrator to create natural language sentences for the generated offers. Experimental results show that OG-Narrator improves the Buyer's deal rates from 26.67% to 88.88% and yields a tenfold increase in profits across all baselines, even for a model that has not been aligned.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (2 more...)
- Information Technology (0.46)
- Health & Medicine (0.46)
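The decoupling in OG-Narrator described above can be sketched in a few lines: a deterministic component fixes the numeric offers, and a (stubbed) narrator turns each number into an utterance. The geometric concession schedule and the narrator wording below are illustrative assumptions, not the paper's actual generator.

```python
# Deterministic Offer Generator: start low and concede geometrically
# toward the list price, never exceeding it. An LLM Narrator (stubbed
# here) only verbalizes the offer; it never chooses the number itself.

def offer_schedule(list_price, start_ratio=0.5, step=1.1, n_rounds=5):
    offers = []
    ratio = start_ratio
    for _ in range(n_rounds):
        offers.append(round(min(list_price, list_price * ratio), 2))
        ratio *= step
    return offers

def narrate(offer):
    # Placeholder for the LLM Narrator: number -> natural language.
    return f"How about ${offer}? That feels fair to me."

offers = offer_schedule(100.0)
print(offers)            # offers rise toward the list price
print(narrate(offers[0]))
```

Keeping the numbers out of the LLM's hands is the design point: the model cannot concede too fast or quote an out-of-range price, which is what lifts the Buyer's deal rate in the paper's experiments.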
Adapting to Teammates in a Cooperative Language Game
Archibald, Christopher, Brosnahan, Spencer
The game of Codenames has recently emerged as a domain of interest for intelligent agent design. The game is unique due to the way that language and coordination between teammates play important roles. Previous approaches to designing agents for this game have utilized a single internal language model to determine action choices. This often leads to good performance with some teammates and inferior performance with others, as the agent cannot adapt to any specific teammate. In this paper we present the first adaptive agent for playing Codenames. We adopt an ensemble approach with the goal of determining, during the course of interacting with a specific teammate, which of our internal expert agents, each potentially with its own language model, is the best match. One difficulty faced in this approach is the lack of a single numerical metric that accurately captures the performance of a Codenames team. Prior Codenames research has utilized a handful of different metrics to evaluate agent teams. We propose a novel single metric to evaluate the performance of a Codenames team, whether playing a single-team (solitaire) game or a competitive game against another team. We then present and analyze an ensemble agent which selects an internal expert on each turn in order to maximize this proposed metric. Experimental analysis shows that this ensemble approach adapts to individual teammates and often performs nearly as well as the best internal expert with a given teammate. Crucially, this success does not depend on any previous knowledge about the teammates, the ensemble agents, or their compatibility. This research represents an important step toward making language-based agents for cooperative language settings like Codenames more adaptable to individual teammates.
- North America > United States > Utah > Utah County > Provo (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Bing: "I will not harm you unless you harm me first"
Last week, Microsoft announced the new AI-powered Bing: a search interface that incorporates a language-model-powered chatbot that can run searches for you and summarize the results, plus do all of the other fun things that models like GPT-3 and ChatGPT have been demonstrating over the past few months: the ability to generate poetry, and jokes, and do creative writing, and so much more. This week, people have started gaining access to it via the waiting list. It's increasingly looking like this may be one of the most hilariously inappropriate applications of AI that we've seen yet. If you haven't been paying attention, here's what's transpired so far. The demo that introduced AI Bing to the world was really compelling: they showed shopping comparison, and trip itinerary planning, and financial statement summarization. Then Dmitri Brereton did some fact checking against the examples from the demo. Bing said that the cons of the "Bissell Pet Hair Eraser Handheld Vacuum" included a "short cord length of 16 feet", when that vacuum has no cord at all, and that "it's noisy enough to scare pets" when online reviews note that it's really quiet. Update: My apologies to Bing, it turns out there is indeed a corded version of this vacuum with a 16 foot cord. It recommended a "rustic and charming" bar in Mexico City without noting that it's also one of the oldest gay bars in Mexico City.
- North America > Mexico > Mexico City > Mexico City (0.45)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Best applications of Deep Reinforcement Learning 2022 part4
Abstract: Although most reinforcement learning research has centered on competitive games, little work has been done on applying it to co-operative multiplayer games or text-based games. Codenames is a board game that involves both asymmetric co-operation and natural language processing, which makes it an excellent candidate for advancing RL research. To my knowledge, this work is the first to formulate Codenames as a Markov Decision Process and apply some well-known reinforcement learning algorithms such as SAC, PPO, and A2C to the environment. Although none of the above algorithms converge for the Codenames environment, neither do they converge for a simplified environment called ClickPixel, except when the board size is small.

Abstract: In this paper, we employ multiple UAVs coordinated by a base station (BS) to help the ground users (GUs) to offload their sensing data.
- Leisure & Entertainment > Games (0.59)
- Telecommunications (0.39)
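The MDP formulation mentioned in the first abstract above can be sketched with a minimal environment: the state is the board plus revealed flags, an action reveals a word, and the reward depends on whose word was revealed. The word lists, reward values, and episode termination rule below are illustrative assumptions, not the paper's exact setup.

```python
# Codenames as a bare-bones MDP: step(action) reveals a board word and
# returns (observation, reward, done), the shape RL algorithms like
# SAC/PPO/A2C expect from an environment.

class CodenamesEnv:
    def __init__(self, team_words, other_words):
        self.board = list(team_words) + list(other_words)
        self.team = set(team_words)
        self.revealed = set()

    def step(self, action):
        # Action: index of a board word to reveal.
        word = self.board[action]
        self.revealed.add(word)
        reward = 1.0 if word in self.team else -1.0
        done = self.team <= self.revealed  # all team words found
        return word, reward, done

env = CodenamesEnv(["apple", "river"], ["car", "snake"])
word, reward, done = env.step(0)
print(word, reward, done)  # -> apple 1.0 False
```

A full formulation would also encode the clue into the observation and handle the spymaster's action space, which is where the language-processing difficulty the abstract mentions comes in.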
Playing Codenames with Language Graphs and Word Embeddings
Koyyalagunta, Divya, Sun, Anna, Draelos, Rachel Lea (Duke University), Rudin, Cynthia (Duke University)
Although board games and video games have been studied for decades in artificial intelligence research, challenging word games remain relatively unexplored. Word games are not as constrained as games like chess or poker. Instead, word game strategy is defined by the players' understanding of the way words relate to each other. The word game Codenames provides a unique opportunity to investigate common sense understanding of relationships between words, an important open challenge. We propose an algorithm that can generate Codenames clues from the language graph BabelNet or from any of several embedding methods - word2vec, GloVe, fastText or BERT. We introduce a new scoring function that measures the quality of clues, and we propose a weighting term called DETECT that incorporates dictionary-based word representations and document frequency to improve clue selection. We develop BabelNet-Word Selection Framework (BabelNet-WSF) to improve BabelNet clue quality and overcome the computational barriers that previously prevented leveraging language graphs for Codenames. Extensive experiments with human evaluators demonstrate that our proposed innovations yield state-of-the-art performance, with up to 102.8% improvement in precision@2 in some cases.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Middle East > Egypt (0.04)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- (6 more...)
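An embedding-based clue score of the kind this abstract describes can be sketched as follows. This is a toy version under assumed vectors: the tiny hand-made 2-d "embeddings" stand in for word2vec/GloVe vectors, and the min-minus-max scoring rule is an illustrative choice, not the paper's exact DETECT-weighted function.

```python
# Score a candidate clue by its closeness to the target words minus its
# closeness to the other board words, using cosine similarity.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clue_score(clue_vec, target_vecs, other_vecs):
    # Reward the worst-case target similarity, penalize the
    # best-case similarity to a non-target word.
    return (min(cosine(clue_vec, t) for t in target_vecs)
            - max(cosine(clue_vec, o) for o in other_vecs))

# Toy 2-d "embeddings": fruit-like words cluster on the first axis.
emb = {
    "fruit": (1.0, 0.1),
    "apple": (0.9, 0.2),
    "banana": (0.8, 0.1),
    "car": (0.1, 1.0),
}
score = clue_score(emb["fruit"], [emb["apple"], emb["banana"]], [emb["car"]])
print(round(score, 3))  # positive: "fruit" sits near the targets
```

Ranking all vocabulary words by such a score and proposing the top one is the basic clue-generation loop; the paper's contribution layers dictionary-based representations and document frequency on top of it.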
Intel Pledges First Commercial Nervana Product 'Spring Crest' in 2019
At its AI developer conference in San Francisco yesterday, Intel embraced a holistic approach to AI and showed off a broad AI portfolio that includes Xeon processors, Movidius technologies, FPGAs and Intel's Nervana Neural Network Processors (NNPs), based on the technology it acquired in 2016. In his opening keynote, Naveen Rao, general manager of Intel's artificial intelligence products group and former CEO of Nervana Systems, revealed that the first commercial Nervana product will debut in late 2019 and will be called NNP L-1000 (codename: Spring Crest). Intel anticipates that Spring Crest will offer 3-4x the training performance of its development product Lake Crest. Originally scheduled for availability last year, Lake Crest is being used as a software development vehicle to gather feedback from early partners. This is reminiscent of how Intel handled its first Phi product, Knights Ferry, a development prototype that was never widely available.