playroom
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
Lu, Yi-Long, Zhang, Chunhui, Song, Jiajun, Fan, Lifeng, Wang, Wei
Theory of Mind (ToM), the ability to attribute mental states to others, is fundamental for human social intelligence and a critical capability for advanced Artificial Intelligence. Recent advancements in Large Language Models (LLMs) have shown promising performance on ToM benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets. Our results reveal a scale-dependent impact of RL: while RL significantly improves accuracy and fosters high-quality, interpretable, and transferable belief-tracking reasoning in larger models (7B), it leads to "reasoning collapse" in smaller models ($\leq$3B), where high accuracy and generalization ability are achieved via drastically shortened, less meaningful responses. Surprisingly, further SFT achieves competitive and generalizable performance across these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly trained to produce structured reasoning traces. These findings highlight a critical discrepancy between benchmark accuracy and the nature of learned reasoning. Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may leverage alternative rules effective for benchmark data structures.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Ye, Xi, Yin, Fangcong, He, Yinghui, Zhang, Joie, Yen, Howard, Gao, Tianyu, Durrett, Greg, Chen, Danqi
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Europe > Sweden > Stockholm > Stockholm (0.05)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.92)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
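The LongProc abstract above notes that deterministic procedures with structured outputs allow rule-based evaluation. As a rough illustration only, the sketch below scores a predicted TSV table against a reference by exact row matching. The function names (`parse_tsv`, `row_f1`) and the row-level F1 metric are assumptions made for this example and are not taken from the benchmark's actual scoring code.

```python
def parse_tsv(text):
    """Parse TSV text into a set of rows, each row a tuple of stripped cells."""
    rows = set()
    for line in text.strip("\n").splitlines():
        cells = tuple(cell.strip() for cell in line.split("\t"))
        if any(cells):  # skip blank lines
            rows.add(cells)
    return rows


def row_f1(predicted, reference):
    """Hypothetical metric: F1 over exactly matching rows (order-insensitive)."""
    pred_rows = parse_tsv(predicted)
    ref_rows = parse_tsv(reference)
    matched = len(pred_rows & ref_rows)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_rows)
    recall = matched / len(ref_rows)
    return 2 * precision * recall / (precision + recall)


reference_table = "Alice\t30\nBob\t25"
predicted_table = "Alice\t30\nBob\t52"  # second row is wrong
score = row_f1(predicted_table, reference_table)
```

Because the check is a plain set comparison, the score does not depend on a judge model, which is the property the abstract attributes to rule-based evaluation.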
Astro Bot review – glittering ideas make Team Asobi's 3D platformer a gem
When I say that Astro Bot reminds me of Super Mario Galaxy, I could pay it no higher compliment. It has taken me around its own small galaxy of planetoid-style levels, from bathhouses to diorama-sized jungle temples to rainy islands, each host to a brilliant one-shot idea, such as a pair of frog boxing gloves or a backpack monkey or a time-stopping watch that lets you freeze giant zooming darts in place so you can jump on them. It is splendid to witness this development team's creativity let loose. Team Asobi has previously made a couple of short-form Astro Bot games – one for the PSVR, Rescue Mission, and another that came packaged with the PS5 at launch, Astro's Playroom – but this one is full-length, complete with challenging bonus levels that play out like electrified skill-check gauntlets for the generation raised on 3D platformers. It is supremely funny and characterful, thanks to the titular chibi blue-and-white robot and his crowd of friends, many of whom are dressed up as characters from the most obscure crevices of PlayStation history.
A Glimpse in ChatGPT Capabilities and its impact for AI research
Joublin, Frank, Ceravola, Antonello, Deigmoeller, Joerg, Gienger, Michael, Franzius, Mathias, Eggert, Julian
Large language models (LLMs) have recently become a popular topic in the field of Artificial Intelligence (AI) research, with companies such as Google, Amazon, Facebook, Tesla, and Apple (GAFA) investing heavily in their development. These models are trained on massive amounts of data and can be used for a wide range of tasks, including language translation, text generation, and question answering. However, the computational resources required to train and run these models are substantial, and the cost of hardware and electricity can be prohibitive for research labs that do not have the funding and resources of the GAFA. In this paper, we will examine the impact of LLMs on AI research. The pace at which such models are generated as well as the range of domains covered is an indication of the trend which not only the public but also the scientific community is currently experiencing. We give some examples on how to use such models in research by focusing on GPT3.5/ChatGPT3.4 and ChatGPT4 at the current state and show that such a range of capabilities in a single system is a strong sign of approaching general intelligence. Innovations integrating such models will also expand along the maturation of such AI systems and exhibit unforeseeable applications that will have important impacts on several aspects of our societies.
- North America > United States (0.45)
- Europe > France (0.28)
- Asia > Malaysia (0.04)
- Research Report (1.00)
- Overview (0.92)
- Health & Medicine (1.00)
- Leisure & Entertainment > Games > Chess (0.46)
- Government > Regional Government > North America Government > United States Government (0.45)
Intrinsically Motivated Reinforcement Learning
Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. Psychologists distinguish between extrinsic motivation, which means being moved to do something because of some specific rewarding outcome, and intrinsic motivation, which refers to being moved to do something because it is inherently enjoyable. Intrinsic motivation leads organisms to engage in exploration, play, and other behavior driven by curiosity in the absence of explicit reward. These activities favor the development of broad competence rather than being directed to more externally-directed goals (e.g., ref. [14]). In contrast, machine learning algorithms are typically applied to single problems and so do not cope flexibly with new problems as they arise over extended periods of time. Although the acquisition of competence may not be driven by specific problems, this competence is routinely enlisted to solve many different specific problems over the agent's lifetime.
Learning Rational Subgoals from Demonstrations and Instructions
Luo, Zhezheng, Mao, Jiayuan, Wu, Jiajun, Lozano-Pérez, Tomás, Tenenbaum, Joshua B., Kaelbling, Leslie Pack
We present a framework for learning useful subgoals that support efficient long-term planning to achieve novel goals. At the core of our framework is a collection of rational subgoals (RSGs), which are essentially binary classifiers over the environmental states. RSGs can be learned from weakly-annotated data, in the form of unsegmented demonstration trajectories, paired with abstract task descriptions, which are composed of terms initially unknown to the agent (e.g., collect-wood then craft-boat then go-across-river). Our framework also discovers dependencies between RSGs, e.g., the task collect-wood is a helpful subgoal for the task craft-boat. Given a goal description, the learned subgoals and the derived dependencies facilitate off-the-shelf planning algorithms, such as A* and RRT, by setting helpful subgoals as waypoints to the planner, which significantly improves performance-time efficiency.
- North America > United States > Oregon (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Iowa (0.04)
- Workflow (0.69)
- Research Report (0.63)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
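The rational-subgoals abstract above describes subgoals as binary classifiers over states that serve as waypoints for an off-the-shelf planner. The sketch below illustrates that idea under heavy simplification: the subgoal predicates are hand-written rather than learned, breadth-first search stands in for A* or RRT, and the toy world (a five-cell line with wood at cell 1, a workbench at cell 3, and a river crossing at cell 4) is entirely hypothetical.

```python
from collections import deque


def bfs(start, is_goal, neighbors):
    """Breadth-first search; returns the state path to the first goal state, or None."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


def plan_via_subgoals(start, subgoals, neighbors):
    """Chain searches so that each subgoal predicate acts as a waypoint."""
    path = [start]
    for is_subgoal in subgoals:
        segment = bfs(path[-1], is_subgoal, neighbors)
        if segment is None:
            return None
        path.extend(segment[1:])  # drop the duplicated segment start
    return path


# Hypothetical toy world. State = (position, has_wood, has_boat).
def neighbors(state):
    pos, has_wood, has_boat = state
    for new_pos in (pos - 1, pos + 1):
        # Cell 4 lies across the river and can only be entered with a boat.
        if 0 <= new_pos <= 4 and (new_pos != 4 or has_boat):
            yield (new_pos, has_wood, has_boat)
    if pos == 1 and not has_wood:
        yield (pos, True, has_boat)   # collect-wood
    if pos == 3 and has_wood and not has_boat:
        yield (pos, has_wood, True)   # craft-boat


# Hand-written stand-ins for the learned binary subgoal classifiers.
subgoals = [
    lambda s: s[1],        # collect-wood achieved
    lambda s: s[2],        # craft-boat achieved
    lambda s: s[0] == 4,   # go-across-river achieved
]

plan = plan_via_subgoals((2, False, False), subgoals, neighbors)
```

Each search only has to reach the next waypoint, so the searched region stays small compared with a single search for the final goal; that is the intuition behind the efficiency gain the abstract reports, not a reproduction of its method.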
Zipfian environments for Reinforcement Learning
Chan, Stephanie C. Y., Lampinen, Andrew K., Richemond, Pierre H., Hill, Felix
As humans and animals learn in the natural world, they encounter distributions of entities, situations and events that are far from uniform. Typically, a relatively small set of experiences are encountered frequently, while many important experiences occur only rarely. The highly-skewed, heavy-tailed nature of reality poses particular learning challenges that humans and animals have met by evolving specialised memory systems. By contrast, most popular RL environments and benchmarks involve approximately uniform variation of properties, objects, situations or tasks. How will RL algorithms perform in worlds (like ours) where the distribution of environment features is far less uniform? To explore this question, we develop three complementary RL environments where the agent's experience varies according to a Zipfian (discrete power law) distribution. On these benchmarks, we find that standard Deep RL architectures and algorithms acquire useful knowledge of common situations and tasks, but fail to adequately learn about rarer ones. To understand this failure better, we explore how different aspects of current approaches may be adjusted to help improve performance on rare events, and show that the RL objective function, the agent's memory system and self-supervised learning objectives can all influence an agent's ability to learn from uncommon experiences. Together, these results show that learning robustly from skewed experience is a critical challenge for applying Deep RL methods beyond simulations or laboratories, and our Zipfian environments provide a basis for measuring future progress towards this goal.
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine (0.93)
- Information Technology (0.93)
- Leisure & Entertainment > Games > Computer Games (0.67)
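The Zipfian environments abstract above describes agent experience that follows a Zipfian (discrete power law) distribution. As a minimal sketch of what such sampling could look like, the code below draws task indices with probability proportional to 1 / rank^exponent. The function names, the number of tasks, and the exponent value are assumptions chosen for illustration and do not come from the paper's environments.

```python
import random
from collections import Counter


def zipfian_weights(n_items, exponent=1.0):
    """Normalized probabilities p(k) proportional to 1 / k**exponent, k = 1..n_items."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_tasks(n_items, n_samples, exponent=1.0, seed=0):
    """Draw task indices (0-based) so that low indices are common and high ones rare."""
    rng = random.Random(seed)
    weights = zipfian_weights(n_items, exponent)
    return rng.choices(range(n_items), weights=weights, k=n_samples)


weights = zipfian_weights(n_items=20)
counts = Counter(sample_tasks(n_items=20, n_samples=10_000))
```

With an exponent of 1, the most common task is sampled about twice as often as the second and twenty times as often as the last, so an agent trained on these draws sees rare tasks only a small fraction of the time.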
Five must-play games for your new PS5
It's that special time of year when the air is crisp, the holiday cheer is overwhelming and the only thing that sounds appealing is curling up on the couch with a big blanket and a brand new video game console. Luckily, the Xbox Series X and PlayStation 5 launched just over a month ago, and both of them are welcome living-room upgrades, offering more power, larger game worlds, and more seamless gameplay than the previous generation. In particular, the PS5 features a lineup of launch games worth your time. Here, we've collected five of the best titles available to play right now on PS5, just in time for your couch-based holiday plans. If you only play one game on PS5, make sure it's Miles Morales.