ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Liu, Jincheng, He, Sijun, Wu, Jingjing, Wang, Xiangsen, Chen, Yang, Kuang, Zhaoqi, Bao, Siqi, Yao, Yuan
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess demands complex strategic reasoning, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
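The abstract mentions a ranking algorithm and leaderboard without specifying them. As a minimal sketch of how such an arena could be ranked, here is a standard Elo update (an assumption on our part, not necessarily ChessArena's actual algorithm):

```python
# Hypothetical Elo-style rating update for a head-to-head arena.
# The paper does not specify its ranking algorithm; this is one
# conventional choice, shown for illustration only.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Example: an LLM rated 1000 loses to an engine rated 1100, so the
# LLM's rating drops and the engine's rises by the same amount.
new_llm, new_engine = elo_update(1000.0, 1100.0, score_a=0.0)
```

Because the loser was already expected to lose, the rating change is modest; an upset win would move both ratings much further.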
EEFSUVA: A New Mathematical Olympiad Benchmark
Khatibi, Nicole N., Radamovich, Daniil A., Brenner, Michael P.
Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.
Researchers Are Already Leaving Meta's New Superintelligence Lab
At least three artificial intelligence researchers have resigned from Meta's new superintelligence lab, just two months after CEO Mark Zuckerberg first announced the initiative. Two of the staffers have returned to OpenAI, where they both previously worked, after stints of less than a month at Meta, WIRED has confirmed. Ethan Knight worked at the ChatGPT maker earlier in his career but joined Meta from Elon Musk's xAI. A third researcher, Rishabh Agarwal, announced publicly on Monday that he was leaving Meta's lab as well. He joined the tech giant in April to work on generative AI projects before switching to a role at Meta Superintelligence Labs (MSL), according to his LinkedIn profile.
Arnold: a generalist muscle transformer policy
Chiappa, Alberto Silvio, An, Boshi, Simos, Merkourios, Li, Chengkun, Mathis, Alexander
Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.
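The abstract's key idea is a "sensorimotor vocabulary" that lets one transformer handle tasks with different observation and action spaces. As an illustrative sketch (not the authors' code; the modality names and sizes below are invented), one way to realize this is a per-modality projection into a shared token space, so each task contributes one token per modality it actually has:

```python
import random

# Illustrative sketch of a compositional "sensorimotor vocabulary":
# each sensory modality gets its own learned projection into a common
# embedding space, and a task's observation becomes a variable-length
# token sequence. Modality names and dimensions here are assumptions.

EMBED_DIM = 8
random.seed(0)

# One projection per modality; a novel task reuses whichever modality
# projections it shares with the training tasks.
modalities = {"muscle_length": 24, "joint_angle": 7, "object_pose": 6}
projections = {
    name: [[random.gauss(0, 0.1) for _ in range(EMBED_DIM)] for _ in range(dim)]
    for name, dim in modalities.items()
}

def project(vec, matrix):
    """Plain-list matrix-vector product: vec @ matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(EMBED_DIM)]

def tokenize(observation):
    """Map a task's (variable) observation dict to one token per modality."""
    return [project(observation[name], projections[name])
            for name in observation]

# A locomotion task might expose only two of the three modalities:
obs = {"muscle_length": [0.0] * 24, "joint_angle": [0.1] * 7}
tokens = tokenize(obs)  # 2 tokens, each of length EMBED_DIM
```

A transformer consumes such token sequences directly, which is what makes the variable observation spaces across embodiments tractable in a single policy.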
From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models
Liu, Jinyi, Zheng, Yan, Cheng, Rong, Wu, Qiyu, Guo, Wei, Ni, Fei, Liang, Hebin, Yuan, Yifu, Mao, Hangyu, Zhang, Fuzheng, Hao, Jianye
Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows that compromise logical coherence, and computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep-thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.
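The control flow described — atomic cognitive units orchestrated by a routing mechanism — can be sketched as a loop that inspects the current reasoning state and dispatches one small operation at a time. The unit names and routing rule below are invented for illustration; the paper's actual units and router are more elaborate:

```python
# Schematic sketch of atomic-level reasoning with cognitive routing.
# Unit names and the routing rule are illustrative assumptions, not
# the Atomic Reasoner paper's actual design.

def extract_facts(state):
    """Atomic unit: split the problem statement into individual facts."""
    state["facts"] = [f.strip() for f in state["problem"].split(";")]
    return state

def deduce(state):
    """Atomic unit: a toy deduction step (here, just count the facts)."""
    state["answer"] = len(state["facts"])
    return state

ATOMIC_UNITS = {"extract": extract_facts, "deduce": deduce}

def route(state):
    """Cognitive router: choose the next atomic unit from the state."""
    if "facts" not in state:
        return "extract"
    if "answer" not in state:
        return "deduce"
    return None  # reasoning complete

def atomic_reason(problem):
    state = {"problem": problem}
    while (unit := route(state)) is not None:
        state = ATOMIC_UNITS[unit](state)
    return state["answer"]

result = atomic_reason("A is left of B; B is left of C")  # two facts
```

The point of the structure is that each step is small and individually checkable, and the router (rather than a monolithic chain of thought) decides what happens next — which is the contrast the abstract draws with fragmented, exhaustive-search inference.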
Language Models Largely Exhibit Human-like Constituent Ordering Preferences
Tur, Ada Defne, Kamath, Gaurav, Reddy, Siva
Though English sentences are typically inflexible vis-à-vis word order, constituents often show far more variability in ordering. One prominent theory holds that constituent ordering is directly correlated with constituent weight: a measure of the constituent's length or complexity. Such theories are interesting in the context of natural language processing (NLP), because while recent advances in NLP have led to significant gains in the performance of large language models (LLMs), much remains unclear about how these models process language, and how this compares to human language processing. In particular, it remains an open question whether LLMs display the same patterns of constituent movement as humans; answering it may provide insights into existing theories of when and how such shifts occur in human language. We compare a variety of LLMs with diverse properties to evaluate broad LLM performance on four types of constituent movement: heavy NP shift, particle movement, dative alternation, and multiple PPs. Despite behaving unexpectedly on particle movement, LLMs generally align with human preferences around constituent ordering.
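A common way to measure such ordering preferences (a plausible reading of the methodology, not necessarily this paper's exact protocol) is the minimal-pair test: score both orderings of the same sentence under the model and take the higher-probability variant as the model's preference. A sketch, with a toy scorer standing in for a real LLM:

```python
import math

# Minimal-pair preference test for constituent ordering. The scoring
# function `token_logprob` stands in for an actual language model's
# per-token log-probabilities; the toy scorer below is an assumption
# made so the sketch runs without a model.

def sentence_logprob(sentence, token_logprob):
    """Sum per-token log-probs over a whitespace tokenization."""
    return sum(token_logprob(tok) for tok in sentence.split())

def prefers_shifted(canonical, shifted, token_logprob):
    """True if the model assigns higher probability to the shifted
    ordering, e.g. heavy-NP shift:
      canonical: 'she gave the very long heavy coat to charity'
      shifted:   'she gave to charity the coat'
    """
    return (sentence_logprob(shifted, token_logprob)
            > sentence_logprob(canonical, token_logprob))

# Toy scorer: longer tokens are treated as less probable, so heavier
# material is penalized wherever it appears.
toy = lambda tok: -math.log(1 + len(tok))
```

With a real model one would replace `toy` with summed log-probabilities from the LLM's forward pass, then aggregate preference rates over many minimal pairs per movement type (heavy NP shift, particle movement, dative alternation, multiple PPs).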