 arnold


Online child safety advocates urge California lawmakers to increase protections

Los Angeles Times

While child safety advocates agree progress was made at the state capital this year to protect children online, they argue there’s still a long way to go and plan to fight for more protections when legislators reconvene in January.


ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Liu, Jincheng, He, Sijun, Wu, Jingjing, Wang, Xiangsen, Chen, Yang, Kuang, Zhaoqi, Bao, Siqi, Yao, Yuan

arXiv.org Artificial Intelligence

Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard. The testbed can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.


What does 'chance of precipitation' really mean? A meteorologist explains.

Popular Science

Here's how to figure out if you can leave the umbrella at home. It's not always "when it rains, it pours." Understanding the weather forecast can sometimes feel like reading tea leaves.



EEFSUVA: A New Mathematical Olympiad Benchmark

Khatibi, Nicole N, Radamovich, Daniil A., Brenner, Michael P.

arXiv.org Artificial Intelligence

Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.


Researchers Are Already Leaving Meta's New Superintelligence Lab

WIRED

At least three artificial intelligence researchers have resigned from Meta's new superintelligence lab, just two months after CEO Mark Zuckerberg first announced the initiative. Two of the staffers have returned to OpenAI, where they both previously worked, after less than one-month stints at Meta, WIRED has confirmed. Ethan Knight worked at the ChatGPT maker earlier in his career but joined Meta from Elon Musk's xAI. A third researcher, Rishabh Agarwal, announced publicly on Monday he was leaving Meta's lab as well. He joined the tech giant in April to work on generative AI projects before switching to a role at Meta Superintelligence Labs (MSL), according to his LinkedIn profile.


Arnold: a generalist muscle transformer policy

Chiappa, Alberto Silvio, An, Boshi, Simos, Merkourios, Li, Chengkun, Mathis, Alexander

arXiv.org Artificial Intelligence

Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.



From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models

Liu, Jinyi, Zheng, Yan, Cheng, Rong, Wu, Qiyu, Guo, Wei, Ni, Fei, Liang, Hebin, Yuan, Yifu, Mao, Hangyu, Zhang, Fuzheng, Hao, Jianye

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows that compromise logical coherence, and intensive computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.


Language Models Largely Exhibit Human-like Constituent Ordering Preferences

Tur, Ada Defne, Kamath, Gaurav, Reddy, Siva

arXiv.org Artificial Intelligence

Though English sentences are typically inflexible vis-à-vis word order, constituents often show far more variability in ordering. One prominent theory presents the notion that constituent ordering is directly correlated with constituent weight: a measure of the constituent's length or complexity. Such theories are interesting in the context of natural language processing (NLP), because while recent advances in NLP have led to significant gains in the performance of large language models (LLMs), much remains unclear about how these models process language, and how this compares to human language processing. In particular, the question remains whether LLMs display the same patterns of constituent movement, and whether they may provide insights into existing theories on when and how the shift occurs in human language. We compare a variety of LLMs with diverse properties to evaluate broad LLM performance on four types of constituent movement: heavy NP shift, particle movement, dative alternation, and multiple PPs. Despite performing unexpectedly around particle movement, LLMs generally align with human preferences around constituent ordering.