ChessGPT: Bridging Policy Learning and Language Modeling

Feng, Xidong

Neural Information Processing Systems

Chess, one of the oldest and most universally played board games, presents an ideal testbed due to the wealth of both policy data and language data. In terms of policy data, it is reported that over ten million games are played daily on Chess.com, the most frequented online chess platform.



Non-Monotonic S4F Standpoint Logic (Extended Version with Proofs)

Gorczyca, Piotr, Strass, Hannes

arXiv.org Artificial Intelligence

Standpoint logics offer unified modal logic-based formalisms for representing multiple heterogeneous viewpoints. At the same time, many non-monotonic reasoning frameworks can be naturally captured using modal logics, in particular the modal logic S4F. In this work, we propose a novel formalism called S4F Standpoint Logic, which generalises both S4F and propositional standpoint logic and is therefore capable of expressing multi-viewpoint, non-monotonic semantic commitments. We define its syntax and semantics and analyze its computational complexity, obtaining the result that S4F Standpoint Logic is not computationally harder than its constituent logics, whether in monotonic or non-monotonic form. We also outline mechanisms for credulous and sceptical acceptance and illustrate the framework with an example.



Energy Approach from $\varepsilon$-Graph to Continuum Diffusion Model with Connectivity Functional

Yang, Yahong, Lee, Sun, Calder, Jeff, Hao, Wenrui

arXiv.org Machine Learning

We derive an energy-based continuum limit for $\varepsilon$-graphs endowed with a general connectivity functional. We prove that the discrete energy and its continuum counterpart differ by at most $O(\varepsilon)$; the prefactor involves only the $W^{1,1}$-norm of the connectivity density as $\varepsilon\to0$, so the error bound remains valid even when that density has strong local fluctuations. As an application, we introduce a neural-network procedure that reconstructs the connectivity density from edge-weight data and then embeds the resulting continuum model into a brain-dynamics framework. In this setting, the usual constant diffusion coefficient is replaced by the spatially varying coefficient produced by the learned density, yielding dynamics that differ significantly from those obtained with conventional constant-diffusion models.
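
The abstract does not spell out the discrete energy, so the following is only a minimal sketch of a generic Dirichlet-type energy on an $\varepsilon$-graph, assuming one-dimensional points and the common $1/(n^2\varepsilon^{d+2})$ normalisation from the graph-Laplacian literature. The function name `epsilon_graph_energy` and the `density` argument (a stand-in for the connectivity weighting) are hypothetical and not taken from the paper.

```python
def epsilon_graph_energy(points, values, eps, density=lambda x, y: 1.0):
    """Dirichlet-type energy of `values` on the epsilon-graph over 1-D `points`.

    Two points are connected when they lie within distance `eps`.
    `density(x, y)` is a hypothetical edge weighting (assumption, not the
    paper's connectivity functional); it defaults to a constant weight.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and abs(points[i] - points[j]) <= eps:
                weight = density(points[i], points[j])
                total += weight * (values[i] - values[j]) ** 2
    # Assumed normalisation: n^2 * eps^(d + 2) with d = 1.
    return total / (n * n * eps ** 3)


# Eleven evenly spaced points on [0, 1].
pts = [i / 10 for i in range(11)]

flat = epsilon_graph_energy(pts, [1.0] * len(pts), eps=0.25)  # constant function
ramp = epsilon_graph_energy(pts, pts, eps=0.25)               # u(x) = x
```

A constant function has no differences across edges, so its energy vanishes, while the ramp has strictly positive energy; the energy is quadratic in the function values, so doubling them multiplies it by four.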


APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training

Qin, Jiarui, Xi, Yunjia, Huang, Junjie, Rui, Renting, Yin, Di, Liu, Weiwen, Yu, Yong, Zhang, Weinan, Sun, Xing

arXiv.org Artificial Intelligence

With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potential during pre-training and guide model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, including software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.


Georgia arrests three Chinese nationals for trying to illegally buy uranium

BBC News

Three Chinese nationals have been arrested in Georgia on suspicion of attempting to illegally purchase 2kg of uranium. Lasha Maghradze, deputy head of the nation's State Security Service (SSG), told a news briefing the group planned to pay $400,000 (£300,570) for the nuclear material in the capital, Tbilisi, before transporting it to China via Russia. The alleged plot was unearthed by intelligence agents while one member of the group was attempting to buy the radioactive substance on the black market, he said. The three pleaded not guilty at a court in Tbilisi and have been placed in custody to prevent them fleeing the country, according to public broadcaster Georgia Today. They face up to five years in prison under a provision of Georgia's criminal code banning the purchasing of nuclear material.


TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

He, Pengfei, Dai, Zhenwei, He, Bing, Liu, Hui, Tang, Xianfeng, Lu, Hanqing, Li, Juanhui, Ding, Jiayuan, Mukherjee, Subhabrata, Wang, Suhang, Xing, Yue, Tang, Jiliang, Dumoulin, Benoit

arXiv.org Artificial Intelligence

Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing work evaluates LLMs' tool-use capability, it largely focuses on final answers and overlooks the detailed tool-usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool-use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection, argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as confusion between similar tools and parameter-blind selection, as well as scaling behavior with tool diversity and trajectory length, exposing a bottleneck in the transition from short to mid-length trajectories and offering actionable guidance for LLMs' tool use.
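
To make the three trajectory-level diagnostics concrete, here is a minimal sketch of how tool selection, argument correctness, and dependency/order satisfaction might be scored against a reference trajectory. The `ToolCall` schema, the `trajectory_diagnostics` function, the scoring formulas, and the flight-booking tools are all hypothetical illustrations; the benchmark's actual metric definitions are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step of a trajectory (hypothetical schema)."""
    tool: str
    args: tuple = ()  # sorted (name, value) pairs, so calls are hashable


def call(tool, **args):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(tool, tuple(sorted(args.items())))


def trajectory_diagnostics(predicted, reference, dependencies=()):
    """Score a predicted trajectory against a reference one.

    `dependencies` holds (earlier_tool, later_tool) pairs that must appear
    in that order in the prediction. All three formulas are assumptions.
    """
    pred_tools = [c.tool for c in predicted]
    ref_tools = {c.tool for c in reference}

    # Tool selection: share of reference tools the prediction used at all.
    selected = ref_tools & set(pred_tools)
    tool_selection = len(selected) / len(ref_tools) if ref_tools else 1.0

    # Argument correctness: among reference calls whose tool was selected,
    # share that the prediction reproduced with identical arguments.
    candidates = [c for c in reference if c.tool in selected]
    pred_calls = set(predicted)
    argument_correctness = (
        sum(c in pred_calls for c in candidates) / len(candidates)
        if candidates else 0.0
    )

    # Order satisfaction: share of dependency pairs respected, judged by
    # the first occurrence of each tool in the prediction.
    first_seen = {}
    for index, name in enumerate(pred_tools):
        first_seen.setdefault(name, index)
    respected = sum(
        1 for earlier, later in dependencies
        if earlier in first_seen and later in first_seen
        and first_seen[earlier] < first_seen[later]
    )
    order_satisfaction = respected / len(dependencies) if dependencies else 1.0

    return {
        "tool_selection": tool_selection,
        "argument_correctness": argument_correctness,
        "order_satisfaction": order_satisfaction,
    }


reference = [
    call("search_flights", origin="SFO", dest="JFK"),
    call("book_flight", flight_id="F1"),
]
predicted = [
    call("search_flights", origin="SFO", dest="LAX"),  # right tool, wrong argument
    call("book_flight", flight_id="F1"),
]
report = trajectory_diagnostics(
    predicted, reference, dependencies=[("search_flights", "book_flight")]
)
```

In this example the prediction picks both tools and orders them correctly but passes a wrong destination, so only the argument score drops. That is the kind of error a final-answer-only evaluation would not localize.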