target number
FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance
Huang, Yu-Shiang, Lee, Yun-Yu, Chou, Tzu-Hsin, Lin, Che, Wang, Chuan-Ju
BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.15)
- Asia > South Korea > Seoul > Seoul (0.05)
- Asia > Taiwan > Taiwan Province > Taipei (0.05)
- (8 more...)
GEM: A Gym for Agentic LLMs
Liu, Zichen, Sims, Anya, Duan, Keyu, Chen, Changyu, Yu, Simon, Zhou, Xiangxin, Xu, Haotian, Xiong, Shaopan, Liu, Bo, Tan, Chenmien, Beh, Chuen Yang, Wang, Weixun, Zhu, Hao, Shi, Weiyan, Yang, Diyi, Shieh, Michael, Teh, Yee Whye, Lee, Wee Sun, Lin, Min
The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating using GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apple-to-apple benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
Smoothing-Based Conformal Prediction for Balancing Efficiency and Interpretability
Zheng, Mingyi, Jiang, Hongyu, Lu, Yizhou, Teng, Jiaye
Conformal Prediction (CP) is a distribution-free framework for constructing statistically rigorous prediction sets. While popular variants such as CD-split improve CP's efficiency, they often yield prediction sets composed of multiple disconnected subintervals, which are difficult to interpret. In this paper, we propose SCD-split, which incorporates smoothing operations into the CP framework. Such smoothing operations potentially help merge the subintervals, thus leading to interpretable prediction sets. Experimental results on both synthetic and real-world datasets demonstrate that SCD-split balances the interval length and the number of disconnected subintervals. Theoretically, under specific conditions, SCD-split provably reduces the number of disconnected subintervals while maintaining comparable coverage guarantees and interval length compared with CD-split.
- Asia > Middle East > Jordan (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game
Katz, Michael, Kokel, Harsha, Sreedharan, Sarath
There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for planning capabilities evaluation. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establishing the computational complexity result and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.
- North America > United States > Colorado > Larimer County > Fort Collins (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
Humans expect rationality and cooperation from LLM opponents in strategic games
Barak, Darija, Costa-Gomes, Miguel
As Large Language Models (LLMs) integrate into our social and economic interactions, we need to deepen our understanding of how humans respond to LLMs opponents in strategic settings. We present the results of the first controlled monetarily-incentivised laboratory experiment looking at differences in human behaviour in a multi-player p-beauty contest against other humans and LLMs. We use a within-subject design in order to compare behaviour at the individual level. We show that, in this environment, human subjects choose significantly lower numbers when playing against LLMs than humans, which is mainly driven by the increased prevalence of `zero' Nash-equilibrium choices. This shift is mainly driven by subjects with high strategic reasoning ability. Subjects who play the zero Nash-equilibrium choice motivate their strategy by appealing to perceived LLM's reasoning ability and, unexpectedly, propensity towards cooperation. Our findings provide foundational insights into the multi-player human-LLM interaction in simultaneous choice games, uncover heterogeneities in both subjects' behaviour and beliefs about LLM's play when playing against them, and suggest important implications for mechanism design in mixed human-LLM systems.
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Lee, Andrew, Sun, Lihao, Wendler, Chris, Viégas, Fernanda, Wattenberg, Martin
How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as ``success'' or ``incorrect''. Bottom-up, we find that ``previous-token heads'' are mainly responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and a general reasoning DeepSeek-R1 model.
Benchmarking Multi-Object Grasping
Chen, Tianze, Frumento, Ricardo, Pagnanelli, Giulia, Cei, Gianmarco, Keth, Villa, Gafarov, Shahaddin, Gong, Jian, Ye, Zihe, Baracca, Marco, D'Avella, Salvatore, Bianchi, Matteo, Sun, Yu
--In this work, we describe a multi-object grasping benchmark to evaluate the grasping and manipulation capabilities of robotic systems in both pile and surface scenarios. The benchmark introduces three robot multi-object grasping benchmarking protocols designed to challenge different aspects of robotic manipulation. These protocols are: 1) the Only-Pick-Once protocol, which assesses the robot's ability to efficiently pick multiple objects in a single attempt; 2) the Accurate pick-trnsferring protocol, which evaluates the robot's capacity to selectively grasp and transport a specific number of objects from a cluttered environment; and 3) the Pick-transferring-all protocol, which challenges the robot to clear an entire scene by sequentially grasping and transferring all available objects. These protocols are intended to be adopted by the broader robotics research community, providing a standardized method to assess and compare robotic systems' performance in multi-object grasping tasks. We establish baselines for these protocols using standard planning and perception algorithms on a Barrett hand, Robotiq parallel jar gripper, and the Pisa/IIT Softhand-2, which is a soft underactuated robotic hand. We discuss the results in relation to human performance in similar tasks we well. The authors are from the Robot Perception and Action Lab (RP AL) of Computer Science and Engineering Department, University of South Florida, Tampa, FL 33620, USA. The authors are with the Research Center "E. The author is with is with Rutgers University, New Brunswick, NJ 08901, USA. Related work was finished when Zihe Y e was a Master's student in the RP AL lab at USF. The author is with the Department of Excellence in Robotics & AI, Mechanical Intelligence Institute, Scuola Superiore Sant'Anna, Pisa, Italy.
- North America > United States > Florida > Hillsborough County > Tampa (0.54)
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.24)
- North America > United States > New Jersey > Middlesex County > New Brunswick (0.24)
- (2 more...)
SNAP: Sequential Non-Ancestor Pruning for Targeted Causal Effect Estimation With an Unknown Graph
Schubert, Mátyás, Claassen, Tom, Magliacane, Sara
Causal discovery can be computationally demanding for large numbers of variables. If we only wish to estimate the causal effects on a small subset of target variables, we might not need to learn the causal graph for all variables, but only a small subgraph that includes the targets and their adjustment sets. In this paper, we focus on identifying causal effects between target variables in a computationally and statistically efficient way. This task combines causal discovery and effect estimation, aligning the discovery objective with the effects to be estimated. We show that definite non-ancestors of the targets are unnecessary to learn causal relations between the targets and to identify efficient adjustments sets. We sequentially identify and prune these definite non-ancestors with our Sequential Non-Ancestor Pruning (SNAP) framework, which can be used either as a preprocessing step to standard causal discovery methods, or as a standalone sound and complete causal discovery algorithm. Our results on synthetic and real data show that both approaches substantially reduce the number of independence tests and the computation time without compromising the quality of causal effect estimations.
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > California (0.04)
- (4 more...)
K-Level Reasoning with Large Language Models
Zhang, Yadong, Mao, Shaoguang, Ge, Tao, Wang, Xun, Xia, Yan, Lan, Man, Wei, Furu
While Large Language Models (LLMs) have demonstrated their proficiency in complex reasoning tasks, their performance in dynamic, interactive, and competitive scenarios - such as business strategy and stock market analysis - remains underexplored. To bridge this gap, we formally explore the dynamic reasoning capabilities of LLMs for decision-making in rapidly evolving environments. We introduce two game theory-based pilot challenges that mirror the complexities of real-world dynamic decision-making. These challenges are well-defined, enabling clear, controllable, and precise evaluation of LLMs' dynamic reasoning abilities. Through extensive experiments, we find that existing reasoning methods tend to falter in dynamic settings that require k-level thinking - a key concept not tackled by previous works. To address this, we propose a novel reasoning approach for LLMs, named "K-Level Reasoning". This approach adopts the perspective of rivals to recursively employ k-level thinking based on available historical information, which significantly improves the prediction accuracy of rivals' subsequent moves and informs more strategic decision-making. This research not only sets a robust quantitative benchmark for the assessment of dynamic reasoning but also markedly enhances the proficiency of LLMs in dynamic contexts.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China (0.04)
Large-scale Generative AI Models Lack Visual Number Sense
Testolin, Alberto, Hou, Kuinan, Zorzi, Marco
Humans can readily judge the number of objects in a visual scene, even without counting, and such a skill has been documented in a variety of animal species and in babies prior to language development and formal schooling. Numerical judgments are error-free for small sets, while for larger collections responses become approximate, with variability increasing proportionally to the target number. This response pattern is observed for items of all kinds, despite variation in object features (such as color or shape), suggesting that our visual number sense relies on abstract representations of numerosity. Here, we investigated whether generative Artificial Intelligence (AI) models based on large-scale transformer architectures can reliably name the number of objects in simple visual stimuli or generate images containing a target number of items in the 1-10 range. Surprisingly, none of the foundation models considered performed in a human-like way: They all made striking errors even with small numbers, the response variability often did not increase in a systematic way, and the pattern of errors varied with object category. Our findings demonstrate that advanced AI systems still lack a basic ability that supports an intuitive understanding of numbers, which in humans is foundational for numeracy and mathematical development.