Adjara
TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use
He, Pengfei, Dai, Zhenwei, He, Bing, Liu, Hui, Tang, Xianfeng, Lu, Hanqing, Li, Juanhui, Ding, Jiayuan, Mukherjee, Subhabrata, Wang, Suhang, Xing, Yue, Tang, Jiliang, Dumoulin, Benoit
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate the LLMs' tool use capability, they largely focus on the final answers yet overlook the detailed tool usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as similar tool confusion and parameter-blind selection, and scaling behavior with tool diversity and trajectory length where the bottleneck of transiting from short to mid-length trajectories is revealed, offering actionable guidance for LLMs' tool use.
- Europe > Austria > Vienna (0.14)
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- Europe > France (0.04)
- (35 more...)
- Leisure & Entertainment (1.00)
- Consumer Products & Services > Travel (1.00)
- Media > Music (0.96)
- (2 more...)
Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data
Carslaw, Iona, Milton, Sivan, Navarre, Nicolas, Qing, Ciyang, Uegaki, Wataru
For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- (7 more...)
Speed, Quality, and the Optimal Timing of Complex Decisions: Field Evidence
Sunde, Uwe, Zegners, Dainis, Strittmatter, Anthony
This paper presents an empirical investigation of the relation between decision speed and decision quality for a real-world setting of cognitively-demanding decisions in which the timing of decisions is endogenous: professional chess. Move-by-move data provide exceptionally detailed and precise information about decision times and decision quality, based on a comparison of actual decisions to a computational benchmark of best moves constructed using the artificial intelligence of a chess engine. The results reveal that faster decisions are associated with better performance. The findings are consistent with the predictions of procedural decision models like drift-diffusion-models in which decision makers sequentially acquire information about decision alternatives with uncertain valuations.
- Asia > Russia > Ural Federal District > Tyumen Oblast > Khanty-Mansi Autonomous Okrug > Khanty-Mansiysk (0.04)
- South America > Paraguay (0.04)
- Europe > Norway (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.69)