AITopics | directory

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

d8c6a37c4c94e9a63e53d296f1f668ae-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems, Feb-17-2026, 10:08:41 GMT

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Natick (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology (1.00)
Health & Medicine (1.00)
Law (0.93)
(2 more...)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Roig, JV

arXiv.org Artificial Intelligence, Dec-10-2025

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

granite 4, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2512.07497

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Yang, Zonghan, Wang, Shengjie, Fu, Kelin, He, Wenyang, Xiong, Weimin, Liu, Yibo, Miao, Yibo, Gao, Bofei, Wang, Yejie, Ma, Yingwei, Li, Yanhao, Liu, Yue, Hu, Zhenxing, Zhang, Kaitai, Wang, Shuyi, Chen, Huarong, Sung, Flood, Liu, Yang, Gao, Yang, Yang, Zhilin, Liu, Tianyu

arXiv.org Artificial Intelligence, Dec-9-2025

A contiguous chunk of lines to search for in the existing source code 4. The dividing line: ======= 5. The lines to replace into the source code 6. The end of the replace block: >>>>>>> REPLACE Here is an example: '''python ### mathweb/flask/app.py <<<<<<< SEARCH from flask import Flask ======= import math from flask import Flask >>>>>>> REPLACE ''' Please note that the * SEARCH/REPLACE * edit REQUIRES PROPER INDENTATION. If you would like to add the line ' print(x)', you must fully write that out, with all those spaces before the code! Wrap the * SEARCH/REPLACE * edit in blocks '''python...'''. The summary of the key differences between the trajectories should be in the thinking part.
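The SEARCH/REPLACE format quoted in the excerpt above can be illustrated with a small sketch. This is not the Kimi-Dev implementation: the function name `apply_search_replace`, the regular expression, and the rule that the SEARCH chunk must occur exactly once are all assumptions made for illustration. It only shows how a block delimited by `<<<<<<< SEARCH`, `=======`, and `>>>>>>> REPLACE` might be applied to a source string, using the Flask example from the excerpt.

```python
import re

# Hypothetical parser for one or more SEARCH/REPLACE blocks (illustrative only).
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_search_replace(source: str, edit: str) -> str:
    """Apply every SEARCH/REPLACE block found in `edit` to `source`, in order."""
    for match in BLOCK_RE.finditer(edit):
        search = match.group("search")
        replace = match.group("replace")
        # Assumption: the SEARCH chunk must match exactly once, indentation
        # included, so that the edit is unambiguous.
        if source.count(search) != 1:
            raise ValueError("SEARCH chunk must occur exactly once in the source")
        source = source.replace(search, replace, 1)
    return source


# The example edit from the excerpt, applied to a made-up two-line module.
original = "from flask import Flask\n\napp = Flask(__name__)\n"
edit = (
    "<<<<<<< SEARCH\n"
    "from flask import Flask\n"
    "=======\n"
    "import math\n"
    "from flask import Flask\n"
    ">>>>>>> REPLACE"
)
patched = apply_search_replace(original, edit)
print(patched)
```

Because the match is a plain substring comparison, any difference in leading spaces makes the SEARCH chunk fail to match, which is one way to read the excerpt's insistence on proper indentation.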

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2509.23045

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
(2 more...)

CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

Kim, Hyeonjae, Li, Chenyue, Deng, Wen, Jin, Mengxi, Huang, Wen, Lu, Mengqian, Yuan, Binhang

arXiv.org Artificial Intelligence, Nov-26-2025

Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility and thus perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.20109

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
(6 more...)

Genre:

Workflow (1.00)
Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

NetworkGym: Reinforcement Learning Environments

Neural Information Processing Systems, Nov-20-2025, 03:23:41 GMT

We make use of four internal 12 GB NVIDIA TITAN Xp GPUs to perform our experiments. At initialization of each environment, four UEs are randomly stationed 1.5 meters above the ... The LTE base station lies at (x, z) = (40 m, 3 m). We use random seed values from 0 to 63, inclusive, for this parameter. We train PTD3 for 10,000 steps, instead of 1,000,000 steps, which we do for TD3+BC.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country: North America > United States > California > San Diego County > San Diego (0.05)

Industry: Education (0.51)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)

FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use

Xu, Zengzhuang, Hao, Bingguang, Wang, Zechuan, Wen, Yuntao, Xu, Xinyi, Liu, Yang, Chen, Long, Wang, Dong, Wang, Maolin, Zhao, Tong, Chen, Yicheng, Peng, Cunyin, Gu, Jinjie, Gan, Leilei, Zhao, Xiangyu, Zhuang, Chenyi, Gu, Shi

arXiv.org Artificial Intelligence, Nov-18-2025

Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. The practical challenges are threefold: targeted data synthesis, hard query construction, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories with targeted tools, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on the Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT-generated data achieves state-of-the-art performance among comparable-sized models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.24645

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

3be60b4a739b95a07a944a1a2c41e05e-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems, Nov-14-2025, 00:57:41 GMT

artificial intelligence, biomarker, machine learning, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Harris County > Houston (0.14)
Asia > China (0.04)
North America > United States > Ohio > Cuyahoga County > Cleveland (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Strength High (0.68)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Simpliflow: A Lightweight Open-Source Framework for Rapid Creation and Deployment of Generative Agentic AI Workflows

Panchal, Deven

arXiv.org Artificial Intelligence, Nov-13-2025

Generative Agentic AI systems are emerging as a powerful paradigm for automating complex, multi-step tasks. However, many existing frameworks for building these systems introduce significant complexity, a steep learning curve, and substantial boilerplate code, hindering rapid prototyping and deployment. This paper introduces simpliflow, a lightweight, open-source Python framework designed to address these challenges. simpliflow enables the rapid development and orchestration of linear, deterministic agentic workflows through a declarative, JSON-based configuration. Its modular architecture decouples agent management, workflow execution, and post-processing, promoting ease of use and extensibility. By integrating with LiteLLM, it supports over 100 Large Language Models (LLMs) out-of-the-box. We present the architecture, operational flow, and core features of simpliflow, demonstrating its utility through diverse use cases ranging from software development simulation to real-time system interaction. A comparative analysis with prominent frameworks like LangChain and AutoGen highlights simpliflow's unique position as a tool optimized for simplicity, control, and speed in deterministic workflow environments.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.10675

Country: North America > United States (0.04)

Genre: Workflow (1.00)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Wu, Yunze, Fu, Dayuan, Si, Weiye, Huang, Zhen, Jiang, Mohan, Li, Keyu, Xia, Shijie, Sun, Jie, Xu, Tianze, Hu, Xiangkun, Lu, Pengrui, Cai, Xiaojie, Ye, Lyumanshan, Zhu, Wenhong, Xiao, Yang, Liu, Pengfei

arXiv.org Artificial Intelligence, Nov-4-2025

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and with long-horizon decision making, exhibiting impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to be a next-generation code-based research benchmark.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.27598

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Promising Solution (0.45)

Industry: Information Technology > Software (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Li, Junlong, Zhao, Wenshuo, Zhao, Jian, Zeng, Weihao, Wu, Haoze, Wang, Xiaochen, Ge, Rui, Cao, Yuxuan, Huang, Yuzhen, Liu, Wei, Liu, Junteng, Su, Zhaochen, Guo, Yiyang, Zhou, Fan, Zhang, Lueyang, Michelini, Juan, Wang, Xingyao, Yue, Xiang, Zhou, Shuyan, Neubig, Graham, He, Junxian

arXiv.org Artificial Intelligence, Oct-30-2025

Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior work, which primarily ensures functional realism but offers limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple Apps over about 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.25726

Country:

North America > United States > Pennsylvania (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Industry:

Banking & Finance (0.67)
Information Technology > Services (0.46)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
