
FunReason-MT Technical Report: Advanced Data Synthesis Solution for Real-world Multi-Turn Tool-use

Xu, Zengzhuang, Hao, Bingguang, Wang, Zechuan, Wen, Yuntao, Xu, Xinyi, Liu, Yang, Chen, Long, Wang, Dong, Wang, Maolin, Zhao, Tong, Chen, Yicheng, Peng, Cunyin, Gu, Jinjie, Gan, Leilei, Zhao, Xiangyu, Zhuang, Chenyi, Gu, Shi

arXiv.org Artificial Intelligence

Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges are threefold: targeted data synthesis, hard query construction, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied, high-quality trajectories with targeted tools, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on the Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model trained on FunReason-MT-generated data achieves state-of-the-art performance among comparable-sized models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.
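For readers unfamiliar with multi-turn function calling, a minimal sketch of what such a trajectory looks like may help. The tool schema and message shapes below follow the common JSON-schema style used by many FC-capable LLM APIs; the names and fields are illustrative examples, not FunReason-MT's actual data format:

```python
import json

# Hypothetical tool schema in the common JSON-schema style.
weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A two-turn trajectory: the second call depends on the first turn's result,
# the kind of multi-turn logical dependency the abstract highlights.
trajectory = [
    {"role": "user", "content": "Is it warmer in Paris or Rome right now?"},
    {"role": "assistant", "tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"role": "tool", "name": "get_weather", "content": json.dumps({"temp_c": 14})},
    {"role": "assistant", "tool_call": {"name": "get_weather", "arguments": {"city": "Rome"}}},
    {"role": "tool", "name": "get_weather", "content": json.dumps({"temp_c": 19})},
    {"role": "assistant", "content": "Rome is warmer (19 C vs 14 C)."},
]

# Data quality hinges on every tool call's arguments matching the schema.
for msg in trajectory:
    if "tool_call" in msg:
        args = msg["tool_call"]["arguments"]
        assert set(args) <= set(weather_tool["parameters"]["properties"])
```

Synthesizing many such trajectories, with harder queries and deeper inter-turn dependencies, is what the framework automates.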


Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Golubev, Alexander, Trofimova, Maria, Polezhaev, Sergei, Badertdinov, Ibragim, Nekrashevich, Maksim, Shevtsov, Anton, Karasik, Simon, Abramov, Sergey, Andriushchenko, Andrei, Fisin, Filipp, Skvortsov, Sergei, Yangel, Boris

arXiv.org Artificial Intelligence

Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving upon the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models.
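The rejection fine-tuning stage described above can be sketched as a simple filter over sampled trajectories: roll out the policy, keep only trajectories whose execution feedback (the test suite) reports success, and fine-tune on those. The helper names and the deterministic toy pass rule below are ours, not the paper's pipeline:

```python
def run_agent(task, attempt):
    """Stand-in for one policy rollout in the SWE environment.
    Returns (trajectory, tests_passed); the pass rule is a deterministic toy."""
    issue_id = int(task.rsplit("-", 1)[1])
    passed = (issue_id + attempt) % 4 == 0
    return f"trajectory-for-{task}-try{attempt}", passed

def rejection_finetuning_data(tasks, attempts_per_task=2):
    """Keep only trajectories whose patches pass execution feedback;
    these form the supervised fine-tuning set."""
    kept = []
    for task in tasks:
        for attempt in range(attempts_per_task):
            traj, passed = run_agent(task, attempt)
            if passed:
                kept.append(traj)
                break  # one verified trajectory per task suffices here
    return kept

data = rejection_finetuning_data([f"issue-{k}" for k in range(10)])
print(f"kept {len(data)} of 10 tasks")  # kept 5 of 10 tasks
```

The RL stage then improves on this filtered-imitation baseline by optimizing against the same execution feedback directly.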


From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

Kim, Gyeongwon James, Wilf, Alex, Morency, Louis-Philippe, Fried, Daniel

arXiv.org Artificial Intelligence

Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can implement scientific ideas when given varied amounts of code as a starting point, interpolating between reproduction (running code) and from-scratch replication (fully re-implementing and running code). We introduce AutoExperiment, a benchmark that evaluates AI agents' ability to implement and run machine learning experiments based on natural language descriptions in research papers. In each task, agents are given a research paper, a codebase with key functions masked out, and a command to run the experiment. The goal is to generate the missing code, execute the experiment in a sandboxed environment, and reproduce the results. AutoExperiment scales in difficulty by varying the number of missing functions $n$, ranging from partial reproduction to full replication. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Agents that can dynamically interact with the environment (e.g. to debug their code) can outperform agents in fixed "agentless" harnesses, and there exists a significant gap between single-shot and multi-trial success rates (Pass@1 vs. Pass@5), motivating verifier approaches to our benchmark. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution, establishing AutoExperiment as a new benchmark for evaluating progress in AI-driven scientific experimentation. Our data and code are open-sourced at https://github.com/j1mk1m/AutoExperiment .
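The Pass@1 vs. Pass@5 gap mentioned above is usually computed with the standard unbiased pass@k estimator (Chen et al., 2021); whether AutoExperiment uses exactly this estimator is an assumption here, but it is the conventional choice:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n attempts, of which c are correct, succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# An agent with 2 correct runs out of 10 trials shows a large Pass@1 vs Pass@5
# spread, the kind of gap that motivates verifier approaches.
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

A verifier that reliably picks the correct run out of five trials would lift effective Pass@1 toward the Pass@5 ceiling.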


DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal

Aggarwal, Vaibhav, Kamal, Ojasv, Japesh, Abhinav, Jin, Zhijing, Schölkopf, Bernhard

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute-scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference-time compute-scaling approach for coding agents that is faster and more effective at recovering from sub-optimal decisions than baselines. While traditional agents either follow linear trajectories or rely on random sampling to scale compute, DARS branches a trajectory at key decision points, taking an alternative action given the history of the trajectory and the execution feedback from the previous attempt at that point. We evaluate our approach on the SWE-bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
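The branching idea can be sketched as a small search over trajectory prefixes: when an attempt fails, keep the prefix up to a decision point and re-sample an alternative action conditioned on the prefix and the execution feedback. This is an illustrative sketch only, with hypothetical helpers standing in for the LLM and the test harness, not the authors' implementation:

```python
def solve(initial_actions, propose_alternative, evaluate, max_branches=3):
    """propose_alternative(prefix, feedback) -> new action (stand-in for the LLM);
    evaluate(actions) -> (passed, feedback) (stand-in for running the tests)."""
    frontier = [list(initial_actions)]
    for _ in range(max_branches):
        actions = frontier.pop(0)
        passed, feedback = evaluate(actions)
        if passed:
            return actions
        # Branch at the last decision point: keep the prefix, swap the action.
        prefix = actions[:-1]
        frontier.append(prefix + [propose_alternative(prefix, feedback)])
    return None

# Toy usage: the correct final action is "apply_patch_v2".
def evaluate(actions):
    return (actions[-1] == "apply_patch_v2", f"failed after {actions[-1]}")

def propose_alternative(prefix, feedback):
    return "apply_patch_v2"  # a real agent would condition an LLM on feedback

result = solve(["locate_bug", "apply_patch_v1"], propose_alternative, evaluate)
print(result)  # ['locate_bug', 'apply_patch_v2']
```

The contrast with random sampling is that the alternative action is informed by the failed attempt's feedback rather than drawn independently.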


EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

Abramovich, Talor, Udeshi, Meet, Shao, Minghao, Lieret, Kilian, Xi, Haoran, Milner, Kimberly, Jancheska, Sofija, Yang, John, Jimenez, Carlos E., Khorrami, Farshad, Krishnamurthy, Prashanth, Dolan-Gavitt, Brendan, Shafique, Muhammad, Narasimhan, Karthik, Karri, Ramesh, Press, Ofir

arXiv.org Artificial Intelligence

Although language model (LM) agents are demonstrating growing potential in many domains, their success in cybersecurity has been limited due to simplistic design and the lack of fundamental features for this domain. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. EnIGMA introduces new Agent-Computer Interfaces (ACIs) to improve the success rate on CTF challenges. We establish the novel Interactive Agent Tool concept, which enables LM agents to run interactive command-line utilities essential for these challenges. Empirical analysis of EnIGMA on over 350 CTF challenges from three different benchmarks indicates that providing a robust set of new tools with demonstration of their usage helps the LM solve complex problems and achieves state-of-the-art results on the NYU CTF and Intercode-CTF benchmarks. Finally, we discuss insights on ACI design and agent behavior on cybersecurity tasks that highlight the need to adapt real-world tools for LM agents.


SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Yang, John, Jimenez, Carlos E., Wettig, Alexander, Lieret, Kilian, Yao, Shunyu, Narasimhan, Karthik, Press, Ofir

arXiv.org Artificial Intelligence

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.


Command Line Interface (CLI) for Deep Learning Applications

#artificialintelligence

I bet you have seen, in movies, the IT guy hacking a system by typing commands into a black window and thought, "How cool is that!" Well, in reality, things are not that easy to hack, but we do have some basic commands for interacting with the computer, through what is called the command-line interface (CLI). The command-line interface is a program on your computer that allows you to create and delete files, run programs, and navigate through folders and files. On macOS and Linux systems, it's called the Terminal; on Windows, it's the Command Prompt. The CLI is not just a fancy way to interact with your computer.
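The basics the article alludes to, creating and deleting files, running programs, and navigating folders, look like this in a POSIX shell (macOS/Linux Terminal; the directory and file names are just examples):

```shell
# Navigate through folders
pwd                         # print the current directory
ls -l                       # list the files in it
mkdir -p demo               # create a folder
cd demo                     # move into it

# Create and delete files
touch notes.txt             # create an empty file
echo "hello" > notes.txt    # write a line into it
cat notes.txt               # print its contents
rm notes.txt                # delete it
cd .. && rmdir demo         # leave and remove the folder
```

Each command is a small program; chaining them (as in the last line) is where the CLI starts paying off for deep learning workflows such as launching training jobs and inspecting logs.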


BashPitfalls - Greg's Wiki

#artificialintelligence

This page is a compilation of common mistakes made by bash users. Each example is flawed in some way. The first pitfall is looping over the output of ls. Yes, it would be great if you could just treat the output of ls or find as a list of filenames and iterate over it, but this entire approach is fatally flawed, and there is no trick that can make it work; you must use an entirely different approach.

The problems are numerous. If a filename contains whitespace, it undergoes WordSplitting: assuming we have a file named 01 - Don't Eat the Yellow Snow.mp3 in the current directory, the for loop will iterate over each word in the resulting file name: 01, -, Don't, Eat, etc. If a filename contains glob characters, it undergoes filename expansion ("globbing"): if ls produces any output containing a * character, the word containing it will be recognized as a pattern and substituted with a list of all filenames that match it. If the command substitution returns multiple filenames, there is no way to tell where the first one ends and the second one begins, since pathnames may contain any character except NUL. Worse, depending on which platform you're on, which arguments you used (or didn't use), and whether its standard output is pointing to a terminal or not, ls may randomly decide to replace certain characters in a filename with "?", or simply not print them at all. Never try to parse the output of ls: it's an external command whose output is intended specifically to be read by a human, not parsed by a script. And if the first filename starts with a hyphen, it may lead to pitfall #3.

Quoting the substitution does not rescue it either. That may seem desirable since it preserves the newlines ls adds (though if the last filename in the list ends with a newline, $() will remove that one also), but it causes the entire output of ls to be treated as a single word: instead of iterating over each file name, the loop will only execute once, assigning to f a string with all the filenames rammed together. Nor can you simply change IFS to a newline, since filenames may contain newlines too.
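The example the paragraph dissects is the wiki's first pitfall, a loop of the form for f in $(ls *.mp3). A runnable demonstration of the word-splitting breakage and the glob-based fix (the demo directory name is ours):

```shell
mkdir -p demo && cd demo
touch "01 - Don't Eat the Yellow Snow.mp3" "02 - Track.mp3"

echo "--- broken ---"
for f in $(ls *.mp3); do     # WRONG: output undergoes word splitting and globbing
    echo "$f"                # prints 01, -, Don't, Eat, ... one word per line
done

echo "--- correct ---"
for f in ./*.mp3; do         # RIGHT: let the shell glob; no ls, no splitting
    [ -e "$f" ] || continue  # guard against the pattern matching nothing
    echo "$f"                # prints each full filename intact
done
cd .. && rm -rf demo
```

The glob expands directly to pathnames, so whitespace, glob characters, and leading hyphens in filenames never pass through word splitting at all.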


glorotxa/SME

#artificialintelligence

The architecture of this package was designed by Xavier Glorot (https://github.com/glorotxa). Update (Nov 13): the code for Translating Embeddings (see https://everest.hds.utc.fr/doku.php?id=en:transe) has been included, along with a new version for Freebase (FB15k). You need to install Theano to use these scripts. They also require Python 2.4 and Numpy 1.5.0. The experiment scripts are compatible with Jobman, but this library is not mandatory.