Goto

Collaborating Authors

 debug




Agent-Environment Alignment via Automated Interface Generation

arXiv.org Artificial Intelligence

Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate the perception and action. However, mismatches often happen between the internal expectations of the agent regarding the influence of its issued actions and the actual state transitions in the environment, a phenomenon referred to as \textbf{agent-environment misalignment}. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface still remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose \textbf{ALIGN}, an \underline{A}uto-A\underline{l}igned \underline{I}nterface \underline{G}e\underline{n}eration framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains including embodied tasks, web navigation and tool-use, show consistent performance improvements, with up to a 45.67\% success rate improvement observed in ALFWorld. Meanwhile, ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP-MT/ALIGN.


debug-gym: A Text-Based Environment for Interactive Debugging

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.


CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

arXiv.org Artificial Intelligence

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.


Application-oriented automatic hyperparameter optimization for spiking neural network prototyping

arXiv.org Artificial Intelligence

Hyperparameter optimization (HPO) is of paramount importance in the development of high-performance, specialized artificial intelligence (AI) models, ranging from well-established machine learning (ML) solutions to the deep learning (DL) domain and the field of spiking neural networks (SNNs). The latter introduce further complexity due to the neuronal computational units and their additional hyperparameters, whose inadequate setting can dramatically impact the final model performance. At the cost of possible reduced generalization capabilities, the most suitable strategy to fully disclose the power of SNNs is to adopt an application-oriented approach and perform extensive HPO experiments. To facilitate these operations, automatic pipelines are fundamental, and their configuration is crucial. In this document, the Neural Network Intelligence (NNI) toolkit is used as reference framework to present one such solution, with a use case example providing evidence of the corresponding results. In addition, a summary of published works employing the presented pipeline is reported as possible source of insights into application-oriented HPO experiments for SNN prototyping.


Domain Expansion and Boundary Growth for Open-Set Single-Source Domain Generalization

arXiv.org Artificial Intelligence

Open-set single-source domain generalization aims to use a single-source domain to learn a robust model that can be generalized to unknown target domains with both domain shifts and label shifts. The scarcity of the source domain and the unknown data distribution of the target domain pose a great challenge for domain-invariant feature learning and unknown class recognition. In this paper, we propose a novel learning approach based on domain expansion and boundary growth to expand the scarce source samples and enlarge the boundaries across the known classes that indirectly broaden the boundary between the known and unknown classes. Specifically, we achieve domain expansion by employing both background suppression and style augmentation on the source data to synthesize new samples. Then we force the model to distill consistent knowledge from the synthesized samples so that the model can learn domain-invariant information. Furthermore, we realize boundary growth across classes by using edge maps as an additional modality of samples when training multi-binary classifiers. In this way, it enlarges the boundary between the inliers and outliers, and consequently improves the unknown class recognition during open-set generalization. Extensive experiments show that our approach can achieve significant improvements and reach state-of-the-art performance on several cross-domain image classification datasets.


The AI-Powered Future of Coding Is Near

WIRED

I am by no means a skilled coder, but thanks to a free program called SWE-agent, I was just able to debug and fix a gnarly problem involving a misnamed file within different code repositories on the software-hosting site GitHub. I pointed SWE-agent at an issue on GitHub and watched as it went through the code and reasoned about what might be wrong. It correctly determined that the root cause of the bug was a line that pointed to the wrong location for a file, then navigated through the project, located the file, and amended the code so that everything ran properly. It's the kind of thing that an inexperienced developer (such as myself) might spend hours trying to debug. Many coders already use artificial intelligence to write software more quickly.


Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

arXiv.org Artificial Intelligence

To evaluate code large language models (LLMs), research has relied on a few small manually curated benchmarks, such as HumanEval and MBPP, which represent a narrow part of the real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows Code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check if this round-trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand to a much broader set of domains and tasks which was not previously possible without costly human annotations.


The Quality of Auto-Generated Code

#artificialintelligence

Kevlin Henney and I were riffing on some ideas about GitHub Copilot, the tool for automatically generating code base on GPT-3's language model, trained on the body of code that's in GitHub. This article poses some questions and (perhaps) some answers, without trying to present any conclusions. First, we wondered about code quality. There are lots of ways to solve a given programming problem; but most of us have some ideas about what makes code "good" or "bad." Is it readable, is it well-organized?