 readme


CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

arXiv.org Artificial Intelligence

Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, and thus perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.


The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

arXiv.org Artificial Intelligence

Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.


Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

arXiv.org Artificial Intelligence

Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.


Position: More Rigorous Software Engineering Would Improve Reproducibility in Machine Learning Research

arXiv.org Artificial Intelligence

Experimental verification and falsification of scholarly work are part of the scientific method's core. To improve the Machine Learning (ML) community's ability to verify results from prior work, we argue for more robust software engineering. We estimate the adoption of common engineering best practices by examining repository links from all recently accepted International Conference on Machine Learning (ICML), International Conference on Learning Representations (ICLR) and Neural Information Processing Systems (NeurIPS) papers as well as ICML papers over time. Based on the results, we recommend how we, as a community, can improve reproducibility in ML research.


CodeS: Natural Language to Code Repository via Multi-Layer Sketch

arXiv.org Artificial Intelligence

The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple yet effective framework CodeS, which decomposes NL2Repo into multiple sub-tasks by a multi-layer sketch. Specifically, CodeS includes three modules: RepoSketcher, FileSketcher, and SketchFiller. RepoSketcher first generates a repository's directory structure for given requirements; FileSketcher then generates a file sketch for each file in the generated structure; SketchFiller finally fills in the details for each function in the generated file sketch. To rigorously assess CodeS on the NL2Repo task, we carry out evaluations through both automated benchmarking and manual feedback analysis. For benchmark-based evaluation, we craft a repository-oriented benchmark, SketchEval, and design an evaluation metric, SketchBLEU. For feedback-based evaluation, we develop a VSCode plugin for CodeS and engage 30 participants in conducting empirical studies. Extensive experiments prove the effectiveness and practicality of CodeS on the NL2Repo task.
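The three-stage decomposition described above can be illustrated with a minimal sketch. This is not the CodeS implementation: the function names, the prompt strings, and the `fake_generate` stand-in for a language model are all hypothetical, chosen only to show how a repository sketch, per-file sketches, and filled-in bodies could be chained.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated text.
Generator = Callable[[str], str]


def repo_sketcher(requirements: str, generate: Generator) -> List[str]:
    """Stage 1: propose the repository's file paths, one per line."""
    listing = generate(f"List the file paths for a repository that satisfies:\n{requirements}")
    return [line.strip() for line in listing.splitlines() if line.strip()]


def file_sketcher(path: str, requirements: str, generate: Generator) -> str:
    """Stage 2: produce a sketch (signatures only) for a single file."""
    return generate(f"Sketch signatures for {path} given:\n{requirements}")


def sketch_filler(sketch: str, generate: Generator) -> str:
    """Stage 3: fill in the function bodies of a file sketch."""
    return generate(f"Fill in function bodies for:\n{sketch}")


def nl2repo(requirements: str, generate: Generator) -> Dict[str, str]:
    """Chain the three stages and return a mapping of path -> file content."""
    repo: Dict[str, str] = {}
    for path in repo_sketcher(requirements, generate):
        sketch = file_sketcher(path, requirements, generate)
        repo[path] = sketch_filler(sketch, generate)
    return repo


def fake_generate(prompt: str) -> str:
    """Deterministic placeholder used only for this demonstration (an assumption)."""
    if prompt.startswith("List the file paths"):
        return "README.md\nsrc/main.py"
    if prompt.startswith("Sketch"):
        return "def main(): ..."
    return "def main():\n    print('hello')"


repo = nl2repo("A program that prints hello", fake_generate)
```

Passing the generator in as a parameter keeps each stage testable in isolation; a real system would replace `fake_generate` with a model call and add validation between stages.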


README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP

arXiv.org Artificial Intelligence

The advancement in healthcare has shifted focus toward patient-centric approaches, particularly in self-care and patient education, facilitated by access to Electronic Health Records (EHR). However, medical jargon in EHRs poses significant challenges in patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 20,000 unique medical terms and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as the training data for models and leveraged a Retrieval-Augmented Generation (RAG) method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source mobile-friendly models, when fine-tuned with high-quality data, are capable of matching or even surpassing the performance of state-of-the-art closed-source large language models like ChatGPT. This research represents a significant stride in closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.


LARCH: Large Language Model-based Automatic Readme Creation with Heuristics

arXiv.org Artificial Intelligence

Writing a readme is a crucial aspect of software development as it plays a vital role in managing and reusing program code. Though it is a pain point for many developers, automatically creating one remains a challenge even with the recent advancements in large language models (LLMs), because it requires generating an abstract description from thousands of lines of code. In this demo paper, we show that LLMs are capable of generating coherent and factually correct readmes if we can identify a code fragment that is representative of the repository. Building upon this finding, we developed LARCH (LLM-based Automatic Readme Creation with Heuristics) which leverages representative code identification with heuristics and weak supervision. Through human and automated evaluations, we illustrate that LARCH can generate coherent and factually correct readmes in the majority of cases, outperforming a baseline that does not rely on representative code identification. We have made LARCH open-source and provided a cross-platform Visual Studio Code interface and command-line interface, accessible at https://github.com/hitachi-nlp/larch. A demo video showcasing LARCH's capabilities is available at https://youtu.be/ZUKkh5ED-O4.


October 2022: "Top 40" New CRAN Packages

#artificialintelligence

One hundred seventy-four new packages made it to CRAN in October. Here are my “Top 40” selections in sixteen categories: Astronomy, Biology, Business, Computational Methods, Data, Ecology, Finance, Genomics, Mathematics, Machine Learning, Medicine, Pharma, Statistics, Time Series, Utilities, Visualization.

Astronomy
skylight v1.1: Provides a function to calculate sky illuminance values (in lux) for both the sun and moon. The model is a verbatim translation of the code by Janiczek and DeYoung (1987). There are vignettes for Use and Advanced Use.

Biology
palaeoverse v1.0.0: Provides tools to support data preparation and exploration for palaeobiological analyses including functions for data cleaning, binning (time and space), summarisation and visualisation with the goals of improving code reproducibility and accessibility and establishing standards for the palaeobiological community. See Jones et al. for details, and the contribution guide to get involved.
pirouette v1.6.5: Implements a method to create a Bayesian posterior from a phylogeny that depicts the true evolutionary relationships. See Richèl et al. (2020) for background. There are several vignettes including a Tutorial, a demo, and a guide showing how to use the package in a scientific experiment.

Business
bupaverse v0.1.0: Facilitates loading the packages comprising the bupaverse, an integrated suite of R packages for handling and analysing business process data, developed by the Business Informatics research group at Hasselt University, Belgium. See the Getting Started Guide.

Computational Methods
fastWavelets v1.0.1: Provides an Rcpp implementation of the Maximal Overlap Discrete Wavelet Transform (MODWT) and the À Trous Discrete Wavelet Transform. See Quilty & Adamowski (2018) for background and README for examples.
gips v1.0.0: Employs the methods described in Graczyk et al. (2022) to find the permutation symmetry group under which the covariance matrix of the data is invariant. See the vignettes Optimizers, Theory, and gips.
HomomorphicEncryption v0.1.0: Implements the Brakerski-Fan-Vercauteren (2012), Brakerski-Gentry-Vaikuntanathan (2014), and Cheon-Kim-Kim-Song (2016) schema for fully homomorphic encryption. There are seven short vignettes including BFV, BGV, and CKKS.
rxode2random v2.0.9: Implements parallel random number generation. See Wang et al. (2016) and Fidler et al. (2019) for background and README for an example.

Data
airnow v0.1.0: Provides functions to retrieve U.S. Government AirNow air quality data. See README to get started.
amazonadsR v0.1.0: Provides functions to collect data on digital marketing campaigns using the Windsor.ai API. See the tutorial for an example and also look at the related new packages: bingadsR, facebookadsR, googleadsR, instagramadsR, linkedinadsR, pinterestadsR, redditadsR, snapchatadsR, ticktokadsR, twitteradsR. Pablo Sanchez was on a roll in October.
congress v0.0.1: Provides functions to download and read data on United States congressional proceedings through the Congress.gov API of the Library of Congress. See README for an example.

Ecology
canaper v1.0.0: Provides functions to analyze the spatial distribution of biodiversity, especially useful in the categorical analysis of neo- and paleo-endemism (CANAPE) as described in Mishler et al. (2014), and for statistical tests to determine the types of endemism that occur in a study area while accounting for the evolutionary relationships of species. There are vignettes on CANAPE, randomization, and parallel computing.
EcoEnsemble v1.0.1: Provides functions to fit and sample from the ensemble model described in Spence et al. (2018). There is an Introduction and there are two additional vignettes: ExploringPriors and SyntheticData.
rTRIPLEXCWFlux v0.2.0: Encodes the carbon uptake submodule and evapotranspiration submodule of the TRIPLEX-CW-Flux model to run the simulation of carbon-water coupling. See Zhou et al. (2008) and Monteith (1965) for background and the vignette for examples.
stopdetection v0.1.1: Enables stop detection in time-stamped trajectories by implementing the Stay Point detection algorithm originally described in Ye (2009), which uses time and distance thresholds to characterize spatial regions as stops. See the vignette for examples.

Finance
highOrderPortfolios v0.1.0: Implements methods to select portfolios using high order moments to characterize return distributions. See Zhou & Palomar (2021) and Wang et al. (2022) for the theory and the vignette to get started.
MSTest v0.1.0: Implements hypothesis testing procedures described in Hansen (1992), Carrasco, Hu, & Ploberger (2014) and Dufour & Luger (2017) that can be used to identify the number of regimes in Markov switching models. See README for an example.

Genomics
metevalue v0.1.13: Implements the e-value method to correct p-values in omics data association studies. See Hebestreit & Klein (2022) and Akalin et al. (2012) for background and the vignette for an example.
SCpubr v1.0.4: Implements a system that provides a streamlined way of generating publication ready plots for known Single-Cell transcriptomics data. Look here for an online reference manual.

Mathematics
Boov v1.0.0: Provides functions to perform the Boolean operations union, difference and intersection on volumes. Computations are done by the C++ library CGAL. See README for some examples. Also, have a look at the package MinkowskiSum.
fitode v0.1.1: Provides methods and functions for fitting ordinary differential equations that use sensitivity equations to compute gradients of ODE trajectories with respect to underlying parameters. See the vignette for details.
manifold v0.1.1: Implements operations for Riemannian manifolds including geodesic distance, Riemannian metric, and exponential and logarithm maps, and also incorporates a random object generator on the manifolds. See Dai, Lin, and Müller (2021) for details.

Machine Learning
SoftBart v1.0.1: Implements the SoftBart model described by Linero and Yang (2018) with the optional use of a sparsity-inducing prior to allow for variable selection. The vignette contains theory and examples.
tidyfit v0.5.1: Extends the tidy data environment with functions to fit and cross validate linear regression and classification algorithms on grouped data. There are several vignettes including Predicting Boston House Prices, Multinomial Classification, and Rolling Window Time Series Regression.

Medicine
cities v0.1.0: Provides functions to simulate clinical trials and summarize causal effects and treatment policy estimands in the presence of intercurrent events. Have a look at the demo.
RCT2 v0.0.1: Implements various statistical methods for designing and analyzing two-stage randomized controlled trials using the methods developed by Imai, Jiang, and Malani (2021) and Imai, Jiang, and Malani (2022). There are vignettes on Interference and Causal Inference.

Pharma
DTSEA v0.0.3: Implements a novel tool to identify candidate drugs against a particular disease based on the drug target set enrichment analysis. It assumes the most effective drugs are those with a closer affinity in the protein-protein interaction network to the specified disease. See Gómez-Carballa et al. (2022) and Feng et al. (2022) for disease expression profiles, Wishart et al. (2018) and Gaulton et al. (2017) for drug target information, and Kanehisa et al. (2021) for the details of KEGG database. There is a vignette.
nlmixr2lib v0.1.0: Provides tools to create model libraries for nlmixr2. Models include pharmacokinetic, pharmacodynamic, and disease models used in pharmacometrics. See the vignette Creating a model library.

Statistics
aIc v1.0: Implements a set of tests for compositional pathologies, including tests for coherence of correlations as suggested by Erb et al. (2020), compositional dominance of distance, compositional perturbation invariance as suggested by Aitchison (1992), and singularity of the covariation matrix. See the vignette for details and examples.
ktweedie v1.0.1: Uses Reproducing Kernel Hilbert Space methods to implement Tweedie compound Poisson gamma models with high-dimensional predictors for the analyses of zero-inflated response variables. See the vignette for examples.
missoNet v1.0.0: Implements efficient procedures for fitting conditional graphical lasso models linking predictor variables to response variables or tasks, when the response data may contain missing values. See the vignette for examples.
ShapleyOutlier v0.1.0: Provides methods to use Shapley values to detect, explain, and cellwise impute multivariate outliers. See Mayrhofer and Filzmoser (2022) for details and the vignette for examples.
SpatialfdaR v1.0.0: Provides functions that implement finite element analysis methods for spatial functional data analysis. See Sangalli et al. (2013) and Bernardi et al. (2018) for background and the vignette for an example.

Time Series
dfms v0.1.3: Provides a user friendly and computationally efficient approach to estimate linear Gaussian dynamic factor models using Kalman filter and EM algorithm methods. See Doz et al. (2011) and Banbura & Modugno (2014) for background and the vignette for examples.

Utilities
ExclusionTable v1.0.0: Provides functions for creating tables of excluded observations by reporting the number before and after each subset() call together with the number of observations that have been excluded. See the vignette.
shiny.tailwind v0.2.2: Allows TailwindCSS to be used in Shiny apps with just-in-time compiling including custom CSS with @apply directive, and custom tailwind configurations. See README for examples.

Visualization
AlphaHull3D v1.1.0: Provides functions to compute the alpha hull of a set of points (informally: the shape formed by these points) in 3D space. See README for some visualizations, and also have a look at the related packages MeshesTools, and PolygonSoup.
bangladesh v1.0.0: Provides sf objects, shape files, and functions to draw regional choropleth maps for Bangladesh. See the vignette.
ggstats v0.1.0: Provides functions to create forest plots of regression model coefficients along with new statistics to compute proportions, weighted mean and cross-tabulation statistics, as well as new geometries to add alternative background color to a plot. There are vignettes on plotting coefficients and on computing cross-tabulation, custom proportions, and weighted means.
jagshelper v0.1.11: Provides tools to streamline Bayesian analyses in JAGS, including functions for extracting output, streamlining assessment of convergence, and producing summary plots. See the vignette for examples.
roughsf v1.0.0: Provides functions to draw maps, including “sketchy”, hand-drawn-like maps using the JavaScript library Roughjs. See README for examples.
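
The stay point idea mentioned for stopdetection (time and distance thresholds marking a region as a stop) can be sketched roughly as follows. This is a simplified illustration in Python, not the stopdetection package's code or API: it assumes planar coordinates in metres rather than longitude/latitude, and the names `Fix`, `Stop`, and `detect_stops` are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class Fix:
    t: float  # timestamp in seconds
    x: float  # planar easting in metres (assumption: already projected)
    y: float  # planar northing in metres


@dataclass
class Stop:
    start: float  # time of the first fix in the stop
    end: float    # time of the last fix in the stop
    x: float      # mean x of the fixes in the stop
    y: float      # mean y of the fixes in the stop


def detect_stops(fixes: List[Fix], max_dist: float, min_time: float) -> List[Stop]:
    """Return stops: runs of fixes within max_dist of an anchor fix lasting >= min_time."""
    stops: List[Stop] = []
    i, n = 0, len(fixes)
    while i < n:
        anchor = fixes[i]
        j = i + 1
        # Extend the window while later fixes stay within max_dist of the anchor.
        while j < n and hypot(fixes[j].x - anchor.x, fixes[j].y - anchor.y) <= max_dist:
            j += 1
        window = fixes[i:j]
        if window[-1].t - window[0].t >= min_time:
            mean_x = sum(f.x for f in window) / len(window)
            mean_y = sum(f.y for f in window) / len(window)
            stops.append(Stop(window[0].t, window[-1].t, mean_x, mean_y))
            i = j  # continue after the stop
        else:
            i += 1  # not a stop; try the next fix as anchor
    return stops


track = [
    Fix(0, 0, 0), Fix(60, 1, 0), Fix(120, 0, 1), Fix(180, 1, 1),  # lingering near origin
    Fix(240, 100, 0),                                              # moving
    Fix(300, 200, 0), Fix(360, 201, 0), Fix(420, 200, 1), Fix(480, 201, 1),
]
stops = detect_stops(track, max_dist=10, min_time=150)
```

With these thresholds the sample track yields two stops, one near the origin and one near x = 200. For real GPS data the Euclidean distance would need to be replaced by a geodesic distance.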


GitHub - facebookresearch/nle: The NetHack Learning Environment

#artificialintelligence

The NetHack Learning Environment (NLE) is a Reinforcement Learning environment presented at NeurIPS 2020. NLE is based on NetHack 3.6.6 and designed to provide a standard RL interface to the game, and comes with tasks that function as a first step to evaluate agents on this new environment. NetHack is one of the oldest and arguably most impactful videogames in history, as well as being one of the hardest roguelikes currently being played by humans. It is procedurally generated, rich in entities and dynamics, and overall an extremely challenging environment for current state-of-the-art RL agents, while being much cheaper to run compared to other challenging testbeds. Through NLE, we wish to establish NetHack as one of the next challenges for research in decision making and machine learning.


March: "Top 40" New CRAN Packages

#artificialintelligence

Two hundred and six new packages stuck to CRAN in March. Here are my "Top 40" selections in thirteen categories: Computational Methods, Data, Finance, Game Theory, Genomics, Machine Learning, Medicine, Networks, Science, Statistics, Time Series, Utilities, and Visualization. Provides functions to perform 2D Delaunay triangulation, constrained or unconstrained, with the help of the CDT C library. Look here for a list of algorithms. Offers tools for downloading and extracting data from the Copernicus Agrometeorological indicators from 1979 to present derived from reanalysis (AgERA5) dataset.