kaggle competition
PARC: An Autonomous Self-Reflective Coding Agent for Robust Execution of Long-Horizon Tasks
Orimo, Yuki, Kurata, Iori, Mori, Hodaka, Okuno, Ryuhei, Sawada, Ryohto, Okanohara, Daisuke
We introduce PARC, a coding agent for the autonomous and robust execution of long-horizon computational tasks. PARC is built on a hierarchical multi-agent architecture that incorporates task planning, execution, and a mechanism, termed self-assessment and self-feedback, that evaluates the agent's actions and their outcomes from an independent context and provides corrective feedback. This design enables PARC to detect and correct high-level strategic errors and sustain progress without human intervention. We evaluate PARC on computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium-ion conduction and alloy segregation; in particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end to end. In Kaggle-based experiments, starting from minimal natural-language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human-engineered baselines. These results highlight the potential of integrating a hierarchical multi-agent system with self-assessment and self-feedback to enable AI systems capable of independent, large-scale scientific and analytical work.
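As a rough illustration of the plan-execute-assess loop the abstract describes, the sketch below shows how an independent assessment step can gate progress and trigger retries. All names (`plan`, `execute`, `assess`) and the retry logic are assumptions for illustration, not PARC's actual interfaces; a real system would back each function with an LLM call.

```python
# Hypothetical sketch of a PARC-style plan / execute / self-assess loop.
# Function names and retry logic are illustrative, not PARC's actual API.
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    result: str = ""
    ok: bool = False


def plan(task: str) -> list[Step]:
    # A real planner would call an LLM; here we return fixed toy steps.
    return [Step("analyze data"), Step("run simulation"), Step("summarize")]


def execute(step: Step) -> Step:
    step.result = f"output of {step.action}"  # stand-in for real execution
    return step


def assess(step: Step) -> tuple[bool, str]:
    # Independent evaluator context: judge the outcome, return feedback.
    return True, "looks consistent"


def run(task: str, max_retries: int = 2) -> list[Step]:
    steps = plan(task)
    for step in steps:
        for _ in range(1 + max_retries):
            execute(step)
            step.ok, feedback = assess(step)
            if step.ok:
                break
            step.action += f" (revised: {feedback})"  # fold feedback back in
    return steps
```

The key design point is that `assess` runs in a context separate from `execute`, so strategic errors are caught by a judge that did not produce the output.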
CoMind: Towards Community-Driven Agents for Machine Learning Engineering
Li, Sijie, Sun, Weiwei, Li, Shanda, Talwalkar, Ameet, Yang, Yiming
Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a multi-agent system designed to actively integrate external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.
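The iterative parallel exploration mechanism can be pictured as a breadth-versus-depth loop: each round expands several candidates in parallel, then prunes to the strongest few. The toy sketch below uses an invented numeric "score" and random mutation as stand-ins for implementing and evaluating real solutions; it is not CoMind's code.

```python
# Toy sketch of iterative parallel exploration: widen, then prune.
# Scoring and mutation are invented stand-ins, not CoMind's implementation.
import random


def mutate(solution: float) -> float:
    # Stand-in for "implement a new idea on top of this solution".
    return solution + random.uniform(-0.1, 0.5)


def explore(rounds: int = 5, width: int = 4, keep: int = 2) -> float:
    random.seed(0)
    pool = [0.0]  # initial baseline solution's score
    for _ in range(rounds):
        # Breadth: expand every surviving solution into `width` candidates.
        candidates = [mutate(s) for s in pool for _ in range(width)]
        # Depth: keep only the best `keep` solutions for further refinement.
        pool = sorted(pool + candidates, reverse=True)[:keep]
    return max(pool)
```

Because the previous pool is retained before pruning, the best score never regresses across rounds, which is one simple way to trade breadth for depth safely.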
Scientific Algorithm Discovery by Augmenting AlphaEvolve with Deep Research
Liu, Gang, Zhu, Yihan, Chen, Jie, Jiang, Meng
Large language models hold promise as scientific assistants, yet existing agents either rely solely on algorithm evolution or on deep research in isolation, both of which face critical limitations. Pure algorithm evolution, as in AlphaEvolve, depends only on the internal knowledge of LLMs and quickly plateaus in complex domains, while pure deep research proposes ideas without validation, resulting in unrealistic or unimplementable solutions. We present DeepEvolve, an agent that integrates deep research with algorithm evolution, uniting external knowledge retrieval, cross-file code editing, and systematic debugging under a feedback-driven iterative loop. Each iteration not only proposes new hypotheses but also refines, implements, and tests them, avoiding both shallow improvements and unproductive over-refinements. Across nine benchmarks in chemistry, mathematics, biology, materials, and patents, DeepEvolve consistently improves the initial algorithm, producing executable new algorithms with sustained gains. By bridging the gap between unguided evolution and research without grounding, DeepEvolve provides a reliable framework for advancing scientific algorithm discovery. Our code is available at https://github.com/liugangcode/deepevolve.
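The feedback-driven loop of proposing, implementing, and testing hypotheses might be sketched as below. Here `retrieve_ideas`, `apply_idea`, and the scoring function are hypothetical stand-ins for deep research, code editing, and benchmark evaluation; they are not DeepEvolve's actual interfaces.

```python
# Minimal sketch, under assumed interfaces, of a DeepEvolve-style loop:
# retrieve external ideas, apply one to the current algorithm, test it,
# and keep the change only if the benchmark score improves.
def retrieve_ideas(topic: str) -> list[str]:
    # Stand-in for deep research; a real agent would search the literature.
    return ["add caching", "vectorize inner loop", "tune step size"]


def apply_idea(algorithm: dict, idea: str) -> dict:
    # Stand-in for cross-file code editing: record the idea as applied.
    return {**algorithm, "ideas": algorithm["ideas"] + [idea]}


def score(algorithm: dict) -> float:
    # Toy benchmark: each validated idea adds one point.
    return float(len(algorithm["ideas"]))


def evolve(topic: str, iterations: int = 3) -> dict:
    best = {"ideas": []}
    for idea in retrieve_ideas(topic)[:iterations]:
        candidate = apply_idea(best, idea)
        if score(candidate) > score(best):  # test before accepting
            best = candidate
    return best
```

The gate on `score` is what separates this loop from pure deep research: an idea that does not survive implementation and testing is simply dropped.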
ML2B: Multi-Lingual ML Benchmark For AutoML
Trofimova, Ekaterina, Shamina, Zosia, Selifanova, Maria, Zaitsev, Artem, Savchuk, Remi, Minets, Maxim, Ozerova, Daria, Sataev, Emil, Zuenko, Denis, Ustyuzhanin, Andrey E.
Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal a substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.
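A 15-45% degradation figure of this kind is naturally read as the relative drop from the English baseline. A minimal sketch of that arithmetic, with made-up per-language scores (the numbers below are not ML2B's results):

```python
# Relative performance drop versus an English baseline.
# The example scores are invented for illustration only.
def degradation(english: float, other: float) -> float:
    """Percent drop of `other` relative to the `english` baseline."""
    return 100.0 * (english - other) / english


scores = {"en": 0.80, "de": 0.68, "hi": 0.44}  # hypothetical benchmark scores
drops = {lang: degradation(scores["en"], s)
         for lang, s in scores.items() if lang != "en"}
```

With these illustrative numbers, the German and Hindi drops land at the two ends of a 15-45% range.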
c7bf0b7c1a86d5eb3be2c722cf2cf746-AuthorFeedback.pdf
We thank all the reviewers for their feedback. To address R1's concerns, we will attempt to separate this contribution from the broader conceptual review. Thank you for raising these valuable points. After updating our experiments per R4's request, we will clarify that the Sharing dataset was a Kaggle competition and that we used a GBM for our Credit dataset. The game-theoretic solution is the Shapley value [24] (R3).
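For context on the solution concept cited in the response: the Shapley value assigns each player the average of its marginal contributions over all orders in which the coalition can be assembled. A small self-contained example (the three-player characteristic function below is invented for illustration):

```python
# Exact Shapley values for a toy cooperative game, computed by averaging
# marginal contributions over all player orderings. Illustrative only.
from itertools import permutations


def shapley(players: list[str], value) -> dict[str, float]:
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition: list[str] = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            # Marginal contribution of p given who joined before it.
            phi[p] += value(frozenset(coalition)) - before
    return {p: total / len(orders) for p, total in phi.items()}


# Toy characteristic function: the grand coalition is worth 1, any pair 0.5.
def v(s: frozenset) -> float:
    return {3: 1.0, 2: 0.5}.get(len(s), 0.0)
```

By symmetry of this toy game, each player receives 1/3, and the values sum to the grand-coalition worth (efficiency).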
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents
Cai, Yifu, Li, Xinyu, Goswami, Mononito, Wiliński, Michał, Welter, Gus, Dubrawski, Artur
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
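The dual evaluation strategy (precise numeric checks for some artifacts, flexible LLM-based judgment for others) could look roughly like the dispatch below. The artifact fields and the keyword-based "judge" are illustrative stand-ins, not TimeSeriesGym's API; a real framework would route code and models to an actual LLM evaluator.

```python
# Sketch of dual evaluation: exact numeric grading for submission files,
# plus a stubbed LLM-style judgment for code. Names are assumptions.
def grade_submission(pred: list[float], truth: list[float],
                     tol: float = 1e-6) -> bool:
    # Precise numeric measure: lengths match and values agree within tol.
    return len(pred) == len(truth) and all(
        abs(p - t) <= tol for p, t in zip(pred, truth))


def grade_code(source: str) -> float:
    # Stand-in for an LLM judge: fraction of expected concepts present, 0..1.
    keywords = ["fit", "predict", "train"]
    return sum(k in source for k in keywords) / len(keywords)


def evaluate(artifacts: dict) -> dict:
    return {
        "submission_ok": grade_submission(artifacts["pred"], artifacts["truth"]),
        "code_score": grade_code(artifacts["code"]),
    }
```

The point of the split is that submission files admit an objective pass/fail, while code and models need a graded, contextual score.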
AIDE: AI-Driven Exploration in the Space of Code
Jiang, Zhengyao, Schmidt, Dominik, Srikanth, Dhruv, Xu, Dixing, Kaplan, Ian, Jacenko, Deniss, Wu, Yuxiang
Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind these advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI's MLE-Bench and METR's RE-Bench. The implementation of AIDE is publicly available at https://github.com/WecoAI/aideml.
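The tree-search framing can be sketched as greedy best-node expansion over candidate solutions: each step picks the most promising node and "refines" it into a child. The scoring and refinement below are toy stand-ins (a real agent would draft and validate code), not AIDE's implementation.

```python
# Toy sketch of trial-and-error as tree search over solution scores.
# Refinement and scoring are invented stand-ins, not AIDE's code.
import random


def refine(score: float) -> float:
    # A refined draft may improve or regress relative to its parent.
    return score + random.uniform(-0.2, 0.3)


def tree_search(steps: int = 10) -> float:
    random.seed(0)
    tree = [0.0]  # root: the initial solution's validation score
    for _ in range(steps):
        parent = max(tree)            # greedily expand the best node so far
        tree.append(refine(parent))   # child solution from refinement
    return max(tree)
```

Because every node stays in the tree, a regressing refinement never loses the best solution found so far; that retention is what lets extra compute be traded for performance.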