kaggle competition
PARC: An Autonomous Self-Reflective Coding Agent for Robust Execution of Long-Horizon Tasks
Orimo, Yuki, Kurata, Iori, Mori, Hodaka, Okuno, Ryuhei, Sawada, Ryohto, Okanohara, Daisuke
We introduce PARC, a coding agent for the autonomous and robust execution of long-horizon computational tasks. PARC is built on a hierarchical multi-agent architecture that incorporates task planning, execution, and a mechanism, termed self-assessment and self-feedback, that evaluates the agent's actions and their outcomes from an independent context and provides corrective feedback. This design enables PARC to detect and correct high-level strategic errors and sustain progress without human intervention. We evaluate PARC on computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium-ion conduction and alloy segregation; in particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end to end. In Kaggle-based experiments, starting from minimal natural-language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human-engineered baselines. These results highlight the potential of integrating a hierarchical multi-agent system with self-assessment and self-feedback to enable AI systems capable of independent, large-scale scientific and analytical work.
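As a rough illustration of the plan-execute-assess loop the abstract describes, the sketch below shows how an independent assessment step can gate progress and trigger retries. All names (`plan`, `execute`, `assess`) and the retry logic are assumptions for illustration, not PARC's actual interfaces; a real system would back each function with an LLM call.

```python
# Hypothetical sketch of a PARC-style plan / execute / self-assess loop.
# Function names and retry logic are illustrative, not PARC's actual API.
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    result: str = ""
    ok: bool = False


def plan(task: str) -> list[Step]:
    # A real planner would call an LLM; here we return fixed toy steps.
    return [Step("analyze data"), Step("run simulation"), Step("summarize")]


def execute(step: Step) -> Step:
    step.result = f"output of {step.action}"  # stand-in for real execution
    return step


def assess(step: Step) -> tuple[bool, str]:
    # Independent evaluator context: judge the outcome, return feedback.
    return True, "looks consistent"


def run(task: str, max_retries: int = 2) -> list[Step]:
    steps = plan(task)
    for step in steps:
        for _ in range(1 + max_retries):
            execute(step)
            step.ok, feedback = assess(step)
            if step.ok:
                break
            step.action += f" (revised: {feedback})"  # fold feedback back in
    return steps
```

The key design point is that `assess` runs in a context separate from `execute`, so strategic errors are caught by a judge that did not produce the output.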
CoMind: Towards Community-Driven Agents for Machine Learning Engineering
Li, Sijie, Sun, Weiwei, Li, Shanda, Talwalkar, Ameet, Yang, Yiming
Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent's ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a multi-agent system designed to actively integrate external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.
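The iterative parallel exploration mechanism can be pictured as a breadth-versus-depth loop: each round expands several candidates in parallel, then prunes to the strongest few. The toy sketch below uses an invented numeric "score" and random mutation as stand-ins for implementing and evaluating real solutions; it is not CoMind's code.

```python
# Toy sketch of iterative parallel exploration: widen, then prune.
# Scoring and mutation are invented stand-ins, not CoMind's implementation.
import random


def mutate(solution: float) -> float:
    # Stand-in for "implement a new idea on top of this solution".
    return solution + random.uniform(-0.1, 0.5)


def explore(rounds: int = 5, width: int = 4, keep: int = 2) -> float:
    random.seed(0)
    pool = [0.0]  # initial baseline solution's score
    for _ in range(rounds):
        # Breadth: expand every surviving solution into `width` candidates.
        candidates = [mutate(s) for s in pool for _ in range(width)]
        # Depth: keep only the best `keep` solutions for further refinement.
        pool = sorted(pool + candidates, reverse=True)[:keep]
    return max(pool)
```

Because the previous pool is retained before pruning, the best score never regresses across rounds, which is one simple way to trade breadth for depth safely.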
Scientific Algorithm Discovery by Augmenting AlphaEvolve with Deep Research
Liu, Gang, Zhu, Yihan, Chen, Jie, Jiang, Meng
Large language models hold promise as scientific assistants, yet existing agents either rely solely on algorithm evolution or on deep research in isolation, both of which face critical limitations. Pure algorithm evolution, as in AlphaEvolve, depends only on the internal knowledge of LLMs and quickly plateaus in complex domains, while pure deep research proposes ideas without validation, resulting in unrealistic or unimplementable solutions. We present DeepEvolve, an agent that integrates deep research with algorithm evolution, uniting external knowledge retrieval, cross-file code editing, and systematic debugging under a feedback-driven iterative loop. Each iteration not only proposes new hypotheses but also refines, implements, and tests them, avoiding both shallow improvements and unproductive over-refinements. Across nine benchmarks in chemistry, mathematics, biology, materials, and patents, DeepEvolve consistently improves the initial algorithm, producing executable new algorithms with sustained gains. By bridging the gap between unguided evolution and research without grounding, DeepEvolve provides a reliable framework for advancing scientific algorithm discovery. Our code is available at https://github.com/liugangcode/deepevolve.
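The feedback-driven loop of proposing, implementing, and testing hypotheses might be sketched as below. Here `retrieve_ideas`, `apply_idea`, and the scoring function are hypothetical stand-ins for deep research, code editing, and benchmark evaluation; they are not DeepEvolve's actual interfaces.

```python
# Minimal sketch, under assumed interfaces, of a DeepEvolve-style loop:
# retrieve external ideas, apply one to the current algorithm, test it,
# and keep the change only if the benchmark score improves.
def retrieve_ideas(topic: str) -> list[str]:
    # Stand-in for deep research; a real agent would search the literature.
    return ["add caching", "vectorize inner loop", "tune step size"]


def apply_idea(algorithm: dict, idea: str) -> dict:
    # Stand-in for cross-file code editing: record the idea as applied.
    return {**algorithm, "ideas": algorithm["ideas"] + [idea]}


def score(algorithm: dict) -> float:
    # Toy benchmark: each validated idea adds one point.
    return float(len(algorithm["ideas"]))


def evolve(topic: str, iterations: int = 3) -> dict:
    best = {"ideas": []}
    for idea in retrieve_ideas(topic)[:iterations]:
        candidate = apply_idea(best, idea)
        if score(candidate) > score(best):  # test before accepting
            best = candidate
    return best
```

The gate on `score` is what separates this loop from pure deep research: an idea that does not survive implementation and testing is simply dropped.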
ML2B: Multi-Lingual ML Benchmark For AutoML
Trofimova, Ekaterina, Shamina, Zosia, Selifanova, Maria, Zaitsev, Artem, Savchuk, Remi, Minets, Maxim, Ozerova, Daria, Sataev, Emil, Zuenko, Denis, Ustyuzhanin, Andrey E.
Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal a substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.
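A 15-45% degradation figure of this kind is naturally read as the relative drop from the English baseline. A minimal sketch of that arithmetic, with made-up per-language scores (the numbers below are not ML2B's results):

```python
# Relative performance drop versus an English baseline.
# The example scores are invented for illustration only.
def degradation(english: float, other: float) -> float:
    """Percent drop of `other` relative to the `english` baseline."""
    return 100.0 * (english - other) / english


scores = {"en": 0.80, "de": 0.68, "hi": 0.44}  # hypothetical benchmark scores
drops = {lang: degradation(scores["en"], s)
         for lang, s in scores.items() if lang != "en"}
```

With these illustrative numbers, the German and Hindi drops land at the two ends of a 15-45% range.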
c7bf0b7c1a86d5eb3be2c722cf2cf746-AuthorFeedback.pdf
We thank all the reviewers for their feedback. To address R1's concerns, we will attempt to separate this contribution from the broader conceptual review. Thank you for raising these valuable points. After updating our experiments per R4's request, we will clarify that the Sharing dataset was a Kaggle competition and that we used a GBM for our Credit dataset. The game-theoretic solution is the Shapley value [24] (R3).
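For context on the solution concept cited in the response: the Shapley value assigns each player the average of its marginal contributions over all orders in which the coalition can be assembled. A small self-contained example (the three-player characteristic function below is invented for illustration):

```python
# Exact Shapley values for a toy cooperative game, computed by averaging
# marginal contributions over all player orderings. Illustrative only.
from itertools import permutations


def shapley(players: list[str], value) -> dict[str, float]:
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition: list[str] = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            # Marginal contribution of p given who joined before it.
            phi[p] += value(frozenset(coalition)) - before
    return {p: total / len(orders) for p, total in phi.items()}


# Toy characteristic function: the grand coalition is worth 1, any pair 0.5.
def v(s: frozenset) -> float:
    return {3: 1.0, 2: 0.5}.get(len(s), 0.0)
```

By symmetry of this toy game, each player receives 1/3, and the values sum to the grand-coalition worth (efficiency).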
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents
Cai, Yifu, Li, Xinyu, Goswami, Mononito, Wiliński, Michał, Welter, Gus, Dubrawski, Artur
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
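The dual evaluation strategy (precise numeric checks for some artifacts, flexible LLM-based judgment for others) could look roughly like the dispatch below. The artifact fields and the keyword-based "judge" are illustrative stand-ins, not TimeSeriesGym's API; a real framework would route code and models to an actual LLM evaluator.

```python
# Sketch of dual evaluation: exact numeric grading for submission files,
# plus a stubbed LLM-style judgment for code. Names are assumptions.
def grade_submission(pred: list[float], truth: list[float],
                     tol: float = 1e-6) -> bool:
    # Precise numeric measure: lengths match and values agree within tol.
    return len(pred) == len(truth) and all(
        abs(p - t) <= tol for p, t in zip(pred, truth))


def grade_code(source: str) -> float:
    # Stand-in for an LLM judge: fraction of expected concepts present, 0..1.
    keywords = ["fit", "predict", "train"]
    return sum(k in source for k in keywords) / len(keywords)


def evaluate(artifacts: dict) -> dict:
    return {
        "submission_ok": grade_submission(artifacts["pred"], artifacts["truth"]),
        "code_score": grade_code(artifacts["code"]),
    }
```

The point of the split is that submission files admit an objective pass/fail, while code and models need a graded, contextual score.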
AIDE: AI-Driven Exploration in the Space of Code
Jiang, Zhengyao, Schmidt, Dominik, Srikanth, Dhruv, Xu, Dixing, Kaplan, Ian, Jacenko, Deniss, Wu, Yuxiang
Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind these advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI's MLE-Bench and METR's RE-Bench. The implementation of AIDE is publicly available at https://github.com/WecoAI/aideml.
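The tree-search framing can be sketched as greedy best-node expansion over candidate solutions: each step picks the most promising node and "refines" it into a child. The scoring and refinement below are toy stand-ins (a real agent would draft and validate code), not AIDE's implementation.

```python
# Toy sketch of trial-and-error as tree search over solution scores.
# Refinement and scoring are invented stand-ins, not AIDE's code.
import random


def refine(score: float) -> float:
    # A refined draft may improve or regress relative to its parent.
    return score + random.uniform(-0.2, 0.3)


def tree_search(steps: int = 10) -> float:
    random.seed(0)
    tree = [0.0]  # root: the initial solution's validation score
    for _ in range(steps):
        parent = max(tree)            # greedily expand the best node so far
        tree.append(refine(parent))   # child solution from refinement
    return max(tree)
```

Because every node stays in the tree, a regressing refinement never loses the best solution found so far; that retention is what lets extra compute be traded for performance.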