
ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

Chittepu, Yaswanth, Addanki, Raghavendra, Mai, Tung, Rao, Anup, Kveton, Branislav

arXiv.org Artificial Intelligence

The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating in-memory named-object management, allowing agents to flexibly name, save, and retrieve intermediate results throughout their workflows. We demonstrate that standard ReAct-style approaches struggle to generate valid tool sequences for complex ML pipelines, and that tree search methods with LLM-based evaluation underperform due to inconsistent state scoring. To address these limitations, we propose two simple approaches: 1) using shaped deterministic rewards with structured textual feedback, and 2) decomposing the original problem into a sequence of sub-tasks, which significantly improves trajectory validity and task performance. Using GPT-4o, our approach improves over ReAct by 16.52 percentile positions in the median across all Kaggle challenges. We believe our work provides a foundation for developing more capable tool-augmented planning ML agents.
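The named-object idea described in the abstract can be sketched in a few lines of Python. This is an illustrative toy, not the benchmark's actual API: the class and tool names are hypothetical, and it only shows how a planner might chain sub-tasks by reading and writing intermediate results under user-chosen names instead of passing raw objects through the LLM.

```python
# Hypothetical sketch of an in-memory named-object store for a tool-augmented
# agent. All names and signatures here are illustrative assumptions, not the
# interface of ML-Tool-Bench itself.

class ObjectStore:
    """In-memory registry mapping chosen names to intermediate results."""
    def __init__(self):
        self._objects = {}

    def save(self, name, obj):
        self._objects[name] = obj
        return f"saved '{name}' ({type(obj).__name__})"

    def load(self, name):
        if name not in self._objects:
            # A structured error message doubles as textual feedback for the agent.
            raise KeyError(f"no object named '{name}'")
        return self._objects[name]

def run_subtask(store, tool, input_names, output_name):
    """Run one tool on named inputs and save its result under output_name."""
    args = [store.load(n) for n in input_names]
    result = tool(*args)
    return store.save(output_name, result)

# Toy two-step pipeline: "load data" then "engineer a feature" as sub-tasks.
store = ObjectStore()
store.save("raw_rows", [1.0, 2.0, 3.0])
msg = run_subtask(store, lambda xs: [x * x for x in xs], ["raw_rows"], "squared")
print(store.load("squared"))  # [1.0, 4.0, 9.0]
```

Decomposing the pipeline this way keeps each tool call small and checkable, which is the property the paper exploits to improve trajectory validity.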


A Compliance-Preserving Retrieval System for Aircraft MRO Task Search

Jo, Byungho

arXiv.org Artificial Intelligence

Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.


Human-Adversarial Visual Question Answering (Supplementary Material): A. Training Details

Neural Information Processing Systems

We use a batch size of 64 for 236K updates with a multi-step learning rate scheduler with steps at 180K and 216K, a learning rate ratio of 0.2, and a warmup for 54K updates. Training takes an average of 8 hours; for another configuration, it takes an average of 17 hours. We set the batch size to 8 and the weight decay to 1e-4, and train the model on 8 GPUs for 2 days. We train with an MLM loss using a batch size of 64, which takes an average of 13 hours.
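The schedule described above (warmup for 54K updates, then multiplicative decay by 0.2 at updates 180K and 216K) can be sketched as a small Python function. The base learning rate is an assumed placeholder, since the supplement does not state it here; this is a minimal illustration of the shape of the schedule, not the authors' training code.

```python
# Minimal sketch of the described schedule: linear warmup for 54K updates,
# then step decay by a ratio of 0.2 at the 180K and 216K milestones.
# base_lr is an assumed placeholder value.

def lr_at(step, base_lr=1e-4, warmup=54_000,
          milestones=(180_000, 216_000), ratio=0.2):
    if step < warmup:
        return base_lr * step / warmup           # linear warmup
    scale = ratio ** sum(step >= m for m in milestones)
    return base_lr * scale                       # step decay at each milestone

print(lr_at(27_000))   # mid-warmup: half the base LR
print(lr_at(100_000))  # plateau at the base LR
print(lr_at(200_000))  # after the first milestone: base_lr * 0.2
```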


Preview, Accept or Discard? A Predictive Low-Motion Interaction Paradigm

Berengueres, Jose

arXiv.org Artificial Intelligence

Repetitive strain injury (RSI) affects roughly one in five computer users and remains largely unresolved despite decades of ergonomic mouse redesign. All such devices share a fundamental limitation: they still require fine-motor motion to operate. This work investigates whether predictive, AI-assisted input can reduce that motion by replacing physical pointing with ranked on-screen suggestions. To preserve user agency, we introduce Preview Accept Discard (PAD), a zero-click interaction paradigm that lets users preview predicted GUI targets, cycle through a small set of ranked alternatives, and accept or discard them via key-release timing. We evaluate PAD in two settings: a browser-based email client and an ISO 9241-9 keyboard-prediction task under varying top-3 accuracies. Across both studies, PAD substantially reduces hand motion relative to trackpad use, while matching trackpad task times only when prediction accuracies are similar to those of the best spell-checkers.


iOS 26 adds a new app to your iPhone. Here's how to use it.

Popular Science

You're not imagining it--there is a new app on your iPhone. Apple's big iOS 26 software update for 2025 has now reached millions of iPhones, and brought with it a bunch of new features and an updated visual interface.


Unlock this AI feature in Firefox and never fall for a scam link again

PCWorld

AI-powered link previews are a great way to see ahead so you don't end up clicking on malicious links. Starting with version 138 (released back in April), Firefox has had a new-yet-still-deactivated option that uses "artificial intelligence" to display a mini preview of the destination page for a link. The feature determines the content of the page in question and displays a pop-up, and this preview can help to avoid potential scams and malware when navigating unsolicited links. The AI feature works locally on your PC and, according to Mozilla, doesn't use a cloud service.


Oops! Google's unannounced new Nest Cams spotted in Google Home app

PCWorld

The big smart home manufacturers have been leaking like sieves of late, giving us juicy early previews of their super-secret upcoming releases. Philips Hue recently fell victim to its own leak that revealed its entire fall product lineup, and now Google appears to have unwittingly shared images of its new Nest cam hardware. First, a quick recap: Google had already teased--intentionally--a new Gemini smart speaker during its Pixel event a couple of weeks back, and just days ago it promised an upcoming Google Home update on October 1, complete with a partial image of what appears to be a new Nest camera. Instead, it seems Google may have inadvertently left images of the new Nest hardware in the Google Home app following a recent update. The images, which were spotted by Android Authority and appear to have been subsequently yanked from the app, don't reveal anything startlingly new about the Nest cams, aside from the fact that they exist.


AHELM: A Holistic Evaluation of Audio-Language Models

Lee, Tony, Tu, Haoqin, Wong, Chi Heem, Wang, Zijun, Yang, Siwei, Mai, Yifan, Zhou, Yuyin, Xie, Cihang, Liang, Percy

arXiv.org Artificial Intelligence

Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.



SciDA: Scientific Dynamic Assessor of LLMs

Zhou, Junting, Miao, Tingjia, Liao, Yiyan, Wang, Qichao, Wen, Zhoufutu, Wang, Yanqin, Huang, Yunjie, Yan, Ge, Wang, Leqi, Xia, Yucheng, Gao, Hongwan, Zeng, Yuansong, Zheng, Renjie, Dun, Chen, Liang, Yitao, Yang, Tong, Huang, Wenhao, Zhang, Ge

arXiv.org Artificial Intelligence

Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems with greater efficacy. A high-quality benchmark for comprehensive and appropriate assessment is therefore important, yet existing ones either face the risk of data contamination or lack disciplinary coverage. Specifically, because the data sources for LLM training and static benchmarks overlap, answer keys or number patterns can be inadvertently memorized (i.e., data contamination), leading to systematic overestimation of reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark consisting exclusively of over 1k Olympiad-level numerical computation problems, which allows randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with top-performing closed-source and open-source LLMs, and observe that their performance drops significantly under random numerical initialization. Thus, we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at https://huggingface.co/datasets/m-a-p/SciDA
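The randomized-initialization idea can be illustrated with a toy problem template. This sketch is an assumption about the general technique, not SciDA's actual implementation: each evaluation round samples fresh parameter values for a templated question and recomputes the ground-truth answer with a solver, so a model cannot score well by recalling memorized answer values.

```python
# Illustrative sketch (not SciDA's code) of randomized numerical
# initialization for a benchmark problem template.
import random

def instantiate(template, param_ranges, rng):
    """Fill a problem template with freshly sampled integer parameters."""
    params = {k: rng.randint(lo, hi) for k, (lo, hi) in param_ranges.items()}
    return template.format(**params), params

def reference_answer(params):
    """Ground-truth solver for this toy template: a*b + c."""
    return params["a"] * params["b"] + params["c"]

rng = random.Random(0)  # seed per evaluation round
template = "Compute a*b + c for a={a}, b={b}, c={c}."
question, params = instantiate(
    template, {"a": (2, 9), "b": (2, 9), "c": (1, 99)}, rng
)
print(question, "->", reference_answer(params))
```

Because the answer is recomputed from the sampled parameters each round, any gap between a model's score on fixed and on randomized instances directly measures reliance on memorized numbers.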