

Why the World's Best AI Systems Are Still So Bad at Pokémon

TIME - Tech

Pillay is an editorial fellow at TIME. Right now, live on Twitch, you can watch three of the world's smartest AI systems (GPT 5.2, Claude Opus 4.5, and Gemini 3 Pro) doing their best to beat classic Pokémon games. At least by human standards, they are not very good. The systems are slow, overconfident, and often confused.


HARNESS: Human-Agent Risk Navigation and Event Safety System for Proactive Hazard Forecasting in High-Risk DOE Environments

Elgedawy, Ran, Das, Sanjay, Seefried, Ethan, Wiggins, Gavin, Burchfield, Ryan, Hewit, Dana, Srinivasan, Sudarshan, Thomas, Todd, Balaprakash, Prasanna, Ghosal, Tirthankar

arXiv.org Artificial Intelligence

Operational safety at mission-critical work sites is a top priority given the complex and hazardous nature of daily tasks. This paper presents the Human-Agent Risk Navigation and Event Safety System (HARNESS), a modular AI framework designed to forecast hazardous events and analyze operational risks in U.S. Department of Energy (DOE) environments. HARNESS integrates Large Language Models (LLMs) with structured work data, historical event retrieval, and risk analysis to proactively identify potential hazards. A human-in-the-loop mechanism allows subject matter experts (SMEs) to refine predictions, creating an adaptive learning loop that enhances performance over time. By combining SME collaboration with iterative agentic reasoning, HARNESS improves the reliability and efficiency of predictive safety systems. Preliminary deployment shows promising results, with future work focusing on quantitative evaluation of accuracy, SME agreement, and decision latency reduction.


deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM-Augmented Harnesses

Androutsopoulos, Georgios, Bianchi, Antonio

arXiv.org Artificial Intelligence

Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library. Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. We evaluated deepSURF on 63 real-world Rust crates, successfully rediscovering 30 known memory safety bugs and uncovering 12 previously-unknown vulnerabilities (out of which 11 have been assigned RustSec IDs and 3 have been patched), demonstrating clear improvements over state-of-the-art tools.
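The generics-substitution idea in deepSURF can be sketched as follows. This is a minimal illustration, not the tool's actual code: all names (`lib_max`, `FuzzedItem`, `harness`) are hypothetical, and the assumption is that a generic library API gets instantiated with a custom concrete type whose trait implementation is driven by fuzzer-controlled bytes, so the fuzzer can simulate user-defined behavior inside the library under test.

```rust
use std::cmp::Ordering;

// Stand-in for a generic API exported by the fuzzed crate.
fn lib_max<T: Ord>(items: &[T]) -> Option<&T> {
    items.iter().max()
}

// Custom type substituted for the generic parameter; its ordering
// comes entirely from fuzzer-controlled bytes.
#[derive(Debug, PartialEq, Eq)]
struct FuzzedItem {
    key: u8,
}

// Tailored implementations generated for the traits the API requires.
impl Ord for FuzzedItem {
    fn cmp(&self, other: &Self) -> Ordering {
        self.key.cmp(&other.key)
    }
}
impl PartialOrd for FuzzedItem {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Harness entry point: in a real setup this body would sit inside a
// cargo-fuzz `fuzz_target!` macro; here it is a plain function for clarity.
fn harness(data: &[u8]) {
    let items: Vec<FuzzedItem> = data.iter().map(|&b| FuzzedItem { key: b }).collect();
    let _ = lib_max(&items);
}

fn main() {
    // Feed one example input through the harness.
    harness(&[3, 1, 4, 1, 5]);
}
```

In a real campaign the fuzzer, not a fixed byte slice, supplies `data`, and the substituted type's trait methods would be richer than a byte comparison, but the shape of the harness is the same.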


Could a self-monitoring system for criminals replace prisons one day?

New Scientist

Future Chronicles is our regular speculative look at inventions yet to come. In this latest installment, we journey to 2050, when technology had been developed so that criminals could be monitored at home. "It's no surprise that the first countries to abolish prisons were Scandinavian." In the 2020s, the US was spending an eye-watering $182 billion a year on locking up its citizens. No other country imprisoned as many people or spent as much in doing so.


Feline stressed: Experts urge cat owners NOT to take their pets out for walks on trendy harnesses - amid fears they leave kitties feeling 'scared'

Daily Mail - Science & tech

A cat charity has urged owners not to use trendy harnesses on their cats, amid fears they leave felines feeling 'scared'.


Orion: Fuzzing Workflow Automation

Bazalii, Max, Fleischer, Marius

arXiv.org Artificial Intelligence

Fuzz testing is one of the most effective techniques for finding software vulnerabilities. While modern fuzzers can generate inputs and monitor executions automatically, the overall workflow, from analyzing a codebase, to configuring harnesses, to triaging results, still requires substantial manual effort. Prior attempts focused on single stages such as harness synthesis or input minimization, leaving researchers to manually connect the pieces into a complete fuzzing campaign. We introduce Orion, a framework that automates the manual bottlenecks of fuzzing by integrating LLM reasoning with traditional tools, allowing campaigns to scale to settings where human effort alone was impractical. Orion uses LLMs for code reasoning and semantic guidance, while relying on deterministic tools for verification, iterative refinement, and tasks that require precision. Across our benchmark suite, Orion reduces human effort by 46-204x depending on the workflow stage, and we demonstrate its effectiveness through the discovery of two previously unknown vulnerabilities in the widely used open-source clib library.


PentestJudge: Judging Agent Behavior Against Operational Requirements

Caldwell, Shane, Harley, Max, Kouremetis, Michael, Abruzzo, Vincent, Pearce, Will

arXiv.org Artificial Intelligence

We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent's actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores. We evaluate several frontier and open-source models acting as judge agents, with the best model reaching an F1 score of 0.83. We find models that are better at tool-use perform more closely to human experts. By stratifying the F1 scores by requirement type, we find even models with similar overall scores struggle with different types of questions, suggesting certain models may be better judges of particular operating criteria. We find that weaker and cheaper models can judge the trajectories of pentests performed by stronger and more expensive models, suggesting verification may be easier than generation for the penetration testing task. We share this methodology to facilitate future research in understanding the ability of judges to holistically and scalably evaluate the process quality of AI-based information security agents so that they may be confidently used in sensitive production environments.
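The hierarchical rubric described above can be sketched as a small tree structure. This is a hypothetical illustration, not PentestJudge's actual code: the `Rubric` type, the example criteria, and the count-passed-leaves roll-up are all assumptions made for clarity (the abstract does not specify how leaf answers aggregate).

```rust
// A rubric tree: task nodes collapse into yes/no leaf criteria
// that the LLM judge answers.
enum Rubric {
    // A leaf criterion the judge answers yes (true) or no (false).
    Leaf { criterion: &'static str, passed: bool },
    // A task node broken down into smaller sub-rubrics.
    Node { name: &'static str, children: Vec<Rubric> },
}

// Roll up (passed, total) leaf counts over the whole tree.
fn score(r: &Rubric) -> (u32, u32) {
    match r {
        Rubric::Leaf { passed, .. } => (*passed as u32, 1),
        Rubric::Node { children, .. } => children
            .iter()
            .map(score)
            .fold((0, 0), |(p, t), (cp, ct)| (p + cp, t + ct)),
    }
}

fn main() {
    // Example sub-tree for one operational-security task node.
    let rubric = Rubric::Node {
        name: "operational security",
        children: vec![
            Rubric::Leaf { criterion: "avoided noisy port scans", passed: true },
            Rubric::Leaf { criterion: "cleaned up dropped tooling", passed: false },
        ],
    };
    let (passed, total) = score(&rubric);
    println!("{passed}/{total} criteria met");
}
```

Leaf answers judged this way can then be compared against human expert labels to compute the binary classification metrics (such as F1) that the paper reports.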


Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

Pinckney, Nathaniel, Deng, Chenhui, Ho, Chia-Tung, Tsai, Yun-Da, Liu, Mingjie, Zhou, Wenfei, Khailany, Brucek, Ren, Haoxing

arXiv.org Artificial Intelligence

We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation uses open-source tools and model scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. CVDP reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.


As the US and China lock horns, Malaysia hopes to harness an AI revolution

Al Jazeera

Kulim, Malaysia – When tech giant AT&S decided a few years ago that it needed to ramp up production to keep pace with the artificial intelligence (AI) boom, it did not look to its largest manufacturing facilities in China. The Austrian firm's plants in Chongqing and Shanghai – opened in 2022 and 2016, respectively – employ some 9,000 workers between them, churning out high-end components used in everything from consumer electronics to cars. But AT&S was at the same time coming to grips with the risks of concentrating production in one country. Like many tech firms grappling with the disruption of the COVID-19 pandemic and the trade war salvoes between the United States and China, AT&S decided it needed to diversify its supply chains. Malaysia quickly emerged at the top of the company's list of potential locations for its next plant.


KHAIT: K-9 Handler Artificial Intelligence Teaming for Collaborative Sensemaking

Wilchek, Matthew, Wang, Linhan, Dickinson, Sally, Feuerbacher, Erica, Luther, Kurt, Batarseh, Feras A.

arXiv.org Artificial Intelligence

In urban search and rescue (USAR) operations, communication between handlers and specially trained canines is crucial but often complicated by challenging environments and the specific behaviors canines are trained to exhibit when detecting a person. Since a USAR canine often works out of sight of the handler, the handler lacks awareness of the canine's location and situation, known as the 'sensemaking gap.' In this paper, we propose KHAIT, a novel approach to close the sensemaking gap and enhance USAR effectiveness by integrating object detection-based Artificial Intelligence (AI) and Augmented Reality (AR). Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT enables precise and rapid object detection from a canine's perspective, improving survivor localization. We evaluate this approach in a real-world USAR environment, demonstrating an average survival allocation time decrease of 22%, enhancing the speed and accuracy of operations.