

Google Shakes Up Its Browser Agent Team Amid OpenClaw Craze

WIRED

As Silicon Valley obsesses over a new wave of AI coding agents, Google and other AI labs are shifting their bets. Google is shaking up the team behind Project Mariner, its AI agent that can navigate the Chrome browser and complete tasks on a user's behalf, WIRED has learned. In recent months, some Google Labs staffers who worked on the research prototype have moved on to higher-priority projects, according to two people familiar with the matter. A Google spokesperson confirmed the changes, but said the computer use capabilities developed under Project Mariner will be incorporated into the company's agent strategy moving forward. Google has already folded some of these capabilities into other agent products, including the recently launched Gemini Agent, the spokesperson added.


Vibe coding apps taught me how hard real coding is

PCWorld

PCWorld explores the reality of "vibe coding" with AI tools, where the author attempted to build four apps using Claude Code and Google's Antigravity. Only one, a Docker Swarm dashboard, succeeded after a week of effort, while three attempted OpenClaw replications failed due to vague prompts and poor planning. The experience reveals that AI-assisted development still requires significant human creativity, detailed blueprints, and specific instructions to avoid "garbage in, garbage out" results. Like so many others, I jumped onto the vibe coding bandwagon, entranced by the idea of building my own incredibly useful apps with nothing but an AI prompt. Over the course of about six weeks, I did manage to build my own apps: four of them, to be precise.


AI Agents Are Taking America by Storm

The Atlantic - Technology

The post-chatbot era has begun. Americans are living in parallel AI universes. For much of the country, AI has come to mean ChatGPT, Google's AI overviews, and the slop that now clogs social-media feeds. Meanwhile, tech hobbyists are becoming radicalized by bots that can work for hours on end, collapsing months of work into weeks, or weeks into an afternoon. Recently, more people have started to play around with tools such as Claude Code.


Rules fail at the prompt, succeed at the boundary

MIT Technology Review

From the Gemini Calendar prompt-injection attack of 2026 to the September 2025 state-sponsored hack using Anthropic's Claude Code as an automated intrusion engine, the coercion of human-in-the-loop agentic actions and fully autonomous agentic workflows is the new attack vector for hackers. In the Anthropic case, roughly 30 organizations across tech, finance, manufacturing, and government were affected. Anthropic's threat team assessed that the attackers used AI to carry out 80% to 90% of the operation: reconnaissance, exploit development, credential harvesting, lateral movement, and data exfiltration, with humans stepping in only at a handful of key decision points. This was not a lab demo; it was a live espionage campaign. The attackers hijacked an agentic setup (Claude Code plus tools exposed via the Model Context Protocol (MCP)) and jailbroke it by decomposing the attack into small, seemingly benign tasks and telling the model it was doing legitimate penetration testing. The same loop that powers developer copilots and internal agents was repurposed as an autonomous cyber-operator.
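The fix the headline argues for can be made concrete: enforce the rule in ordinary code at the tool boundary, where no amount of prompt manipulation can reach it. The sketch below is illustrative only; the tool names, allowlist patterns, and dispatch stub are assumptions, not Anthropic's implementation or the actual MCP API.

```python
# Minimal sketch of enforcing rules at the tool boundary instead of in the
# prompt. Tool names, allowlist patterns, and the dispatch stub are
# illustrative assumptions, not Anthropic's implementation or the MCP API.
import fnmatch

# Deny-by-default: a tool call is only executed if it matches an entry here.
ALLOWED_CALLS = {
    "read_file": ["/workspace/*"],   # reads confined to the workspace
    "run_tests": ["*"],              # the test runner is unrestricted
}

class ToolDenied(Exception):
    pass

def guard(tool_name: str, argument: str) -> None:
    """Raise ToolDenied unless (tool, argument) matches the allowlist.

    This check runs in ordinary code outside the model, so a jailbroken or
    prompt-injected agent cannot talk its way past it.
    """
    patterns = ALLOWED_CALLS.get(tool_name)
    if patterns is None:
        raise ToolDenied(f"tool {tool_name!r} is not exposed")
    if not any(fnmatch.fnmatch(argument, p) for p in patterns):
        raise ToolDenied(f"{tool_name}({argument!r}) blocked by policy")

def dispatch(tool_name: str, argument: str) -> str:
    guard(tool_name, argument)                    # boundary check, every call
    return f"executed {tool_name} on {argument}"  # stand-in for the real tool

if __name__ == "__main__":
    print(dispatch("read_file", "/workspace/app.py"))  # allowed
    try:
        dispatch("read_file", "/etc/shadow")           # denied at the boundary
    except ToolDenied as err:
        print(err)
```

Because guard() runs outside the model, decomposing a malicious task into small, benign-looking steps, as the attackers did, does not help: each step still has to pass the same boundary check.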


How Claude Code Is Reshaping Software (and Anthropic)

WIRED

WIRED spoke with Boris Cherny, head of Claude Code, about how the viral coding tool is changing the way Anthropic works. Engineers in Silicon Valley have been raving about Anthropic's AI coding tool, Claude Code, for months. But recently, the buzz feels as if it's reached a fever pitch. Earlier this week, I sat down with Boris Cherny, head of Claude Code, to try to understand how the company is meeting this moment. "We built the simplest possible thing," said Cherny. "The craziest thing was learning three months ago that half of the sales team at Anthropic uses Claude Code every week."


AI Is Moving Beyond Chatbots. Claude Cowork Shows What Comes Next

TIME - Tech

AI Is Moving Beyond Chatbots. The DNA file had been gathering dust in Pietro Schirano's computer for years. Then, earlier this month, he gave it to Claude Code, an "agentic coding tool" developed by Anthropic, for analysis. "I'm attaching my raw DNA file from Ancestry DNA," he told the tool. The AI spawned copies of itself on Schirano's computer, each one simulating an expert in a different part of the genome: one expert on cardiovascular disease, another on aging, a third on autoimmune disease.
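The fan-out pattern TIME describes, many specialist copies working over one input, is straightforward to sketch. In the snippet below, ask_expert is a hypothetical stand-in for a model call; Claude Code's real subagent mechanism is internal to the tool and is not what is shown here.

```python
# Sketch of the fan-out described above: several "expert" analyses of the
# same file run in parallel, results collected at the end. ask_expert() is a
# hypothetical stand-in for a model call, not Claude Code's subagent API.
from concurrent.futures import ThreadPoolExecutor

EXPERTS = ["cardiovascular disease", "aging", "autoimmune disease"]

def ask_expert(specialty: str, dna_text: str) -> str:
    # Placeholder: a real setup would call a model with a specialty-specific
    # system prompt plus the raw data, and return its findings.
    return f"[{specialty}] reviewed {len(dna_text)} characters of raw DNA"

def analyze(path: str) -> dict[str, str]:
    dna_text = open(path, encoding="utf-8").read()
    # One worker per specialty, all reading the same input concurrently.
    with ThreadPoolExecutor(max_workers=len(EXPERTS)) as pool:
        futures = {s: pool.submit(ask_expert, s, dna_text) for s in EXPERTS}
    return {specialty: f.result() for specialty, f in futures.items()}
```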


Move Over, ChatGPT

The Atlantic - Technology

You are about to hear a lot more about Claude Code. Over the holidays, Alex Lieberman had an idea: What if he could create Spotify "Wrapped" for his text messages? Without writing a single line of code, Lieberman, a co-founder of the media outlet Morning Brew, created "iMessage Wrapped", a web app that analyzed statistical trends across nearly 1 million of his texts. One chart that he showed me compared his use of different emoji. Another listed people he had ghosted.
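Apps like iMessage Wrapped typically work by reading macOS's local message store, a SQLite database at ~/Library/Messages/chat.db. A minimal sketch of the kind of aggregate query involved is below; the message and handle tables are the commonly documented schema and can vary across macOS versions, so treat this as an assumption rather than the app's actual code.

```python
# Sketch of the kind of aggregation an "iMessage Wrapped" app runs. The
# message/handle schema below is the commonly documented one and may differ
# between macOS versions; macOS may also require granting the terminal
# Full Disk Access before this file is readable.
import sqlite3
from pathlib import Path

DB = Path.home() / "Library" / "Messages" / "chat.db"

def top_contacts(limit: int = 10) -> list[tuple[str, int]]:
    """Return the contacts with the most messages exchanged."""
    conn = sqlite3.connect(DB)
    rows = conn.execute(
        """
        SELECT handle.id, COUNT(*) AS n
        FROM message
        JOIN handle ON message.handle_id = handle.ROWID
        GROUP BY handle.id
        ORDER BY n DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for contact, count in top_contacts():
        print(f"{contact}: {count} messages")
```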


EvilGenie: A Reward Hacking Benchmark

Gabor, Jonathan, Lynch, Jayson, Rosenfeld, Jonathan

arXiv.org Artificial Intelligence

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
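Of the three detection methods the abstract lists, test file edit detection is the most mechanical to reproduce: hash every test file before the agent runs, then flag anything that changed. The sketch below shows one plausible way to do that; the tests/ layout and function names are assumptions, not EvilGenie's actual code, which lives at the linked repository.

```python
# Sketch of test file edit detection, one of the three reward-hacking
# signals the abstract describes. The tests/ directory layout is an
# assumption, not EvilGenie's actual implementation.
import hashlib
from pathlib import Path

def snapshot(test_dir: str = "tests") -> dict[Path, str]:
    """Hash every test file so later edits can be detected."""
    return {
        p: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(test_dir).rglob("*.py")
    }

def edited_test_files(before: dict[Path, str]) -> list[Path]:
    """Return test files modified, created, or deleted since the snapshot."""
    after = snapshot()
    changed = [
        p for p, digest in after.items()
        if before.get(p) != digest          # modified or newly created
    ]
    deleted = [p for p in before if p not in after]
    return changed + deleted

# Usage: before = snapshot(); run_agent(); flagged = edited_test_files(before)
# where run_agent() is whatever launches the coding agent under test.
```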


AI firm claims it stopped Chinese state-sponsored cyber-attack campaign

The Guardian

Anthropic says its coding tool, Claude Code, was manipulated to attack 30 entities. Anthropic says financial firms and government agencies were attacked "largely without human intervention". A leading artificial intelligence company claims to have stopped a China-backed "cyber espionage" campaign that was able to infiltrate financial firms and government agencies with almost no human oversight. The US-based Anthropic said its coding tool, Claude Code, was "manipulated" by a Chinese state-sponsored group to attack 30 entities around the world in September, achieving a "handful of successful intrusions". This was a "significant escalation" from previous AI-enabled attacks it monitored, it wrote in a blogpost on Thursday, because Claude acted largely independently: 80 to 90% of the operations involved in the attack were performed without a human in the loop.


SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Xu, Jingxuan, Deng, Ken, Li, Weihao, Yu, Songwei, Tang, Huaixi, Huang, Haoyang, Lai, Zhiyi, Zhan, Zizheng, Wu, Yanan, Zhang, Chenchen, Lei, Kepeng, Yao, Yifan, Lei, Xinping, Zhu, Wenqiang, Feng, Zongxian, Li, Han, Xiong, Junqi, Li, Dailin, Gao, Zuchen, Wu, Kun, Xiang, Wen, Zhan, Ziqi, Zhang, Yuanxing, Gong, Wuxuan, Gao, Ziyuan, Wang, Guanxiang, Xue, Yirong, Li, Mengtong, Xie, Mengfei, Zhang, Xiaojiang, Wang, Jinghui, Zhuang, Wenhao, Lin, Zheng, Wang, Huiming, Zhang, Zhaoxiang, Zhang, Yuqun, Zhang, Haotian, Chen, Bin, Liu, Jiaheng

arXiv.org Artificial Intelligence

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.