 Software



'Pretty Crazy' Token Usage Is Testing Bosses' Bet on AI

WIRED

A Silicon Valley software maker and an ecommerce company reveal to WIRED how they are navigating the emerging challenge of "tokenomics." At the software company 8x8, employees are using Anthropic's Claude to draft emails, analyze customer feedback, and write code, but so far, their growing reliance on the artificial intelligence chatbot hasn't troubled the finance team. While other Silicon Valley companies, such as Meta, Uber, and Salesforce, have publicly expressed concerns about the growing cost of generative AI tools and have begun introducing usage caps in some cases, 8x8 says it finds itself in the black. Over the past 18 months, the company estimates it has saved about $5 million in annual costs by canceling subscriptions to dozens of software and educational tools it deemed unnecessary in part because Claude could provide similar capabilities. So far, 8x8's annualized bill for Claude is "well below" that figure, says Joel Neeb, the company's chief transformation and business operations officer.


VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video

Neural Information Processing Systems

Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to 20x longer than those in other datasets. We show two important downstream applications of VideoCAD: (1) learning UI interactions from professional 3D CAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VideoCADFormer, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.


Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Neural Information Processing Systems

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset, JEDI, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on JEDI demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with JEDI directly enhances agentic capabilities of general foundation models on complex computer tasks with state-of-the-art performance, improving from 23% to 51% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces.


TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Neural Information Processing Systems

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.


New Moms Are Returning to Coding Jobs Radically Reshaped by AI

WIRED

New mothers working in software development are staring down an AI-pilled workplace they barely recognize. As Danielle settled into the rhythms of new motherhood, her profession underwent a drastic reinvention. Danielle, who asked to use her first name to avoid damaging her job prospects, worked as a software developer at a car company in Portland, Oregon. Before she left the workforce in mid-2024, barely anybody used AI to write code; by the time she was ready to return, a year later, it had become the expectation. Once upon a time, she had been drawn to coding for the job security it offered, but AI was threatening to upend that.


You're probably missing these 13 useful Google Chrome tools

PCWorld

PCWorld highlights 13 underutilized Google Chrome features that can significantly enhance browsing productivity and organization for billions of users. Key tools include tab groups for organization, cross-device syncing, Guest profiles for temporary use, and keyboard shortcuts like Ctrl+Shift+T to reopen closed tabs. These hidden features offer powerful customization options through Chrome flags, multiple user profiles, dark mode settings, and extension management for improved daily web interaction. Around two-thirds of all internet users use Google Chrome, according to StatCounter.


Demis Hassabis Thinks AI Job Cuts Are Dumb

WIRED

The CEO of Google DeepMind tells WIRED that companies should use the productivity gains of AI to do more, not lay people off. Demis Hassabis, the CEO of Google DeepMind, is keen to talk about the coding skills of his company's newest model, Gemini 3.5 Flash. The model has been trained to perform complex agentic coding tasks: translate large code bases from one language to another; find and fix bugs lurking deep in knotty code; and even write entire operating systems from scratch. Hassabis does not, however, think this spells doom for software developers. "I have no idea why people are going around talking with certainty about that," Hassabis tells WIRED ahead of the new model reveal at today's Google I/O event.


Gemini in Chrome arrives on Android devices in June

Engadget

Google is bringing Gemini in Chrome to Android devices. During the Android Show: I/O Edition livestream on Tuesday, the company announced that it would release the chatbot integration in June. Once it arrives, Android users will see a new Gemini icon at the top right of the toolbar. Tapping it will bring up a chat interface from the bottom of the screen. Despite the switch to a smaller form factor, the majority of Gemini in Chrome features Google offers on PCs are accounted for in this new release.


CUDA Proves Nvidia Is a Software Company

WIRED

There's a deep, forbidding moat that surrounds Nvidia, and it has nothing to do with hardware. Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I'm afraid I must talk about "moats." Popularized decades ago by Warren Buffett to refer to a company's competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled "We Have No Moat, and Neither Does OpenAI," fretted that open-source AI would pillage Big Tech's castle. A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models.