OPENCUA: Open Foundations for Computer-Use Agents
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OPENCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AGENTNET, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales.
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To address this gap, we propose Blink-Think-Link (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
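The three-phase decomposition described in the BTL abstract can be illustrated with a toy sketch. This is not the BTL-UI model: the names `UIElement`, `Action`, `blink`, `think`, `link`, and `btl_step` are hypothetical, and simple keyword overlap stands in for the learned attention and reasoning the abstract describes. It only shows how the output of each phase might feed the next.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# All names below are hypothetical illustrations, not taken from the paper.
Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class UIElement:
    label: str
    region: Region


@dataclass
class Action:
    kind: str
    x: int
    y: int


def _words(text: str) -> set:
    return set(text.lower().split())


def blink(elements: List[UIElement], instruction: str) -> List[UIElement]:
    """Blink: narrow attention to screen areas that look relevant.

    Assumption: keyword overlap stands in for learned visual attention.
    """
    query = _words(instruction)
    return [e for e in elements if query & _words(e.label)]


def think(candidates: List[UIElement], instruction: str) -> Optional[UIElement]:
    """Think: decide which attended element to act on (toy scoring rule)."""
    if not candidates:
        return None
    query = _words(instruction)
    return max(candidates, key=lambda e: len(query & _words(e.label)))


def link(target: UIElement) -> Action:
    """Link: emit an executable command, here a click at the region centre."""
    left, top, right, bottom = target.region
    return Action("click", (left + right) // 2, (top + bottom) // 2)


def btl_step(elements: List[UIElement], instruction: str) -> Optional[Action]:
    """Run one Blink -> Think -> Link pass; None if nothing relevant is found."""
    target = think(blink(elements, instruction), instruction)
    return link(target) if target is not None else None
```

Keeping the phases as separate functions mirrors the abstract's point that both the intermediate process (what was attended to) and the final outcome (the emitted action) can be inspected, which is what a reward driven by both process and outcome would need.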
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWORLD-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset JEDI, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on JEDI demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWORLD-G. Furthermore, we demonstrate that improved grounding with JEDI directly enhances agentic capabilities of general foundation models on complex computer tasks with state-of-the-art performance, improving from 23% to 51% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces.
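As a rough illustration of how grounding predictions are often scored, the sketch below counts a prediction as correct when the predicted click point falls inside the annotated target box. This is an assumption about a common scoring convention for grounding benchmarks, not a description of the OSWORLD-G evaluation code; `BBox` and `grounding_accuracy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BBox:
    """Axis-aligned target region in screen coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border count as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[BBox],
) -> float:
    """Fraction of predicted click points that land inside their target box.

    Assumption: one predicted (x, y) point per sample, one target box per sample.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)
```

A point-in-box rule rewards only the final location, so it says nothing about which of the abstract's skills (text matching, layout understanding, fine-grained manipulation) produced a hit or a miss; per-category breakdowns would need the task-type labels as well.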
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable capabilities, driving significant advancements across a wide range of applications. These models are typically fine-tuned to align with specific objectives, such as being "helpful and harmless" [39]. However, recent work on adversarial attacks has demonstrated that carefully crafted inputs can bypass these alignment safeguards [65, 10, 4, 26, 52]. While such adversarial attacks can elicit harmful responses, the output is usually constrained to text that is not directly actionable, limiting the scope of possible harm. While malicious text outputs are concerning, it remains unclear whether the associated risks exceed those posed by information already accessible through the internet [18].
Hands-On With Gemini Spark: I Gave It Access to My Life and It Friend-Zoned My Boyfriend
I Gave Gemini Spark Access to My Life. Google's new AI agent combed through my emails, documents, and calendar to plan a birthday party and still didn't clock the person most important to me. At its recent I/O developer conference, Google introduced Gemini Spark as an always-on agent that connects to your personal data, completes online tasks, and automates aspects of your daily interactions. It's Google's take on the viral OpenClaw agent that rocked Silicon Valley at the start of 2026. OpenClaw's early adopters handed their entire lives over to an AI agent for messaging and scheduling automation--sometimes with bot-induced mishaps causing embarrassing results.
Gotta catch an MP! Players 'debate' UK politicians in Pokémon-style game
Creator of Politidex hopes free online app will help humanise politics and act as a way of 'flipping the narrative'. The year is 2016 and Pokémon Go has taken over the world. People are wandering for miles on end, disrupting concerts, and even slamming into poles in their attempts to capture fantastical cartoon creatures. Ten years later, a new generation are flocking to another Pokémon-inspired game. Instead of Pikachu, Charizard and Blastoise, however, players are catching and training up their local politicians in order to build their own political parties. Some MPs are even catching themselves.
Keyboard Shortcuts I Learned From My Cat
Every time my cat Mira walks across a keyboard, I learn a few new Mac and PC keyboard shortcuts I never knew about. All cats love keyboards (but this is not a photo of my cat). My cat Mira is perfect, and has never done anything wrong. She also loves walking on laptop keys--both my MacBook and my wife Kathy's Windows PC. You might think that walking on laptops is an example of Mira doing something wrong. And, in any case, we've both learned a lot about how our computers work because of this.
Deepfakes Are Coming for Your Bank Account
OpenAI made the perfect tool for scammers. Donald Trump is on TikTok doing his morning routine. "Get ready with me for a big day," reads the caption, as the president holds a makeup brush to his cheek. The scene is a still, ostensibly a screenshot of a TikTok clip. Like so much other AI-generated slop coursing through the internet, the image is fake and ridiculous.