mip
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
Optimization Algorithms
A.1 Proof of Monotonicity and Submodularity. In Equation (3a), we stated the objective of the knapsack cover. Remark 1. f+M is monotonically increasing. A.2 Knapsack Cover. To find a solution to Problem (3), we use the greedy algorithm proposed by Badanidiyuru and Vondrák [2], which deals with submodular maximization subject to a system of l knapsack constraints and p matroid constraints. We present an adapted version of the algorithm in Algorithm 2, where l = 1. The algorithm's parameter allows us to trade off solution time against solution quality. In this work, we set it to 0.2.
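As a rough illustration of greedy submodular maximization under a single knapsack constraint, the sketch below uses the plain cost-benefit rule: repeatedly add the affordable item with the best marginal gain per unit cost. This is a simplified stand-in, not the thresholded algorithm of Badanidiyuru and Vondrák or the paper's Algorithm 2, and it has no accuracy parameter. The function name `density_greedy`, the coverage objective, and the toy data are all assumptions made for the example; item costs are assumed to be positive.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Set


def density_greedy(
    items: Dict[Hashable, float],
    value: Callable[[FrozenSet], float],
    budget: float,
) -> Set[Hashable]:
    """Greedily add the affordable item with the best marginal gain per unit cost."""
    chosen: Set[Hashable] = set()
    spent = 0.0
    while True:
        base = value(frozenset(chosen))
        best_item, best_ratio = None, 0.0
        for item, cost in items.items():
            # Skip items already chosen or that would exceed the budget.
            if item in chosen or spent + cost > budget:
                continue
            gain = value(frozenset(chosen | {item})) - base
            ratio = gain / cost  # assumes cost > 0
            if ratio > best_ratio:
                best_item, best_ratio = item, ratio
        if best_item is None:
            # Nothing affordable improves the objective any further.
            return chosen
        chosen.add(best_item)
        spent += items[best_item]


# Hypothetical toy instance: each item covers a set of elements.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
costs = {"a": 2.0, "b": 1.0, "c": 4.0, "d": 1.0}


def coverage(selection: FrozenSet) -> float:
    """Monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for name in selection:
        covered |= covers[name]
    return float(len(covered))


picked = density_greedy(costs, coverage, budget=4.0)
print(sorted(picked))
```

On this toy instance the rule selects items "a" and "b". The plain density rule on its own does not carry a constant-factor approximation guarantee, which is one reason more careful variants exist.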
A Greedy Approach for Budgeted Maximum Inner Product Search
Maximum Inner Product Search (MIPS) is an important task in many machine learning applications such as the prediction phase of low-rank matrix factorization models and deep learning models. Recently, there has been substantial research on how to perform MIPS in sub-linear time, but most of the existing work does not have the flexibility to control the trade-off between search efficiency and search quality. In this paper, we study the important problem of MIPS with a computational budget. By carefully studying the problem structure of MIPS, we develop a novel Greedy-MIPS algorithm, which can handle budgeted MIPS by design. While simple and intuitive, Greedy-MIPS yields surprisingly superior performance compared to state-of-the-art approaches. As a specific example, on a candidate set containing half a million vectors of dimension 200, Greedy-MIPS runs 200x faster than the naive approach while yielding search results with the top-5 precision greater than 75%.
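For reference, the MIPS task itself can be stated as a short exhaustive scan, which is what a naive approach usually amounts to: score every candidate against the query and keep the top k. The sketch below is this baseline only, not Greedy-MIPS, and it has no budget control. The function name `naive_mips` and the toy vectors are assumptions made for the example.

```python
import heapq
from typing import List, Sequence


def naive_mips(
    candidates: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[int]:
    """Return the indices of the k candidates with the largest inner product."""
    # One inner product per candidate: cost grows linearly with the candidate set.
    scores = [sum(c * q for c, q in zip(vec, query)) for vec in candidates]
    # Indices of the k largest scores, in descending order of score.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Hypothetical toy data: four 2-dimensional candidates and one query.
candidates = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 3.0]]
query = [1.0, 2.0]
top2 = naive_mips(candidates, query, k=2)
print(top2)  # [3, 1]
```

Every query touches every candidate here, so the work is fixed by the size of the candidate set. A budgeted method instead caps how many candidates are fully scored.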
Much Ado About Noising: Dispelling the Myths of Generative Robotic Control
Pan, Chaoyi, Anantharaman, Giri, Huang, Nai-Chieh, Jin, Claire, Pfrommer, Daniel, Yuan, Chenyang, Permenter, Frank, Qu, Guannan, Boffi, Nicholas, Shi, Guanya, Simchowitz, Max
Long-horizon, dexterous manipulation tasks such as furniture assembly, food preparation, and manufacturing have been a holy grail in robotics. Recent large robot action models (Team et al., 2025; Black et al., 2024; Kim et al., 2024) have made substantial breakthroughs towards these goals by imitating expert demonstrations of diverse qualities. We provide a more comprehensive review of related work in Section 6, but highlight here a key trend: while supervised learning from demonstration, also known as behavior cloning (BC), has been applied across domains for decades (Pomerleau, 1988), its recent success in robotic manipulation has coincided with the adoption of what we term generative control policies (GCPs): robotic control policies that use generative modeling architectures, such as diffusion models, flow models, and autoregressive transformers, as parameterizations of the mapping from observation to action. Given the seemingly transformative nature of GCPs for robot learning, there has been much speculation about the origin of their superior performance relative to policies trained with a regression loss, henceforth regression control policies (RCPs). GCPs, by modeling conditional distributions over actions, are uniquely suited to the multi-task pretraining paradigm popular in today's large robotic models.