sandbox
AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems
Piao, Yun, Min, Hongbo, Su, Hang, Zhang, Leilei, Wang, Lei, Yin, Yue, Wu, Xiao, Xu, Zhejing, Qu, Liwei, Li, Hang, Zeng, Xinxin, Tian, Wei, Yu, Fei, Li, Xiaowei, Jiang, Jiayi, Liu, Tongxu, Tian, Hao, Que, Yufei, Tu, Xiaobing, Suo, Bing, Li, Yuebing, Chen, Xiangting, Zhao, Zeen, Tang, Jiaming, Huang, Wei, Li, Xuguang, Zhao, Jing, Li, Jin, Shen, Jie, Ren, Jinkui, Zhang, Xiantao
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model achieved more than 48% success rate improvement. Furthermore, our ASP protocol reduces bandwidth consumption by up to 50% compared to standard RDP, and in end-to-end latency with around 5% reduction, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.
- Research Report (0.82)
- Overview (0.68)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment (0.68)
A Multimodal Conversational Agent for Tabular Data Analysis
Awad, Mohammad Nour Al, Ivanov, Sergey, Tikhonova, Olga, Khodnenko, Ivan
Abstract--Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the suggested design combines OpenAI Whisper automatic speech recognition (ASR) system, Qwen-coder code generation LLM/model, custom sandboxed execution tools, and Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with user and code execution, constrained to a transparent sandbox, with simultaneously grounding prompts in schema-level context, the T alk2Data agent reliably retrieves actionable insights from tables while making computations verifiable. In the article, except for the T alk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants. Interacting with data often requires programming skills or statistical expertise, creating barriers for managers, analysts, and other non-technical users [1], [2]. Natural language interfaces (NLIs) aim to improve this information seeking process by allowing users to query data conversationally [3], [4]. At the same time, voice interfaces are becoming increasingly common in daily life, yet existing voice assistants remain limited: they can answer factual questions or control devices, but they lack the analytical capabilities needed for meaningful data exploration.
- Asia > Russia (0.14)
- Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.05)
- North America > United States > Texas > Andrews County (0.04)
Position Paper: If Innovation in AI Systematically Violates Fundamental Rights, Is It Innovation at All?
Castañeira, Josu Eguiluz, Brando, Axel, Laukyte, Migle, Serra-Vidal, Marc
Artificial intelligence (AI) now permeates critical infrastructures and decision-making systems where failures produce social, economic, and democratic harm. This position paper challenges the entrenched belief that regulation and innovation are opposites. As evidenced by analogies from aviation, pharmaceuticals, and welfare systems and recent cases of synthetic misinformation, bias and unaccountable decision-making, the absence of well-designed regulation has already created immeasurable damage. Regulation, when thoughtful and adaptive, is not a brake on innovation -- it is its foundation. The present position paper examines the EU AI Act as a model of risk-based, responsibility-driven regulation that addresses the Collingridge Dilemma: acting early enough to prevent harm, yet flexibly enough to sustain innovation. Its adaptive mechanisms -- regulatory sandboxes, small and medium enterprises (SMEs) support, real-world testing, fundamental rights impact assessment (FRIA) -- demonstrate how regulation can accelerate responsibly, rather than delay, technological progress. The position paper summarises how governance tools transform perceived burdens into tangible advantages: legal certainty, consumer trust, and ethical competitiveness. Ultimately, the paper reframes progress: innovation and regulation advance together. By embedding transparency, impact assessments, accountability, and AI literacy into design and deployment, the EU framework defines what responsible innovation truly means -- technological ambition disciplined by democratic values and fundamental rights.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Slovakia (0.04)
- (12 more...)
- Research Report (0.64)
- Overview (0.46)
- Transportation > Air (1.00)
- Law > Statutes (1.00)
- Law Enforcement & Public Safety (1.00)
- (5 more...)
Copilot AI's latest trick? A secure sandbox for its agentic activity
When you purchase through links in our articles, we may earn a small commission. Microsoft 365 users can now test Researcher with Computer Use, an autonomous agent that can access files that it couldn't before. Microsoft Copilot is tapping a key feature from Windows 11 Pro to enable Copilot's AI to dig even further than it already has. It's part of an update to Microsoft 365 Copilot called Researcher with Computer Use, debuting today for a limited subset of Microsoft 365 Copilot users. LLMs that engage in deep research, like Copilot, face a problem: some content is locked away behind an authentication process, like requiring a password.
Scalable Supervising Software Agents with Patch Reasoner
Xu, Junjielong, Tan, Boyin, Liu, Xiaoyuan, Peng, Chao, Gao, Pengfei, He, Pinjia
While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.
StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Chen, Zehao, Pan, Rong, Li, Haoran
Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- (7 more...)
- Overview (1.00)
- Research Report > Promising Solution (0.54)
- Government (0.93)
- Health & Medicine > Therapeutic Area (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Supplementary Material
Table 6: Mercury-eval encompasses 256 tasks, the difficulty of which has been balanced for model evaluation. Specifically, there are two primary constraints: a time limit and a memory limit. The memory limit caps the amount of RAM that a process can consume. The sandbox employs an isolated file system to provide a safe execution environment for the code. This is done to prevent code from using certain functions directly from the host's system libraries, which could result in unpredictable behavior Figure 5 shows the overview of the code execution pipeline.
The Precautionary Principle and the Innovation Principle: Incompatible Guides for AI Innovation Governance?
In policy debates concerning the governance and regulation of Artificial Intelligence (AI), both the Precautionary Principle (PP) and the Innovation Principle (IP) are advocated by their respective interest groups. Do these principles offer wholly incompatible and contradictory guidance? Does one necessarily negate the other? I argue here that provided attention is restricted to weak-form PP and IP, the answer to both of these questions is "No." The essence of these weak formulations is the requirement to fully account for type-I error costs arising from erroneously preventing the innovation's diffusion through society (i.e. mistaken regulatory red-lighting) as well as the type-II error costs arising from erroneously allowing the innovation to diffuse through society (i.e. mistaken regulatory green-lighting). Within the Signal Detection Theory (SDT) model developed here, weak-PP red-light (weak-IP green-light) determinations are optimal for sufficiently small (large) ratios of expected type-I to type-II error costs. For intermediate expected cost ratios, an amber-light 'wait-and-monitor' policy is optimal. Regulatory sandbox instruments allow AI testing and experimentation to take place within a structured environment of limited duration and societal scale, whereby the expected cost ratio falls within the 'wait-and-monitor' range. Through sandboxing regulators and innovating firms learn more about the expected cost ratio, and what respective adaptations -- of regulation, of technical solution, of business model, or combination thereof, if any -- are needed to keep the ratio out of the weak-PP red-light zone. Nevertheless AI foundation models are ill-suited for regulatory sandboxing as their general-purpose nature precludes credible identification of misclassification costs.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (27 more...)
- Law > Statutes (1.00)
- Law > Environmental Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (6 more...)
Virtual Agent Economies
Tomasev, Nenad, Franklin, Matija, Leibo, Joel Z., Jacobs, Julian, Cunningham, William A., Gabriel, Iason, Osindero, Simon
The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the "sandbox economy" as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI "mission economies" to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity's long-term collective flourishing.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (9 more...)
- Social Sector (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (5 more...)