execution environment
AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems
Piao, Yun, Min, Hongbo, Su, Hang, Zhang, Leilei, Wang, Lei, Yin, Yue, Wu, Xiao, Xu, Zhejing, Qu, Liwei, Li, Hang, Zeng, Xinxin, Tian, Wei, Yu, Fei, Li, Xiaowei, Jiang, Jiayi, Liu, Tongxu, Tian, Hao, Que, Yufei, Tu, Xiaobing, Suo, Bing, Li, Yuebing, Chen, Xiangting, Zhao, Zeen, Tang, Jiaming, Huang, Wei, Li, Xuguang, Zhao, Jing, Li, Jin, Shen, Jie, Ren, Jinkui, Zhang, Xiantao
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model achieved more than 48% success rate improvement. Furthermore, our ASP protocol reduces bandwidth consumption by up to 50% compared to standard RDP, and in end-to-end latency with around 5% reduction, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.
Quantum-like Coherence Derived from the Interaction between Chemical Reaction and Its Environment
Gunji, Yukio-Pegio, Adamatzky, Andrew, Mougkogiannis, Panagiotis, Khrenikov, Andrei
By uncovering the contrast between Artificial Intelligence and Natural-born Intelligence as a computational process, we define closed computing and open computing, and implement open computing within chemical reactions. This involves forming a mixture and invalidation of the computational process and the execution environment, which are logically distinct, and coalescing both to create a system that adjusts fluctuations. We model chemical reactions by considering the computation as the chemical reaction and the execution environment as the degree of aggregation of molecules that interact with the reactive environment. This results in a chemical reaction that progresses while repeatedly clustering and de-clustering, where concentration no longer holds significant meaning. Open computing is segmented into Token computing, which focuses on the individual behavior of chemical molecules, and Type computing, which focuses on normative behavior. Ultimately, both are constructed as an interplay between the two. In this system, Token computing demonstrates self-organizing critical phenomena, while Type computing exhibits quantum logic. Through their interplay, the recruitment of fluctuations is realized, giving rise to interactions between quantum logical subspaces corresponding to quantum coherence across different Hilbert spaces. As a result, spike waves are formed, enabling signal transmission. This occurrence may be termed quantum-like coherence, implying the source of enzymes responsible for controlling spike waves and biochemical rhythms.
SWE-smith: Scaling Data for Software Engineering Agents
Yang, John, Lieret, Kilian, Jimenez, Carlos E., Wettig, Alexander, Khandpur, Kabir, Zhang, Yanzhe, Hui, Binyuan, Press, Ofir, Schmidt, Ludwig, Yang, Diyi
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Boruch-Gruszecki, Aleksander, Zi, Yangtian, Wu, Zixuan, Oberoi, Tejas, Anderson, Carolyn Jane, Biswas, Joydeep, Guha, Arjun
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
SWE-bench Goes Live!
Zhang, Linghao, He, Shilin, Zhang, Chaoyun, Kang, Yu, Li, Bowen, Xie, Chengxing, Wang, Junhao, Wang, Maoquan, Huang, Yufan, Fu, Shengyu, Nallipogu, Elsie, Lin, Qingwei, Dang, Yingnong, Rajmohan, Saravan, Zhang, Dongmei
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
The Next Frontier of LLM Applications: Open Ecosystems and Hardware Synergy
Hou, Xinyi, Zhao, Yanjie, Wang, Haoyu
The second paradigm involves LLM agents developed using frameworks like LangChain [16], AutoGPT [11], Langroid [18], AutoGen [23], and LlamaIndex [22], which offer greater programmability and modularity, allowing developers to build sophisticated, multi-agent systems that integrate external tools and dynamic workflows [20]. Despite their advantages, both paradigms remain architecturally fragmented and lack standardized interoperability, leading to redundant development efforts and constrained scalability. From a software engineering (SE) perspective, current LLM application paradigms resemble traditional platform-centric software ecosystems, where applications are tightly coupled to proprietary APIs and execution environments. LLM app stores, while lowering the barrier to entry, impose constraints on extensibility and cross-platform interoperability, leading to vendor lock-in and duplicated development efforts across different ecosystems. In contrast, agent-based LLM frameworks provide modularity but lack standardized mechanisms for component reuse and integration, making it challenging to compose LLM applications that seamlessly operate across heterogeneous environments. This fragmentation mirrors historical challenges in SE, where monolithic architectures have given way to service-oriented and microservices-based designs to improve reusability, scalability, and maintainability. Another key limitation of existing LLM applications is inefficient hardware utilization.
NDAI Agreements
Stephenson, Matthew, Miller, Andrew, Sun, Xyn, Annem, Bhargav, Parikh, Rohan
We study a fundamental challenge in the economics of innovation: an inventor must reveal details of a new idea to secure compensation or funding, yet such disclosure risks expropriation. We present a model in which a seller (inventor) and buyer (investor) bargain over an information good under the threat of hold-up. In the classical setting, the seller withholds disclosure to avoid misappropriation, leading to inefficiency. We show that trusted execution environments (TEEs) combined with AI agents can mitigate and even fully eliminate this hold-up problem. By delegating the disclosure and payment decisions to tamper-proof programs, the seller can safely reveal the invention without risking expropriation, achieving full disclosure and an efficient ex post transfer. Moreover, even if the invention's value exceeds a threshold that TEEs can fully secure, partial disclosure still improves outcomes compared to no disclosure. Recognizing that real AI agents are imperfect, we model "agent errors" in payments or disclosures and demonstrate that budget caps and acceptance thresholds suffice to preserve most of the efficiency gains. Our results imply that cryptographic or hardware-based solutions can function as an "ironclad NDA," substantially mitigating the fundamental disclosure-appropriation paradox first identified by Arrow (1962) and Nelson (1959). This has far-reaching policy implications for fostering R&D, technology transfer, and collaboration.
A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data
Salve, Aniruddha, Attar, Saba, Deshmukh, Mahesh, Shivpuje, Sayali, Utsab, Arnab Mitra
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain-specific data into the generative process. While LLMs are highly capable, they often rely on static, pre-trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single-agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.
Asynchronous Tool Usage for Real-Time Agents
Ginart, Antonio A., Kodali, Naveen, Lee, Jason, Xiong, Caiming, Savarese, Silvio, Emmons, John
While frontier large language models (LLMs) are capable tool-using agents, current AI systems still operate in a strict turn-based fashion, oblivious to passage of time. This synchronous design forces user queries and tool-use to occur sequentially, preventing the systems from multitasking and reducing interactivity. To address this limitation, we introduce asynchronous AI agents capable of parallel processing and real-time tool-use. Our key contribution is an event-driven finite-state machine architecture for agent execution and prompting, integrated with automatic speech recognition and text-to-speech. Drawing inspiration from the concepts originally developed for real-time operating systems, this work presents both a conceptual framework and practical tools for creating AI agents capable of fluid, multitasking interactions.
Incorporation of Verifier Functionality in the Software for Operations and Network Attack Results Review and the Autonomous Penetration Testing System
Milbrath, Jordan, Straub, Jeremy
The software for operations and network attack results review (SONARR) and the autonomous penetration testing system (APTS) use facts and common properties in digital twin networks to represent real-world entities. However, in some cases fact values will change regularly, making it difficult for objects in SONARR and APTS to consistently and accurately represent their real-world counterparts. This paper proposes and evaluates the addition of verifiers, which check real-world conditions and update network facts, to SONARR. This inclusion allows SONARR to retrieve fact values from its executing environment and update its network, providing a consistent method of ensuring that the operations and, therefore, the results align with the real-world systems being assessed. Verifiers allow arbitrary scripts and dynamic arguments to be added to normal SONARR operations. This provides a layer of flexibility and consistency that results in more reliable output from the software.