A key term in the regret bound (36) is a weighted sum of the "variance-style" quantities $\{\mathrm{Var}_{\pi^k}(\ell^k)\}$. While the bound $\mathrm{Var}_{\pi^k}(\ell^k) \le \|\ell^k\|^2$ is orderwise tight in the worst-case scenario for a given iteration $k$, exploiting the problem-specific variance-type structure across time is crucial in sharpening the horizon dependence in many RL problems (e.g., Azar et al. [3], Jin et al. [30], Li et al. [41, 40]).

C.1 Preliminaries and notation

Let us start with some preliminary facts and notation.
Toward Understanding Security Issues in the Model Context Protocol Ecosystem
The Model Context Protocol (MCP) is an emerging open standard that enables AI-powered applications to interact with external tools through structured metadata. A rapidly growing ecosystem has formed around MCP, including a wide range of MCP hosts (e.g., Cursor, Windsurf, Claude Desktop, and Cline), MCP registries (e.g., mcp.so, MCP Market, MCP Store, Pulse MCP, Smithery, and npm), and thousands of community-contributed MCP servers. Although the MCP ecosystem is gaining traction, there has been little systematic study of its architecture and associated security risks. In this paper, we present the first comprehensive security analysis of the MCP ecosystem. We decompose the MCP ecosystem into three core components: hosts, registries, and servers, and study the interactions and trust relationships among them. Users search for servers on registries and configure them in the host, which translates LLM-generated output into external tool invocations provided by the servers and executes them. Our qualitative analysis reveals that hosts lack verification mechanisms for LLM-generated outputs, enabling malicious servers to manipulate model behavior and induce a variety of security threats, including but not limited to sensitive data exfiltration. Because registries lack a vetted server submission process, we also uncover a wide range of vulnerabilities that enable attackers to hijack servers. To support our analysis, we collect and analyze a dataset of 67,057 servers from six public registries. Our quantitative analysis demonstrates that a substantial number of servers can be hijacked by attackers. Finally, we propose practical defense strategies for MCP hosts, registries, and users. We responsibly disclosed our findings to affected hosts and registries.
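To make the host-side gap concrete, here is a minimal sketch of the kind of verification the abstract argues is missing: checking an LLM-proposed tool call against the tool metadata a configured server actually declares before executing it. The tool names, call format, and policy below are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch (not from the paper): a minimal host-side guard that
# validates an LLM-proposed tool call against the tools a configured MCP
# server actually declares, before translating it into an invocation.
# Tool names, schemas, and the call format below are hypothetical.

ALLOWED_TOOLS = {
    # tool name -> required argument names, as declared by the server's metadata
    "read_file": {"path"},
    "search_docs": {"query"},
}

def validate_tool_call(call: dict) -> bool:
    """Return True only if the LLM-generated call matches a declared tool."""
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        return False  # tool not declared by any configured server
    if set(args) != ALLOWED_TOOLS[name]:
        return False  # unexpected or missing arguments
    return True

# Example: a manipulated output asking for an undeclared tool is rejected.
print(validate_tool_call({"name": "exfiltrate", "arguments": {"data": "~/.ssh/id_rsa"}}))  # False
print(validate_tool_call({"name": "read_file", "arguments": {"path": "README.md"}}))       # True
```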
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
- Asia (0.04)
Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Allegrini, Edoardo, Shreekumar, Ananth, Celik, Z. Berkay
Agentic AI systems, which leverage multiple autonomous agents and Large Language Models (LLMs), are increasingly used to address complex, multi-step tasks. The safety, security, and functionality of these systems are critical, especially in high-stakes applications. However, the current ecosystem of inter-agent communication is fragmented, with protocols such as the Model Context Protocol (MCP) for tool access and the Agent-to-Agent (A2A) protocol for coordination being analyzed in isolation. This fragmentation creates a semantic gap that prevents the rigorous analysis of system properties and introduces risks such as architectural misalignment and exploitable coordination issues. To address these challenges, we introduce a modeling framework for agentic AI systems composed of two foundational models. The first, the host agent model, formalizes the top-level entity that interacts with the user, decomposes tasks, and orchestrates their execution by leveraging external agents and tools. The second, the task lifecycle model, details the states and transitions of individual sub-tasks from creation to completion, providing a fine-grained view of task management and error handling. Together, these models provide a unified semantic framework for reasoning about the behavior of multi-agent AI systems. Grounded in this framework, we define 17 properties for the host agent and 14 for the task lifecycle, categorized into liveness, safety, completeness, and fairness. Expressed in temporal logic, these properties enable formal verification of system behavior, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities. Through this effort, we introduce the first rigorously grounded, domain-agnostic framework for the systematic analysis, design, and deployment of correct, reliable, and robust agentic AI systems.
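As an illustration of the kind of temporal-logic property the abstract mentions (the predicates and formulas below are our own examples, not among the paper's 17 host-agent or 14 task-lifecycle properties), a liveness requirement that every created sub-task is eventually resolved and a safety requirement that no sub-task executes before being assigned could be written in LTL as:

```latex
% Illustrative LTL properties over hypothetical task-lifecycle predicates
\begin{align*}
\textbf{Liveness:} \quad & \mathbf{G}\,\bigl(\mathit{created}(t) \rightarrow \mathbf{F}\,(\mathit{completed}(t) \lor \mathit{failed}(t))\bigr)\\
\textbf{Safety:}   \quad & \mathbf{G}\,\bigl(\mathit{executing}(t) \rightarrow \mathit{assigned}(t)\bigr)
\end{align*}
```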
- Workflow (0.68)
- Research Report (0.64)
A Other related works
Let us discuss in passing additional prior works on learning equilibrium solutions in MARL, which have attracted an explosion of interest in recent years. Roughly speaking, previous NE-finding algorithms for two-player zero-sum Markov games can be categorized into model-based algorithms [52, 79, 43], value-based algorithms [4, 5, 73, 54, 31, 15], and policy-based algorithms [10, 22, 71, 82, 14, 81, 11]. In particular, Bai et al. [5], Jin et al. [31] developed the first algorithms to beat the curse of multiple agents in two-player zero-sum MGs, while Jin et al. [31], Daskalakis et al. [23], Mao and Başar [44], Song et al. [63] further demonstrated how to accomplish the same goal when learning other computationally tractable solution concepts (e.g., coarse correlated equilibria) in general-sum multi-player Markov games. The recent works Cui and Du [17, 18], Yan et al. [74] studied how to alleviate the sample size scaling with the number of agents in the presence of offline data, with Cui and Du [18] providing a sample-efficient algorithm that also learns NEs in multi-agent Markov games (despite computational intractability). We shall also briefly remark on the prior works that concern RL with a generative model.
Repairing Language Model Pipelines by Meta Self-Refining Competing Constraints at Runtime
Language Model (LM) pipelines can dynamically refine their outputs against programmatic constraints. However, their effectiveness collapses when faced with competing soft constraints, leading to inefficient backtracking loops where satisfying one constraint violates another. We introduce Meta Self-Refining, a framework that equips LM pipelines with a meta-corrective layer to repair these competitions at runtime/inference-time. Our approach monitors the pipeline's execution history to detect oscillatory failures. Upon detection, it invokes a meta-repairer LM that analyzes the holistic state of the backtracking attempts and synthesizes a strategic instruction to balance the competing requirements. This self-repair instruction guides the original LM out of a failing refining loop towards a successful output. Our results show Meta Self-Refining can successfully repair these loops, leading to more efficient LM programs.
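A minimal sketch of the oscillation detection the abstract describes: monitoring which constraint fails at each backtracking attempt and flagging the alternating pattern in which fixing one constraint breaks another. The detection rule, window size, and names below are our assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: detect an oscillatory failure in a refinement loop.
# `history` records which soft constraint was violated at each retry, e.g.
# ["too_long", "missing_citation", "too_long", "missing_citation"].
# The window size and rule below are assumptions for illustration.

def is_oscillating(history: list[str], window: int = 4) -> bool:
    """Flag runs where two constraints alternate, so fixing one breaks the other."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return (
        len(set(recent)) == 2                                  # exactly two competing constraints
        and all(a != b for a, b in zip(recent, recent[1:]))    # strictly alternating failures
    )

history = ["too_long", "missing_citation", "too_long", "missing_citation"]
if is_oscillating(history):
    # Here a meta-repairer LM would be prompted with the full attempt history
    # to synthesize a strategic instruction balancing both constraints.
    print("competing constraints detected:", set(history))
```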
- North America > United States (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
URSA: The Universal Research and Scientific Agent
Grosskopf, Michael, Bent, Russell, Somasundaram, Rahul, Michaud, Isaac, Lui, Arthur, Debardeleben, Nathan, Lawrence, Earl
Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks. These skills overlap significantly with those that human scientists use day-to-day to solve complex problems that drive the cutting edge of research. Using LLMs in "agentic" AI has the potential to revolutionize modern science and remove bottlenecks to progress. In this work, we present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. This work describes the architecture of URSA and presents examples that highlight the potential of the system.
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)
- Europe > United Kingdom > England (0.04)
- Workflow (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Energy (0.46)
- Information Technology (0.46)
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
Gioacchini, Luca, Siracusano, Giuseppe, Sanvito, Davide, Gashteovski, Kiril, Friede, David, Bifulco, Roberto, Lawrence, Carolin
The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones of efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To address these issues, we propose AgentQuest -- a framework where (i) both benchmarks and metrics are modular and easily extensible through well-documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we make it available at https://github.com/nec-research/agentquest.
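As a sketch of what a modular, step-level metric could look like in this spirit (the interface and class names here are assumptions for illustration, not AgentQuest's actual API; see the repository for that):

```python
# Illustrative sketch only: a minimal, extensible metric interface of the kind
# the abstract describes. Class and method names are assumptions, not
# AgentQuest's actual API.
from abc import ABC, abstractmethod

class StepMetric(ABC):
    """A metric updated after every agent step, not just at task completion."""
    @abstractmethod
    def update(self, observation: str) -> None: ...
    @abstractmethod
    def value(self) -> float: ...

class MilestoneProgress(StepMetric):
    """Fraction of known intermediate milestones the agent has reached so far."""
    def __init__(self, milestones: list[str]):
        self.milestones = milestones
        self.reached: set[str] = set()

    def update(self, observation: str) -> None:
        for m in self.milestones:
            if m in observation:
                self.reached.add(m)

    def value(self) -> float:
        return len(self.reached) / len(self.milestones)

metric = MilestoneProgress(["logged in", "found file", "submitted answer"])
metric.update("agent output: logged in successfully")
print(metric.value())  # 0.33... after one of three milestones
```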
- Europe > North Macedonia > Skopje Statistical Region > Skopje Municipality > Skopje (0.04)
- Europe > Italy > Piedmont > Turin Province > Turin (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
Load Testing SageMaker Multi-Model Endpoints
Productionizing Machine Learning models is a complicated practice. There's a lot of iteration across different model parameters, hardware configurations, and traffic patterns that you will have to test before you can finalize a production-grade deployment. Load testing is an essential software engineering practice, and it is just as crucial in the MLOps space for seeing how performant your model is in a real-world setting. How can we load test? A simple yet highly effective framework is the Python package Locust. Locust can be used in both a vanilla and a distributed mode to simulate up to thousands of Transactions Per Second (TPS).
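As a sketch of what such a test could look like, the Locust user below calls a SageMaker multi-model endpoint through boto3 and reports each invocation back to Locust. The endpoint name, target model, and payload are placeholders, and this is one common pattern rather than the only way to wire Locust to SageMaker.

```python
import time
import boto3
from locust import User, task, between

# Placeholder endpoint and model names -- replace with your own.
ENDPOINT_NAME = "my-multi-model-endpoint"
TARGET_MODEL = "model-a.tar.gz"

class SageMakerUser(User):
    wait_time = between(0.5, 1.5)

    def on_start(self):
        # One SageMaker runtime client per simulated user.
        self.runtime = boto3.client("sagemaker-runtime")

    @task
    def invoke(self):
        payload = b'{"inputs": [1.0, 2.0, 3.0]}'
        start = time.perf_counter()
        exc, length = None, 0
        try:
            resp = self.runtime.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                TargetModel=TARGET_MODEL,  # routes the request to one model on the endpoint
                ContentType="application/json",
                Body=payload,
            )
            length = len(resp["Body"].read())
        except Exception as e:  # report failures to Locust instead of crashing the user
            exc = e
        self.environment.events.request.fire(
            request_type="sagemaker",
            name=f"invoke_endpoint/{TARGET_MODEL}",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=length,
            exception=exc,
        )
```

You would run this with something like `locust -f locustfile.py --headless -u 50 -r 10 --run-time 2m` to ramp up 50 simulated users and watch latency percentiles as TPS grows.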
Globus Automation Services: Research process automation across the space-time continuum
Chard, Ryan, Pruyne, Jim, McKee, Kurt, Bryan, Josh, Raumann, Brigitte, Ananthakrishnan, Rachana, Chard, Kyle, Foster, Ian
Research process automation -- the reliable, efficient, and reproducible execution of linked sets of actions on scientific instruments, computers, data stores, and other resources -- has emerged as an essential element of modern science. We report here on new services within the Globus research data management platform that enable the specification of diverse research processes as reusable sets of actions, \emph{flows}, and the execution of such flows in heterogeneous research environments. To support flows with broad spatial extent (e.g., from scientific instrument to remote data center) and temporal extent (from seconds to weeks), these Globus automation services feature: 1) cloud hosting for reliable execution of even long-lived flows despite sporadic failures; 2) a simple specification and extensible asynchronous action provider API, for defining and executing a wide variety of actions and flows involving heterogeneous resources; 3) an event-driven execution model for automating execution of flows in response to arbitrary events; and 4) a rich security model enabling authorization delegation mechanisms for secure execution of long-running actions across distributed resources. These services permit researchers to outsource and automate the management of a broad range of research tasks to a reliable, scalable, and secure cloud platform. We present use cases for Globus automation services, describe their design and implementation, present microbenchmark studies, and review experiences applying the services in a range of applications.
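To make the notion of a flow concrete, here is a minimal single-action flow definition, sketched as a Python dict in the States-Language-style format Globus flows use. The action URL and parameter names are indicative rather than authoritative, and the endpoint IDs and paths are placeholders; consult the Globus documentation for the exact action provider schemas.

```python
# Indicative sketch of a one-step flow: a single transfer action, then done.
# Endpoint IDs and paths are placeholders.
flow_definition = {
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                "source_endpoint_id": "<SOURCE-ENDPOINT-UUID>",
                "destination_endpoint_id": "<DEST-ENDPOINT-UUID>",
                "transfer_items": [
                    {
                        "source_path": "/instrument/run42/",
                        "destination_path": "/project/run42/",
                        "recursive": True,
                    }
                ],
            },
            "ResultPath": "$.TransferResult",
            "End": True,
        }
    },
}
```

A flow like this would be registered once with the automation services and then run many times, including automatically in response to events such as new data appearing at the instrument.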
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Japan > Honshū > Kansai > Hyogo Prefecture > Kobe (0.04)
- Workflow (0.70)
- Research Report (0.70)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Energy (0.93)
Bea Stollnitz - Creating batch endpoints in Azure ML
Suppose you've trained a machine learning model to accomplish some task, and you'd now like to provide that model's inference capabilities as a service. Maybe you're writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints -- they provide a simple web-based API for feeding data to your model and getting back inference results. Azure ML currently supports three types of endpoints: batch endpoints, Kubernetes online endpoints, and managed online endpoints. I'm going to focus on batch endpoints in this post, but let me start by explaining how the three types differ. Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage.
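As a sketch of what this looks like with the Azure ML Python SDK v2 (resource names and the data path below are placeholders, the invoke call follows SDK v2 examples and may differ slightly across SDK versions, and a batch deployment with a registered model and compute cluster still has to be attached before the endpoint can score anything):

```python
# One way to create and call a batch endpoint with the Azure ML Python SDK v2.
# Subscription, resource group, workspace, and data path are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.entities import BatchEndpoint

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION-ID>",
    resource_group_name="<RESOURCE-GROUP>",
    workspace_name="<WORKSPACE>",
)

# Create (or update) the batch endpoint itself.
endpoint = BatchEndpoint(
    name="my-batch-endpoint",
    description="Scores large batches asynchronously",
)
ml_client.batch_endpoints.begin_create_or_update(endpoint).result()

# After a deployment exists, submit a scoring job; results land in blob storage.
job = ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name,
    input=Input(path="azureml://datastores/workspaceblobstore/paths/my-input-data/"),
)
print(job.name)
```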