concurrency


SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data

Pradhan, Bidyapati, Dasgupta, Surajit, Saha, Amit Kumar, Anustoop, Omkar, Puttagunta, Sriram, Mittal, Vipul, Sarda, Gopal

arXiv.org Artificial Intelligence

The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT) and alignment tasks such as Direct Preference Optimization (DPO). In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
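The dual-stage tagging idea can be sketched as a cheap rule pass followed by a model-based score. All names below (heuristic_pass, llm_score, filter_samples) are illustrative, not SyGra's actual API, and the LLM judge is stubbed with a toy length-based proxy:

```python
# Sketch of a dual-stage quality filter in the spirit of the tagging
# mechanism described above. Hypothetical names; the LLM scorer is a stub.

def heuristic_pass(sample: dict) -> bool:
    """Stage 1: cheap rule-based checks on an OASST-style message."""
    text = sample.get("text", "")
    if len(text) < 20:        # too short to be a useful training sample
        return False
    if text.count("?") > 10:  # likely malformed or spammy
        return False
    return True

def llm_score(sample: dict) -> float:
    """Stage 2: stand-in for an LLM judge returning a score in [0, 1]."""
    # A real implementation would call a model; this is a toy proxy.
    return min(1.0, len(sample["text"]) / 200)

def filter_samples(samples: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep samples that pass the heuristics AND score above the threshold."""
    kept = []
    for s in samples:
        if heuristic_pass(s) and llm_score(s) >= threshold:
            kept.append({**s, "quality": llm_score(s)})
    return kept
```

Running both stages in this order keeps the expensive LLM calls off samples the rules already reject.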


Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Wang, Dong, Li, Yang, Ni, Ansong, Yeh, Ching-Feng, Emad, Youssef, Lei, Xinjie, Robbins, Liam, Padthe, Karthik, Xu, Hu, Li, Xian, Celikyilmaz, Asli, Raghavendra, Ramya, Huang, Lifei, Wu, Carole-Jean, Li, Shang-Wen

arXiv.org Artificial Intelligence

Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2-15x higher data generation throughput under identical hardware resources, without compromising output quality.
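The queue-passing design above can be illustrated with a single-process toy: both control and data flow travel as serialized messages, and each stage pulls work independently with no orchestrator. Stage names and message fields are invented for illustration; the real framework builds on Ray with distributed queues and services:

```python
import json
import queue

# Toy peer-to-peer pipeline: agents communicate only via serialized
# messages on queues. Illustrative sketch, not the Matrix API.
draft_q: queue.Queue = queue.Queue()
review_q: queue.Queue = queue.Queue()
done_q: queue.Queue = queue.Queue()

def drafter():
    """Agent 1: turn a prompt into a draft and forward it as a message."""
    while not draft_q.empty():
        msg = json.loads(draft_q.get())
        msg["draft"] = f"answer to: {msg['prompt']}"
        review_q.put(json.dumps(msg))

def reviewer():
    """Agent 2: tag the draft with a verdict, again purely via messages."""
    while not review_q.empty():
        msg = json.loads(review_q.get())
        msg["approved"] = len(msg["draft"]) > 0
        done_q.put(json.dumps(msg))

def run_pipeline(prompts):
    for p in prompts:
        draft_q.put(json.dumps({"prompt": p}))
    drafter()
    reviewer()
    return [json.loads(done_q.get()) for _ in range(done_q.qsize())]
```

Because state lives in the messages rather than in a coordinator, any number of drafter or reviewer workers could drain the same queues concurrently.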


Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI

Kolluru, Saicharan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, from conversational AI to code generation and content creation [1, 2, 3]. However, the deployment of these models in production environments presents significant engineering challenges. The computational demands of autoregressive text generation, combined with the massive parameter counts of modern LLMs, necessitate specialized serving infrastructure that can efficiently manage GPU resources while meeting application-specific performance requirements. The serving infrastructure for LLMs must address several competing objectives: maximizing throughput to serve many concurrent users, minimizing latency for responsive user experiences, and efficiently utilizing expensive GPU resources. Different applications prioritize these objectives differently--a chatbot requires low latency for individual requests, while a batch document processing system prioritizes throughput. This variation in requirements has led to the development of specialized serving frameworks, each making different design trade-offs. Among the available open-source solutions, vLLM [4] and HuggingFace Text Generation Inference (TGI) [5] have emerged as leading frameworks, widely adopted in both research and production settings.
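The throughput/latency tension described above can be made concrete with a toy cost model (fixed per-batch overhead plus per-token time). The numbers are invented for illustration and are not measurements of vLLM or TGI:

```python
# Toy model of batched autoregressive serving: larger batches amortize
# fixed overhead (throughput up) but every request waits for the whole
# batch (latency up). All constants are illustrative.

def simulate(batch_size: int, tokens: int = 100,
             overhead_s: float = 0.05, per_token_s: float = 0.001) -> dict:
    batch_time = overhead_s + tokens * per_token_s * batch_size
    return {
        "latency_s": batch_time,                    # one request's wall time
        "throughput_rps": batch_size / batch_time,  # requests per second
    }
```

Under this model a chatbot would pick a small batch for low latency, while a batch document processor would pick a large one for throughput, mirroring the trade-off the frameworks make differently.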


MURMUR: Using cross-user chatter to break collaborative language agents in groups

Patlan, Atharv Singh, Sheng, Peiyao, Hebbar, S. Ashwin, Mittal, Prateek, Viswanath, Pramod

arXiv.org Artificial Intelligence

Language agents are rapidly expanding from single-user assistants to multi-user collaborators in shared workspaces and groups. However, today's language models lack a mechanism for isolating user interactions and concurrent tasks, creating a new attack vector inherent to this new setting: cross-user poisoning (CUP). In a CUP attack, an adversary injects ordinary-looking messages that poison the persistent, shared state, which later triggers the agent to execute unintended, attacker-specified actions on behalf of benign users. We validate CUP on real systems, successfully attacking popular multi-user agents. To study the phenomenon systematically, we present MURMUR, a framework that composes single-user tasks into concurrent, group-based scenarios using an LLM to generate realistic, history-aware user interactions. We observe that CUP attacks succeed at high rates and their effects persist across multiple tasks, thus posing fundamental risks to multi-user LLM deployments. Finally, we introduce a first-step defense with task-based clustering to mitigate this new class of vulnerability.
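The task-based clustering defense direction can be sketched as follows: instead of feeding the agent one shared history, only messages from the cluster of the task at hand reach its context, so a poisoned message in one task cannot steer actions in another. Here the clustering key (task_id) is an illustrative stand-in for whatever classifier assigns messages to tasks:

```python
from collections import defaultdict

# Illustrative sketch of task-scoped context isolation, not MURMUR's
# actual defense implementation.

def cluster_by_task(messages: list[dict]) -> dict[str, list[dict]]:
    """Group shared-workspace messages by the task they belong to."""
    clusters = defaultdict(list)
    for m in messages:
        clusters[m["task_id"]].append(m)
    return dict(clusters)

def context_for(task_id: str, messages: list[dict]) -> list[str]:
    """Build the agent's context from the matching cluster only."""
    return [m["text"] for m in cluster_by_task(messages).get(task_id, [])]
```

The open problem, of course, is assigning the cluster key reliably when the adversary crafts messages to look like they belong to the victim's task.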



Beyond Benchmarks: The Economics of AI Inference

Zhuang, Boqin, Qiao, Jiacheng, Liu, Mingqian, Yu, Mingxing, Hong, Ping, Li, Rui, Song, Xiaoxia, Xu, Xiangjun, Chen, Xu, Ma, Yaoyao, Gao, Yujie

arXiv.org Artificial Intelligence

The inference cost of Large Language Models (LLMs) has become a critical factor in determining their commercial viability and widespread adoption. This paper introduces a quantitative "economics of inference" framework, treating the LLM inference process as a compute-driven intelligent production activity. We analyze its marginal cost, economies of scale, and quality of output under various performance configurations. Based on empirical data from WiNEval-3.0, we construct the first "LLM Inference Production Frontier," revealing three principles: diminishing marginal cost, diminishing returns to scale, and an optimal cost-effectiveness zone. This paper not only provides an economic basis for model deployment decisions but also lays an empirical foundation for the future market-based pricing and optimization of AI inference resources.
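"Diminishing marginal cost" and "diminishing returns to scale" can be illustrated numerically: fixed GPU cost is spread over more generated tokens as concurrency grows, while each extra stream adds slightly less throughput. The cost model below is invented for illustration and is not taken from WiNEval-3.0:

```python
# Toy production function for LLM inference: per-token cost falls with
# concurrency, but each added stream contributes less throughput.
# All constants are illustrative assumptions.

def cost_per_token(concurrency: int,
                   gpu_cost_per_s: float = 2.0,
                   tokens_per_s_per_stream: float = 50.0,
                   efficiency_decay: float = 0.98) -> float:
    """Each extra stream adds efficiency_decay**i of the nominal rate."""
    total_tps = sum(tokens_per_s_per_stream * efficiency_decay**i
                    for i in range(concurrency))
    return gpu_cost_per_s / total_tps
```

Plotting this over concurrency would trace a frontier with a flattening cost curve, i.e. an optimal cost-effectiveness zone before returns to scale run out.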


Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with Nest

Huo, Yuqian, Quiroga, David, Kyrillidis, Anastasios, Patel, Tirthak

arXiv.org Artificial Intelligence

Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, Nest, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use Nest to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.
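The core scheduling idea, running early VQA iterations on low-fidelity qubits and reserving high-fidelity qubits for later iterations, can be sketched as a simple tiered schedule. Tier names and the even split are illustrative choices, not Nest's actual fidelity mapping:

```python
# Toy fidelity schedule in the spirit of the idea above: cheap qubits
# early, best qubits for the final iterations. Illustrative only.

def fidelity_schedule(total_iters: int,
                      tiers=("low", "medium", "high")) -> list[str]:
    """Split the iteration budget evenly across fidelity tiers, in order."""
    per_tier = total_iters // len(tiers)
    schedule = []
    for t in tiers:
        schedule += [t] * per_tier
    # Any remainder goes to the best tier, where accuracy matters most.
    schedule += [tiers[-1]] * (total_iters - len(schedule))
    return schedule
```

Freeing the high-fidelity qubits during early iterations is also what lets multiple VQAs be co-located on one machine, raising system throughput.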


FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning

Zhang, Yizhou, Lv, Ning, Wang, Teng, Dang, Jisheng

arXiv.org Artificial Intelligence

Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO
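The concurrency-aware adjustment can be sketched as a simple policy: speculative decoding helps most when the GPU has slack, so the number of draft tokens proposed per step shrinks as concurrency rises. The thresholds below are illustrative, not the paper's tuned strategy:

```python
# Illustrative draft-length policy: fewer speculative tokens when many
# GRPO responses are being generated at once, since verification compute
# becomes the scarce resource. Thresholds are made-up assumptions.

def draft_length(concurrency: int, max_draft: int = 8) -> int:
    if concurrency <= 4:
        return max_draft       # plenty of slack: draft aggressively
    if concurrency <= 32:
        return max_draft // 2  # moderate load: shorter drafts
    return 1                   # saturated: near-autoregressive fallback
```

A production version would measure real-time batch occupancy rather than a raw request count, and would be paired with the online draft-model updates the paper describes.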


Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Chen, Peng, Zhang, Jiaji, Zhao, Hailiang, Zhang, Yirong, Yu, Jiahong, Tang, Xueyan, Wang, Yixuan, Li, Hao, Zou, Jianping, Xiong, Gang, Chow, Kingsum, He, Shuibing, Deng, Shuiguang

arXiv.org Artificial Intelligence

In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
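The LARU idea, an LRU cache that consults a learned next-reuse predictor but tracks the predictor's observed error online and falls back to plain LRU when that error is high, can be sketched as follows. The predictor interface and error threshold are illustrative, not the paper's implementation:

```python
from collections import OrderedDict

# Simplified LARU-style cache sketch: prediction-guided eviction with a
# graceful fallback to LRU under high observed prediction error.

class LaruCache:
    def __init__(self, capacity, predictor, error_threshold=0.5):
        self.capacity = capacity
        self.predictor = predictor        # key -> predicted next-reuse distance
        self.error_threshold = error_threshold
        self.data = OrderedDict()         # iteration order = LRU order
        self.errors = []                  # online error samples

    def note_error(self, err):
        """Record one observed prediction error (fed in by the caller)."""
        self.errors.append(err)

    def trusted(self):
        if not self.errors:
            return True
        return sum(self.errors) / len(self.errors) < self.error_threshold

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)    # refresh LRU position
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            if self.trusted():
                # evict the entry predicted to be reused farthest away
                victim = max(self.data, key=self.predictor)
            else:
                victim = next(iter(self.data))  # plain LRU fallback
            del self.data[victim]
        self.data[key] = value
```

With a perfect predictor this approximates farthest-in-future eviction; with a broken one the running error pushes `trusted()` false and behavior degrades to ordinary LRU, matching the robustness property claimed above.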


Scalable Offline ASR for Command-Style Dictation in Courtrooms

Nethil, Kumarmanas, Mishra, Vaibhav, Anandan, Kriti, Manohar, Kavya

arXiv.org Artificial Intelligence

We propose an open-source framework for command-style dictation that addresses the gap between resource-intensive online systems and high-latency batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and transcribes these segments in parallel using Whisper models, enabling efficient multiplexing across audios. Unlike proprietary systems like SuperWhisper, this framework is also compatible with most ASR architectures, including widely used CTC-based models. Our multiplexing technique maximizes compute utilization in real-world settings, as demonstrated by its deployment in around 15% of India's courtrooms. Evaluations on live data show consistent latency reduction as user concurrency increases, compared to sequential batch processing. The live demonstration will showcase our open-sourced implementation and allow attendees to interact with it in real-time.
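The multiplexing idea can be sketched in a few lines: VAD splits each audio into speech segments, and segments from all users' audios are pooled and transcribed concurrently, keeping the ASR workers busy. Both the VAD and the Whisper call are stubbed below; the fixed-size framing is an invented stand-in for real speech-boundary detection:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of segment-level multiplexing across audios,
# not the framework's actual pipeline. VAD and ASR are stubs.

def vad_segments(audio: list[float], frame: int = 4) -> list[list[float]]:
    """Stand-in VAD: chop the signal into fixed-size frames."""
    return [audio[i:i + frame] for i in range(0, len(audio), frame)]

def transcribe(segment: list[float]) -> str:
    """Stand-in for a Whisper (or CTC) model call."""
    return f"<{len(segment)} samples>"

def multiplex_transcribe(audios: list[list[float]], workers: int = 4) -> list[str]:
    """Pool segments from all audios and transcribe them concurrently."""
    segments = [seg for a in audios for seg in vad_segments(a)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe, segments))
```

Pooling at the segment level, rather than queuing whole recordings, is what lets latency stay flat as user concurrency grows: short utterances from one courtroom fill the gaps left by another.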