
Make a Video Call with LLM: A Measurement Campaign over Five Mainstream Apps

Xu, Jiayang, Huang, Xiangjie, Li, Zijie, Meng, Zili

arXiv.org Artificial Intelligence

In 2025, Large Language Model (LLM) services launched a new feature -- AI video chat -- allowing users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark with carefully designed metrics across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we further evaluate five mainstream AI video chatbots with this benchmark. This work provides the research community with a baseline of real-world performance and identifies unique system bottlenecks. At the same time, our benchmarking results open up several research questions for future optimizations of AI video chatbots.
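One latency metric such a benchmark would plausibly record is end-to-end response delay, from the end of the user's utterance to the first frame of the agent's reply. The sketch below is purely illustrative (the function names and the assumption that the testbed supplies timestamps are mine, not the paper's):

```python
# Hypothetical sketch of a response-delay metric for an AI video chat
# benchmark: pair each user utterance end with the agent's reply start
# and summarize the resulting delays. Timestamps (in seconds) are
# assumed to come from the measurement testbed.
import statistics

def response_delays(utterance_ends, reply_starts):
    """Delay (s) between each utterance end and the matching reply start."""
    return [r - u for u, r in zip(utterance_ends, reply_starts)]

def summarize(delays):
    """Basic distribution summary of the observed delays."""
    return {
        "p50": statistics.median(delays),
        "mean": statistics.fmean(delays),
        "max": max(delays),
    }
```

A real benchmark would add per-dimension metrics (video quality scores, system overhead counters), but the pairing-and-summarizing pattern above is the core of any latency measurement.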


RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation

Ray, Siddhant, Pan, Rui, Gu, Zhuohan, Du, Kuntai, Ananthanarayanan, Ganesh, Netravali, Ravi, Jiang, Junchen

arXiv.org Artificial Intelligence

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (by tuning the RAG workflow), but neither optimizes the tradeoff between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with state-of-the-art RAG optimization schemes, RAGServe reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.
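To make the idea of per-query configuration adaptation concrete, here is a deliberately simplified sketch: estimate how hard a query is, then map that estimate to a retrieval budget and a synthesis method. The complexity heuristic, thresholds, and configuration names are all hypothetical, not RAGServe's actual mechanism:

```python
# Illustrative per-query RAG configuration adaptation (not the paper's
# algorithm): simple queries get few chunks stuffed into one prompt
# (low delay); complex queries get more chunks and a map-reduce style
# synthesis (higher quality, higher delay).

def estimate_complexity(query: str) -> int:
    """Crude proxy: more clause/question markers suggest a harder query."""
    markers = ("and", "or", "compare", "why", "how", "versus")
    return sum(query.lower().count(m) for m in markers) + len(query.split()) // 20

def choose_config(query: str) -> dict:
    """Map estimated complexity to (num_chunks, synthesis) settings."""
    c = estimate_complexity(query)
    if c <= 1:
        return {"num_chunks": 2, "synthesis": "stuff"}
    elif c <= 3:
        return {"num_chunks": 6, "synthesis": "stuff"}
    else:
        return {"num_chunks": 12, "synthesis": "map_reduce"}
```

The design point the abstract makes is that this choice should be per-query and co-decided with scheduling, rather than one static configuration for the whole workload.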


Learn to Compress (LtC): Efficient Learning-based Streaming Video Analytics

Alam, Quazi Mishkatul, Haque, Israat, Abu-Ghazaleh, Nael

arXiv.org Artificial Intelligence

Video analytics are often performed as cloud services in edge settings, mainly to offload computation, and also in situations where the results are not directly consumed at the video sensors. Sending high-quality video data from the edge devices can be expensive both in terms of bandwidth and power use. In order to build a streaming video analytics pipeline that makes efficient use of these resources, it is therefore imperative to reduce the size of the video stream. Traditional video compression algorithms are unaware of the semantics of the video, and can be both inefficient and harmful for the analytics performance. In this paper, we introduce LtC, a collaborative framework between the video source and the analytics server that efficiently learns to reduce the video streams within an analytics pipeline. Specifically, LtC uses the full-fledged analytics algorithm at the server as a teacher to train a lightweight student neural network, which is then deployed at the video source. The student network is trained to comprehend the semantic significance of various regions within the videos, which is used to differentially preserve the crucial regions in high quality while the remaining regions undergo aggressive compression. Furthermore, LtC also incorporates a novel temporal filtering algorithm based on feature-differencing to omit transmitting frames that do not contribute new information. Overall, LtC is able to use 28-35% less bandwidth and has up to 45% shorter response delay compared to recently published state-of-the-art streaming frameworks, while achieving similar analytics performance.
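The temporal-filtering idea (skip frames whose features barely change) can be sketched in a few lines. This is the general pattern only, under the assumption of a fixed Euclidean threshold on per-frame feature vectors; LtC's actual filter is not reproduced here:

```python
# Illustrative feature-differencing temporal filter: transmit a frame
# only if its feature vector has moved far enough from the features of
# the last frame that was transmitted.
import math

def feature_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class TemporalFilter:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_sent = None  # features of the last transmitted frame

    def should_send(self, features: list[float]) -> bool:
        """Send the first frame, then only frames whose features moved
        more than `threshold` away from the last transmitted frame."""
        if (self.last_sent is None
                or feature_distance(features, self.last_sent) > self.threshold):
            self.last_sent = features
            return True
        return False
```

Comparing against the last *transmitted* frame (rather than the immediately previous one) prevents slow drift from silently accumulating below the threshold.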


High-dimensional, multiscale online changepoint detection

Chen, Yudong, Wang, Tengyao, Samworth, Richard J.

arXiv.org Machine Learning

Modern technology has not only allowed the collection of data sets of unprecedented size, but has also facilitated the real-time monitoring of many types of evolving processes of interest. Wearable health devices, astronomical survey telescopes, self-driving cars and transport network load-tracking systems are just a few examples of new technologies that collect large quantities of streaming data, and that provide new challenges and opportunities for statisticians. Very often, a key feature of interest in the monitoring of a data stream is a changepoint; that is, a moment in time at which the data generating mechanism undergoes a change. Such times often represent events of interest, e.g. a change in heart function, and moreover, the accurate identification of changepoints often facilitates the decomposition of a data stream into stationary segments. Historically, it has tended to be univariate time series that have been monitored and studied, within the well-established field of statistical process control.
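For readers unfamiliar with the online setting the abstract describes, the classical univariate baseline is a sequential statistic such as CUSUM, which processes one observation at a time and raises an alarm when accumulated evidence of a mean shift crosses a threshold. The following is the standard textbook one-sided CUSUM for a positive mean shift, offered only as background; it is not the paper's high-dimensional multiscale method:

```python
# Classical one-sided CUSUM detector for an increase in the mean of a
# univariate stream: S_t = max(0, S_{t-1} + (x_t - mu0 - drift)), and a
# change is declared the first time S_t exceeds the threshold.

def cusum(stream, mu0=0.0, drift=0.5, threshold=5.0):
    """Return the index at which a mean increase is declared, or None.

    mu0: pre-change mean; drift: slack that keeps the statistic near
    zero under the null; threshold: alarm level trading off detection
    delay against false alarms.
    """
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mu0 - drift))
        if s > threshold:
            return t
    return None
```

The high-dimensional problem the paper addresses is harder precisely because one must aggregate such evidence across many coordinates and across multiple scales of change magnitude.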


Turn-Taking Based on Information Flow for Fluent Human-Robot Interaction

Thomaz, Andrea L. (Georgia Institute of Technology) | Chao, Crystal (Georgia Institute of Technology)

AI Magazine

Turn-taking is a fundamental part of human communication. Our goal is to devise a turn-taking framework for human-robot interaction that, like the human skill, represents something fundamental about interaction, independent of context or domain. We propose a model of turn-taking, and conduct an experiment with human subjects to inform this model. Our findings from this study suggest that information flow is an integral part of human floor-passing behavior. Following this, we implement autonomous floor relinquishing on a robot and discuss our insights into the nature of a general turn-taking model for human-robot interaction.
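A toy version of information-flow-driven floor relinquishing might look like the rule below: the robot yields the floor once it has no new information to convey, or when the human signals intent to speak. This is entirely my illustrative reading of the abstract, not the paper's model:

```python
# Hypothetical floor-management rule: the robot holds the floor while it
# still has information to deliver, and relinquishes it when its queue
# is empty or the human wants to speak.

class FloorManager:
    ROBOT, HUMAN = "robot", "human"

    def __init__(self, pending_utterances):
        self.queue = list(pending_utterances)
        self.holder = self.ROBOT if self.queue else self.HUMAN

    def step(self, human_wants_floor: bool = False) -> str:
        """Advance one turn; return who holds the floor afterwards."""
        if self.holder == self.ROBOT:
            if self.queue:
                self.queue.pop(0)          # deliver one unit of information
            if not self.queue or human_wants_floor:
                self.holder = self.HUMAN   # no new info, or barge-in: yield
        elif not human_wants_floor:
            # Human is done speaking; reclaim only if there is something to say.
            self.holder = self.ROBOT if self.queue else self.HUMAN
        return self.holder
```

The point of the study's finding is that floor passing tracks whether information is still flowing, which is what the empty-queue condition stands in for here.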