AITopics | inference server

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Verifying LLM Inference to Detect Model Weight Exfiltration

Rinberg, Roy, Karvonen, Adam, Hoover, Alexander, Reuter, Daniel, Warr, Keri

arXiv.org Artificial Intelligence, Dec-12-2025

As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with a false-positive rate of 0.01%, corresponding to a >200x slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
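
The ">200x slowdown" quoted in this abstract appears to follow from the "<0.5%" bound by simple arithmetic: if at most a fraction f of the output can carry hidden information, leaking a fixed payload needs at least 1/f times as much output. The sketch below only illustrates that reading; the function names and the example payload and traffic figures are hypothetical and do not come from the paper.

```python
def slowdown_factor(exfiltratable_fraction: float) -> float:
    """Factor by which exfiltration slows down when only this fraction
    of the output can carry hidden information (assumed interpretation)."""
    if not 0.0 < exfiltratable_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / exfiltratable_fraction


def days_to_exfiltrate(payload_gb: float, output_gb_per_day: float,
                       exfiltratable_fraction: float) -> float:
    """Days needed to leak payload_gb, given the server's total daily output
    and the fraction of that output usable as a covert channel."""
    return payload_gb / (output_gb_per_day * exfiltratable_fraction)


if __name__ == "__main__":
    # 0.5% usable output corresponds to roughly a 200x slowdown.
    print(slowdown_factor(0.005))
    # Hypothetical numbers: a 60 GB checkpoint and 1 GB of output per day.
    print(days_to_exfiltrate(60.0, 1.0, 0.005))
```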

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.0262

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.45)

Industry:

Information Technology > Security & Privacy (1.00)
Commercial Services & Supplies > Security & Alarm Services (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Experience Deploying Containerized GenAI Services at an HPC Center

Beltre, Angel M., Ogden, Jeff, Pedretti, Kevin

arXiv.org Artificial Intelligence, Sep-30-2025

Generative Artificial Intelligence (GenAI) applications are built from specialized components -- inference servers, object storage, vector and graph databases, and user interfaces -- interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.
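
To illustrate what "interconnected via web-based APIs" can look like for the vLLM case study, here is a minimal client sketch. It assumes the inference server exposes vLLM's OpenAI-compatible `/v1/completions` endpoint; the base URL and model name in the example are placeholders, not values from the paper.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-compatible
    completions endpoint such as the one vLLM can serve."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first generated completion text."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Placeholder endpoint and model name; nothing is sent over the network here.
    req = build_completion_request("http://localhost:8000", "my-llama-model", "Hello")
    print(req.full_url)
```

Because the client only depends on the HTTP endpoint, the same code would work whether the server container runs under Kubernetes or on an HPC node, as long as the URL is reachable.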

large language model, machine learning, platform, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3731599.3767356

2509.20603

Country: North America > United States > New Mexico (0.14)

Genre:

Workflow (0.95)
Research Report (0.64)

Industry:

Information Technology > Services (0.93)
Government > Regional Government > North America Government > United States Government (0.68)
Energy (0.68)

Technology:

Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)

PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers

Yeo, Gwangoo, Kim, Jiin, Choi, Yujeong, Rhu, Minsoo

arXiv.org Artificial Intelligence, Nov-28-2024

NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference causes significant performance bottlenecks to MIG. To this end, we present PREBA, which is a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG with domain-specific acceleration of data preprocessing. The MIG inference server unleashed from preprocessing overheads is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in throughput, 3.4x reduction in tail latency, 3.5x improvement in energy-efficiency, and 3.0x improvement in cost-efficiency.
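
The abstract mentions a dynamic batching system without giving details. The sketch below shows the generic idea of dynamic batching only (dispatch a batch when it is full or when its oldest request has waited long enough); it is not PREBA's design, and the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def dynamic_batches(arrivals_ms: List[float], max_batch: int,
                    max_wait_ms: float) -> List[Tuple[float, List[int]]]:
    """Group requests into batches from their sorted arrival times.

    A batch is dispatched as soon as it holds max_batch requests, or once
    max_wait_ms has elapsed since its first request arrived.
    Returns a list of (dispatch_time_ms, request_ids).
    """
    batches: List[Tuple[float, List[int]]] = []
    current: List[int] = []
    deadline = 0.0
    for req_id, t in enumerate(arrivals_ms):
        # The open batch timed out before this request arrived: flush it.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        # The first request of a new batch starts the wait timer.
        if not current:
            deadline = t + max_wait_ms
        current.append(req_id)
        # A full batch is dispatched immediately.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


if __name__ == "__main__":
    print(dynamic_batches([0.0, 1.0, 2.0, 10.0, 30.0], max_batch=3, max_wait_ms=5.0))
```

A larger `max_wait_ms` tends to raise throughput by filling batches, at the cost of added queueing latency for the first request in each batch.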

inference server, proceedings, throughput, (13 more...)

arXiv.org Artificial Intelligence

2411.19114

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.92)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Software (0.93)

Green My LLM: Studying the key factors affecting the energy consumption of code assistants

Coignion, Tristan, Quinton, Clément, Rouvoy, Romain

arXiv.org Artificial Intelligence, Nov-7-2024

In recent years, Large Language Models (LLMs) have significantly improved in generating high-quality code, enabling their integration into developers' Integrated Development Environments (IDEs) as code assistants. These assistants, such as GitHub Copilot, deliver real-time code suggestions and can greatly enhance developers' productivity. However, the environmental impact of these tools, in particular their energy consumption, remains a key concern. This paper investigates the energy consumption of LLM-based code assistants by simulating developer interactions with GitHub Copilot and analyzing various configuration factors. We collected a dataset of development traces from 20 developers and conducted extensive software project development simulations to measure energy usage under different scenarios. Our findings reveal that the energy consumption and performance of code assistants are influenced by various factors, such as the number of concurrent developers, model size, quantization methods, and the use of streaming. Notably, a substantial portion of generation requests made by GitHub Copilot is either canceled or rejected by developers, indicating a potential area for reducing wasted computations. Based on these findings, we share actionable insights into optimizing configurations for different use cases, demonstrating that careful adjustments can lead to significant energy savings.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2411.11892

Country:

North America > Canada > Ontario > Toronto (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre:

Research Report > New Finding (0.89)
Research Report > Experimental Study (0.68)

Industry: Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Collaborative Inference over Wireless Channels with Feature Differential Privacy

Seif, Mohamed, Nie, Yuqi, Goldsmith, Andrea J., Poor, H. Vincent

arXiv.org Artificial Intelligence, Oct-25-2024

Collaborative inference among multiple wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for sensing and computer vision. This approach typically involves a three-stage process: a) data acquisition through sensing, b) feature extraction, and c) feature encoding for transmission. However, transmitting the extracted features poses a significant privacy risk, as sensitive personal data can be exposed during the process. To address this challenge, we propose a novel privacy-preserving collaborative inference mechanism, wherein each edge device in the network secures the privacy of extracted features before transmitting them to a central server for inference. Our approach is designed to achieve two primary objectives: 1) reducing communication overhead and 2) ensuring strict privacy guarantees during feature transmission, while maintaining effective inference performance. Additionally, we introduce an over-the-air pooling scheme specifically designed for classification tasks, which provides formal guarantees on the privacy of transmitted features and establishes a lower bound on classification accuracy.
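
As a rough illustration of "securing the privacy of extracted features before transmitting them", the sketch below applies the standard Gaussian mechanism to a clipped feature vector on the device side. This is a textbook differential-privacy construction, not the paper's mechanism or its over-the-air pooling scheme; all names and parameter values are hypothetical.

```python
import math
import random
from typing import List


def clip_l2(features: List[float], max_norm: float) -> List[float]:
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm <= max_norm:
        return list(features)
    scale = max_norm / norm
    return [x * scale for x in features]


def gaussian_sigma(max_norm: float, epsilon: float, delta: float) -> float:
    """Noise scale of the classical Gaussian mechanism (valid for 0 < epsilon < 1).

    Assumption: replacing one clipped feature vector by another changes the
    output by at most 2 * max_norm in L2 norm, so that is the sensitivity.
    """
    sensitivity = 2.0 * max_norm
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(features: List[float], max_norm: float, epsilon: float,
              delta: float, rng: random.Random) -> List[float]:
    """Clip the features, then add independent Gaussian noise to each entry."""
    sigma = gaussian_sigma(max_norm, epsilon, delta)
    return [x + rng.gauss(0.0, sigma) for x in clip_l2(features, max_norm)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(privatize([3.0, 4.0], max_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng))
```

Smaller `epsilon` means more noise and stronger privacy, which is the trade-off against inference accuracy that the abstract refers to.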

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.19917

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models

Choi, Yujeong, Kim, Jiin, Rhu, Minsoo

arXiv.org Artificial Intelligence, Jun-11-2024

With the increasing popularity of recommendation systems (RecSys), the demand for compute resources in datacenters has surged. However, the model-wise resource allocation employed in current RecSys model serving architectures falls short in effectively utilizing resources, leading to sub-optimal total cost of ownership. We propose ElasticRec, a model serving architecture for RecSys providing resource elasticity and high memory efficiency. ElasticRec is based on a microservice-based software architecture for fine-grained resource allocation, tailored to the heterogeneous resource demands of RecSys. Additionally, ElasticRec achieves high memory efficiency via our utility-based resource allocation. Overall, ElasticRec achieves an average 3.3x reduction in memory allocation size and 8.1x increase in memory utility, resulting in an average 1.6x reduction in deployment cost compared to state-of-the-art RecSys inference serving systems.

elasticrec, memory consumption, shard, (12 more...)

arXiv.org Artificial Intelligence

2406.06955

Genre: Research Report (0.64)

Industry: Information Technology > Services (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Communications (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Institutional Platform for Secure Self-Service Large Language Model Exploration

Bumgardner, V. K. Cody, Klusty, Mitchell A., Logan, W. Vaiden, Armstrong, Samuel E., Hickey, Caylin, Talbert, Jeff

arXiv.org Artificial Intelligence, Feb-1-2024

This paper introduces a user-friendly platform developed by the University of Kentucky Center for Applied AI, designed to make large, customized language models (LLMs) more accessible. By capitalizing on recent advancements in multi-LoRA inference, the system efficiently accommodates custom adapters for a diverse range of users and projects. The paper outlines the system's architecture and key features, encompassing dataset curation, model training, secure inference, and text-based feature extraction. We illustrate the establishment of a tenant-aware computational network using agent-based methods, securely utilizing islands of isolated resources as a unified system. The platform strives to deliver secure LLM services, emphasizing process and data isolation, end-to-end encryption, and role-based resource authentication. This contribution aligns with the overarching goal of enabling simplified access to cutting-edge AI models and technology in support of scientific discovery.

adapter, arxiv preprint arxiv, interface, (14 more...)

arXiv.org Artificial Intelligence

2402.00913

Country: North America > United States > Kentucky > Fayette County > Lexington (0.15)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

Yao, Jinghan, Alnaasan, Nawras, Chen, Tian, Shafi, Aamir, Subramoni, Hari, Panda, Dhabaleswar K.

arXiv.org Artificial Intelligence, Nov-2-2023

Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens. This inherent characteristic severely impedes computational efficiency during inference, as a typical inference request can require thousands of tokens, and generating each token requires loading the entire model weights, making inference memory-bound. The large overhead becomes profound in real deployment where requests arrive randomly, necessitating various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel. We deconstruct the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler for fusing the generation process temporally across all requests. By orchestrating the token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to distributed scenarios, thereby offering robust performance optimization that adapts to variable use cases.
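
The idea of fusing token generation across requests that arrive at different times can be illustrated with a toy step-count simulation. This is a generic sketch of token-level scheduling, not Flover's implementation: it assumes every decoding step costs the same regardless of how many requests share it (plausible only when decoding is memory-bound), and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Request = Tuple[int, int]  # (arrival_step, tokens_to_generate)


def fused_finish_steps(requests: List[Request]) -> Dict[int, int]:
    """Every active request emits one token per shared decoding step;
    new arrivals join the in-flight group at the next step."""
    pending = sorted(enumerate(requests), key=lambda item: item[1][0])
    remaining: Dict[int, int] = {}
    finish: Dict[int, int] = {}
    step, idx = 0, 0
    while idx < len(pending) or remaining:
        # Admit every request that has arrived by this step.
        while idx < len(pending) and pending[idx][1][0] <= step:
            req_id, (_, length) = pending[idx]
            if length < 1:
                raise ValueError("each request must generate at least one token")
            remaining[req_id] = length
            idx += 1
        # One fused pass: each active request advances by one token.
        for req_id in list(remaining):
            remaining[req_id] -= 1
            if remaining[req_id] == 0:
                finish[req_id] = step + 1
                del remaining[req_id]
        step += 1
    return finish


def sequential_finish_steps(requests: List[Request]) -> Dict[int, int]:
    """Baseline: requests run to completion one at a time, in arrival order."""
    finish: Dict[int, int] = {}
    clock = 0
    for req_id, (arrival, length) in sorted(enumerate(requests),
                                            key=lambda item: item[1][0]):
        clock = max(clock, arrival) + length
        finish[req_id] = clock
    return finish


if __name__ == "__main__":
    reqs = [(0, 4), (1, 2), (2, 3)]
    print(fused_finish_steps(reqs))
    print(sequential_finish_steps(reqs))
```

In the example, the last request finishes at step 5 with fused decoding and at step 9 when requests are served one after another; real speedups depend on how step cost grows with the number of fused requests.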

buffer, fastertransformer, inference, (16 more...)

arXiv.org Artificial Intelligence

2305.13484

Country:

North America > United States > Ohio > Franklin County > Columbus (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations

Choi, Yujeong, Kim, John, Rhu, Minsoo

arXiv.org Artificial Intelligence, Feb-22-2023

While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utility is also crucial in cost-effectively maintaining the datacenter. Co-locating multiple workers of a model is an effective way to maximize query-level parallelism and server throughput, but the interference caused by concurrent workers at shared resources can prevent server queries from meeting their SLA. Hera utilizes the heterogeneous memory requirement of multi-tenant recommendation models to intelligently determine a productive set of co-located models and their resource allocation, providing fast response time while achieving high throughput. We show that Hera achieves an average 37.3% improvement in effective machine utilization, enabling a 26% reduction in required servers, significantly improving upon the baseline recommendation inference server.

artificial intelligence, machine learning, social media, (18 more...)

arXiv.org Artificial Intelligence

2302.1175

Genre: Research Report (0.64)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Hardware (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.82)

The Unnerving Sweet Spot for ML-Powered Products

#artificialintelligence, Sep-16-2022, 00:20:41 GMT

This article continues a series of articles tackling the most frightening idea in the world of production ML: putting the damn thing in production. In previous stories, we saw two different approaches to designing a Machine Learning (ML) powered application. First, we examined why you'd want to keep your model within your web server and why you should not do it. Keeping your model side-by-side with your core business logic is a must as you experiment with different ideas and want quick feedback from a trusted circle of testers, but it falls short when deploying in production. The first solution we discussed is a very simple technique that permits us to separate the model from our web server.

application, ml-powered product, unnerving sweet spot, (13 more...)

#artificialintelligence

Country: Europe (0.16)

Industry: Government (0.31)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
