3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

Jung, Minseok, Ricky, Abhas, Chatni, Muhammad Rameez

Nov-18-2025–arXiv.org Artificial Intelligence

AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling~$k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization remains favorable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.

artificial intelligence, inference, optimization problem, (18 more...)

arXiv.org Artificial Intelligence

Nov-18-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (1.00)

Industry:
- Energy (0.46)

Technology:
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found