C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning
Neural Information Processing Systems
Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost.
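To make the idea of cascaded inference concrete, here is a minimal sketch of a generic cascade. It is not the method proposed in the paper. It assumes each model returns an answer together with a self-reported confidence score, and that a query escalates whenever that confidence falls below a fixed threshold. The names `cascade`, `small_model`, and `large_model`, the per-call costs, and the threshold are all hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(
    query: str,
    models: List[Tuple[Model, float]],
    threshold: float,
) -> Tuple[str, float, int]:
    """Run models cheapest-first and stop at the first confident answer.

    `models` is a list of (model, cost_per_call) pairs ordered from cheapest
    to most expensive. Returns (answer, total_cost, index_of_model_used).
    The last model always answers, whatever its confidence.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    total_cost = 0.0
    last = len(models) - 1
    for i, (model, cost) in enumerate(models):
        answer, confidence = model(query)
        total_cost += cost  # every model that is called adds to the bill
        if confidence >= threshold or i == last:
            return answer, total_cost, i
    raise AssertionError("unreachable: the last model always returns")


# Hypothetical stand-ins for real LLM calls.
def small_model(query: str) -> Tuple[str, float]:
    # Toy rule: pretend the small model is only confident on short queries.
    confidence = 0.9 if len(query) < 20 else 0.3
    return "small-model answer", confidence


def large_model(query: str) -> Tuple[str, float]:
    return "large-model answer", 0.95


# Illustrative costs only: the large model is 10x the price of the small one.
MODELS = [(small_model, 1.0), (large_model, 10.0)]

easy = cascade("What is 2 + 2?", MODELS, threshold=0.8)
hard = cascade(
    "Prove that the sum of the first n odd numbers equals n squared.",
    MODELS,
    threshold=0.8,
)
```

In this sketch an escalated query pays for both calls, which is why the total test-time cost depends on how often queries escalate. A fixed threshold like this one gives no guarantee on that cost.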
Jun-12-2026, 22:00:27 GMT