Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration

Nov-18-2025–arXiv.org Artificial Intelligence

Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
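The abstract names three ingredients (typed retrieval, compact fact cards, and explicit citation control) without giving implementation details. The following is a minimal sketch of how such a memory layer could fit together, not ENGRAM-R's actual design. Every name in it is hypothetical: the `FactCard` class, the memory types "episodic" and "semantic", the word-overlap scoring, and the bracketed citation format are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FactCard:
    """Hypothetical compact memory record; not the paper's schema."""
    card_id: str  # short handle the model can cite, e.g. "E1"
    kind: str     # assumed memory type, e.g. "episodic" or "semantic"
    text: str     # one compact fact instead of the full source passage


def _words(s):
    return set(re.findall(r"\w+", s.lower()))


def retrieve(cards, query, kind=None, k=2):
    """Typed retrieval: filter by kind, then rank by naive word overlap.

    Word overlap is a stand-in for whatever retriever a real system uses.
    """
    q = _words(query)
    pool = [c for c in cards if kind is None or c.kind == kind]
    scored = [(len(q & _words(c.text)), c) for c in pool]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda sc: -sc[0])  # stable sort keeps insertion order on ties
    return [c for _, c in scored[:k]]


def render(cards):
    """Render cards as short citable lines to place in the prompt."""
    return "\n".join(f"[{c.card_id}] ({c.kind}) {c.text}" for c in cards)


def citations_valid(answer, cards):
    """Citation control: the answer must cite at least one card,
    and only cards that were actually supplied."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    allowed = {c.card_id for c in cards}
    return bool(cited) and cited <= allowed


# Toy memory, invented for this example.
MEMORY = [
    FactCard("E1", "episodic", "Alice adopted a dog named Biscuit in March."),
    FactCard("E2", "episodic", "Bob moved to Lisbon in June."),
    FactCard("S1", "semantic", "Alice is allergic to cats."),
]

hits = retrieve(MEMORY, "What is the name of the dog Alice adopted?", kind="episodic")
prompt_context = render(hits)
```

In this sketch the model would receive only `prompt_context` (a single short card) rather than the full conversation history, which is the kind of input reduction the abstract describes. An answer would be accepted only if `citations_valid` confirms that it cites a supplied card.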

large language model, machine learning, natural language

Genre:
- Research Report > New Finding (0.48)

Industry:
- Health & Medicine (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning (1.00)
  - Natural Language
    - Large Language Model (0.70)
    - Chatbot (0.68)
