Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li, Zixuan Lan, Jiawei Zhou
arXiv.org Artificial Intelligence
Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images, reducing token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We render long text inputs as a single image and provide it directly to the model, which dramatically reduces the number of decoder tokens required. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
Oct-23-2025
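
The mechanism described in the abstract is simple enough to sketch. Below is a minimal illustration, assuming Pillow for the rendering step; the per-image token budget, the four-characters-per-token heuristic, and the `document.png` filename are illustrative assumptions, not figures or names from the paper:

```python
# Minimal sketch of the text-as-image idea: render a long text input
# into a single image so a multimodal LLM can consume it as pixels
# instead of text tokens. Pillow handles rendering; the token numbers
# below are illustrative assumptions, not results from the paper.
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_as_image(text: str, width_chars: int = 80,
                         font_size: int = 14) -> Image.Image:
    """Wrap `text` and draw it onto a single white image."""
    lines = textwrap.wrap(text, width=width_chars)
    font = ImageFont.load_default()  # swap in a TTF font for higher fidelity
    line_height = font_size + 4
    img = Image.new(
        "RGB",
        (width_chars * font_size // 2 + 20, line_height * len(lines) + 20),
        "white",
    )
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img


if __name__ == "__main__":
    document = "Large language models process long inputs. " * 200  # stand-in
    image = render_text_as_image(document)
    image.save("document.png")  # pass this image to a multimodal LLM

    # Rough comparison under assumed rates: ~4 characters per text token,
    # and a fixed vision-encoder budget per image (assumed 1024 here).
    approx_text_tokens = len(document) // 4
    approx_image_tokens = 1024
    print(f"text tokens ~{approx_text_tokens}, "
          f"image tokens ~{approx_image_tokens}")
```

The saving comes from the design of typical vision encoders: an image costs a roughly fixed number of tokens set by its resolution or tile count, so once the rendered text exceeds that budget, feeding pixels is cheaper than feeding the equivalent text tokens.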