LLMZip: Lossless Text Compression using Large Language Models
Valmeekam, Chandra Shekhara Kaushik, Narayanan, Krishna, Kalathil, Dileep, Chamberland, Jean-Francois, Shakkottai, Srinivas
arXiv.org Artificial Intelligence
We provide new estimates of an asymptotic upper bound on the entropy of English, using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than the estimates currently available in [1], [2]. A natural byproduct is an algorithm for lossless compression of English text, which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h. There are close connections between learning, prediction, and compression.
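As a rough illustration of how a next-token predictor can be combined with a lossless code, the following Python sketch replaces LLaMA-7B with a toy adaptive character model. Everything named here (`ToyPredictor`, `encode`, `decode`, the printable-ASCII alphabet, add-one smoothing, and the use of rank encoding) is an assumption made for illustration and not the authors' implementation. The sketch shows two things: each symbol can be replaced by its rank under the predictor and recovered exactly by a decoder running the same predictor, and the summed value of -log2 p over the text gives the ideal code length, whose per-symbol average is an upper-bound style estimate of entropy for that predictor.

```python
import math
from collections import Counter, defaultdict

# Toy "token" set: printable ASCII. A real system would use the LLM's vocabulary.
ALPHABET = [chr(c) for c in range(32, 127)]


class ToyPredictor:
    """Adaptive order-k character model with add-one smoothing.

    Hypothetical stand-in for an LLM next-token predictor (order must be >= 1).
    """

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)

    def probs(self, context):
        c = self.counts[context[-self.order:]]
        total = sum(c.values()) + len(ALPHABET)
        return {s: (c[s] + 1) / total for s in ALPHABET}

    def update(self, context, symbol):
        self.counts[context[-self.order:]][symbol] += 1


def ranked(model, context):
    """Symbols sorted by descending probability; ties keep ALPHABET order."""
    p = model.probs(context)
    return sorted(ALPHABET, key=lambda s: -p[s]), p


def encode(text, order=2):
    """Return the rank of each symbol and the ideal code length in bits."""
    model, ranks, bits = ToyPredictor(order), [], 0.0
    for i, ch in enumerate(text):
        context = text[:i]
        symbols, p = ranked(model, context)
        ranks.append(symbols.index(ch))   # rank 0 means "most likely symbol"
        bits += -math.log2(p[ch])         # cost under an ideal entropy coder
        model.update(context, ch)
    return ranks, bits


def decode(ranks, order=2):
    """Rebuild the text by replaying the same predictor on the decoded prefix."""
    model, out = ToyPredictor(order), ""
    for r in ranks:
        symbols, _ = ranked(model, out)
        ch = symbols[r]
        model.update(out, ch)
        out += ch
    return out


if __name__ == "__main__":
    sample = "the cat sat on the mat. " * 20
    ranks, bits = encode(sample)
    assert decode(ranks) == sample
    print(f"{bits / len(sample):.2f} bits per character (toy model)")
```

The better the predictor, the more the ranks concentrate near zero and the smaller the ideal code length becomes, which is why a strong language model is attractive in this role. In practice the rank sequence or the probabilities would be passed to a standard lossless coder; that step is omitted here.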
Jun-26-2023