Compressing Multisets with Large Alphabets using Bits-Back Coding
Severo, Daniel, Townsend, James, Khisti, Ashish, Makhzani, Alireza, Ullrich, Karen
arXiv.org Artificial Intelligence
Current methods that compress multisets at an optimal rate have computational complexity that scales linearly with alphabet size, making them too slow to be practical in many real-world settings. We show how to convert a compression algorithm for sequences into one for multisets, in exchange for an additional complexity term that is quasi-linear in sequence length. This allows us to compress multisets of exchangeable symbols at an optimal rate, with computational complexity decoupled from the alphabet size. The key insight is to avoid encoding the multiset directly, and instead compress a proxy sequence, using a technique called 'bits-back coding'. We demonstrate the method experimentally on tasks that are intractable with previous optimal-rate methods: compression of multisets of images and JavaScript Object Notation (JSON) files.

Lossless compression algorithms typically preserve the ordering of symbols in the input sequence. However, for some data types the order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Formally, these may be expressed as a mathematical object known as a multiset: a generalization of a set that allows elements to repeat. Compressing a multiset with an arithmetic coder is possible if we impose some order on its elements and communicate the corresponding ordered sequence. However, unless the order information is removed during encoding, this procedure is sub-optimal: the order itself carries information, so more bits are spent than the multiset alone requires.
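The wasted bits can be quantified: a multiset of size n with element multiplicities m_i has n! / prod(m_i!) distinct orderings, so encoding an arbitrary ordering costs up to log2(n! / prod(m_i!)) extra bits, which is exactly the saving an optimal multiset code (such as the bits-back construction above) can recover. A minimal sketch of this computation, using only the standard library:

```python
import math
from collections import Counter

def order_information_bits(seq):
    """Bits of order information in a sequence, i.e. log2 of the
    number of distinct orderings of its underlying multiset:
    log2( n! / prod(m_i!) ), where m_i are element multiplicities.
    Uses lgamma for numerical stability on large n."""
    n = len(seq)
    bits = math.lgamma(n + 1) / math.log(2)          # log2(n!)
    for m in Counter(seq).values():
        bits -= math.lgamma(m + 1) / math.log(2)     # minus log2(m_i!)
    return bits

# A multiset with all elements equal has a single ordering: 0 bits wasted.
print(order_information_bits([7, 7, 7]))             # ~0.0
# {a, a, b, b} has 4!/(2!*2!) = 6 orderings: log2(6) ~ 2.585 bits.
print(order_information_bits(["a", "a", "b", "b"]))
```

For a dataset of n distinct items (e.g. n distinct images), this saving grows as log2(n!) ≈ n log2(n) bits, which is why removing order information matters at scale.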
Feb-27-2023