FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Della Libera, Luca, Subakan, Cem, Ravanelli, Mirco

arXiv.org Artificial Intelligence

ABSTRACT Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55-0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency.
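The quoted bitrate range follows directly from emitting one discrete code per frame. A minimal sketch of the arithmetic, where the 13-bit codebook width and the frame rates are illustrative assumptions rather than figures from the paper:

```python
# Hedged sketch: how a single binary codebook yields a given bitrate.
# bits_per_token and the frame rates below are illustrative assumptions.

def bitrate_kbps(bits_per_token: float, tokens_per_second: float) -> float:
    """Bitrate of a codec emitting one discrete token per frame."""
    return bits_per_token * tokens_per_second / 1000.0

# e.g. a 13-bit binary codebook (2**13 codes) at assumed frame rates:
for fps in (42.3, 50.0, 61.5):
    print(f"{fps:5.1f} tokens/s -> {bitrate_kbps(13, fps):.2f} kbps")
```

Under these assumptions, frame rates between roughly 42 and 62 tokens/s span the 0.55-0.80 kbps range reported above.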


LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

Xie, Xingyu, Lin, Zhijie, Toh, Kim-Chuan, Zhou, Pan

arXiv.org Artificial Intelligence

Abstract--To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, the information loss introduced by compression can degrade training quality. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo designs a moving average of historical compensation errors to stably estimate the concurrent compression error and then adopts it to compensate for the concurrent gradient compression, yielding a less lossy compression. This mechanism allows it to be compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoE.
This progress is largely attributed to the advent of large-scale models, like the GPT and LLAMA series [1], [5]-[7], characterized by their billions of parameters and trillions of training tokens. This trend of large-scale models has expanded into various other fields, including finance [8], law [9], and medicine [10]. Despite their successes, these large-scale models necessitate extensive GPUs for parallel training, employing strategies like data parallelism [11]. To address the challenge of communication efficiency in large-scale model training, error-feedback compression (EFC) [17], [18] has been developed to compensate for communication variables before compression, ensuring small compression errors. This technique has been utilized in gradient compression to create communication-efficient low-bit optimizers, such as 1-bit Adam [14] and 1-bit LAMB [19]. However, these low-bit optimizers face several key challenges.
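The compensation mechanism described above can be sketched in a few lines: before low-bit compression, the local gradient is adjusted by a moving average of past compression errors. This is a hedged illustration of the idea, not the paper's implementation; the 1-bit quantizer, the function names, and the decay factor are our own assumptions:

```python
import numpy as np

def quantize_1bit(g):
    """Toy 1-bit compressor: keep only signs, scaled by the mean magnitude."""
    return np.sign(g) * np.mean(np.abs(g))

def loco_step(grad, error_ema, beta=0.9):
    """One communication step with error compensation (illustrative sketch).

    grad      : local full-precision gradient
    error_ema : moving average of historical compensation errors
    Returns the low-bit message to synchronize and the updated error EMA.
    """
    compensated = grad + error_ema           # compensate before compression
    compressed = quantize_1bit(compensated)  # low-bit message sent to peers
    error = compensated - compressed         # concurrent compression error
    error_ema = beta * error_ema + (1 - beta) * error  # stable error estimate
    return compressed, error_ema
```

Averaging the error over history, rather than feeding back the last step's error directly, is what the abstract credits for a stable estimate of the concurrent compression error.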


From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMs

Liguori, Vincenzo

arXiv.org Artificial Intelligence

This paper attempts to address and reconcile two different issues: the existence of multiple numerical data formats (such as int8, bfloat16, fp8, etc., often non-optimal for the application and not directly compatible with one another) and the necessity to reduce their bandwidth requirements, especially in the case of power-hungry and slow DRAM. In other words, we would like to be able to support multiple numerical data formats and use a minimal number of bits to represent them while, at the same time, not being penalised by the outliers and forced to use a worst-case number of bits to represent them all. This is particularly important for LLMs, which have a huge number of weights that can come in a variety of formats. This is also true, to a lesser extent, for CNNs. Activations are also likely to benefit from such an approach.
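A quick way to see where lossless gains of this order come from is to note that trained weights cluster in magnitude, so the exponent side of their float encodings is highly redundant while the mantissa side is nearly random. The sketch below is our own illustration of that asymmetry, not the paper's algorithm, using synthetic Gaussian weights and a generic entropy coder (zlib):

```python
import zlib
import numpy as np

# Hedged illustration: split float32 weights into an exponent-side byte
# stream and a mantissa-side byte stream, then entropy-code each with zlib.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1 << 16).astype(np.float32)
raw = w.view(np.uint32)                 # reinterpret the bit patterns

high = (raw >> 24).astype(np.uint8)     # sign + high exponent bits
low = (raw & 0xFF).astype(np.uint8)     # lowest mantissa bits

def ratio(stream):
    """Lossless compression ratio of a byte stream under zlib level 9."""
    return len(stream) / len(zlib.compress(bytes(stream), 9))

print(f"exponent-side bytes: {ratio(high):.2f}:1")
print(f"mantissa-side bytes: {ratio(low):.2f}:1")
```

Because the Gaussian weights occupy only a narrow band of exponents, the exponent-side stream compresses well, while the mantissa-side stream stays close to 1:1; mixing the two streams, as a naive whole-tensor pass would, dilutes the achievable ratio.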