Deep Generative Models for Distribution-Preserving Lossy Compression
We propose and study the problem of distribution-preserving lossy compression. Motivated by recent advances in extreme image compression that make it possible to maintain artifact-free reconstructions even at very low bitrates, we propose to optimize the rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. The resulting compression system recovers both ends of the spectrum: at zero bitrate it learns a generative model of the data, while at sufficiently high bitrates it achieves perfect reconstruction.
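The constrained objective above can be relaxed into a Lagrangian: rate plus distortion plus a divergence penalty between the data distribution and the reconstruction distribution. A minimal NumPy sketch, using a squared MMD with an RBF kernel as one possible divergence proxy (the kernel choice, weights, and function names are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared MMD with an RBF kernel between two sample sets.

    One practical, differentiable proxy for the divergence between the
    data and reconstruction distributions (an illustrative choice; the
    distribution constraint can be enforced with other divergences).
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def dplc_objective(rate, distortion, data, recon, lam=1.0, beta=1.0):
    # Rate-distortion Lagrangian plus a distribution-matching penalty:
    # as beta grows, reconstructions are forced to follow the data law.
    return rate + lam * distortion + beta * rbf_mmd2(data, recon)

rng = np.random.default_rng(0)
data = rng.standard_normal((32, 4))
loss = dplc_objective(rate=1.0, distortion=0.5, data=data, recon=data)
```

When the reconstructions exactly match the data samples, the MMD penalty vanishes and the objective reduces to the usual rate-distortion Lagrangian.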
High-Fidelity Generative Image Compression
We extensively study how to combine Generative Adversarial Networks and learned compression to obtain a state-of-the-art generative lossy compression system. In particular, we investigate normalization layers, generator and discriminator architectures, training strategies, as well as perceptual losses. In contrast to previous work, i) we obtain visually pleasing reconstructions that are perceptually similar to the input, ii) we operate in a broad range of bitrates, and iii) our approach can be applied to high-resolution images.
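Generative lossy compression systems of this kind typically train the encoder/generator on a weighted sum of rate, distortion, perceptual, and adversarial terms. A minimal sketch of such a combined objective with a non-saturating GAN term (the names and coefficients are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably via logaddexp
    return np.logaddexp(0.0, x)

def generator_objective(rate, distortion, perceptual, d_fake_logits,
                        lam_rate=1.0, k_d=10.0, k_p=1.0, beta=0.1):
    """Combined loss of a GAN-based learned codec (illustrative weights).

    rate: estimated bits per pixel; distortion: e.g. MSE;
    perceptual: e.g. an LPIPS score; d_fake_logits: discriminator
    logits on the reconstructions (non-saturating generator loss).
    """
    gan = softplus(-np.asarray(d_fake_logits, dtype=float)).mean()
    return lam_rate * rate + k_d * distortion + k_p * perceptual + beta * gan

loss = generator_objective(rate=0.3, distortion=0.01, perceptual=0.2,
                           d_fake_logits=[1.5, -0.2, 0.7])
```

Raising any individual term's weight trades it off against the others; in particular, a larger rate weight pushes the system toward lower bitrates at the cost of fidelity.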
Vector Quantization using Gaussian Variational Autoencoder
Xu, Tongda, Zheng, Wendi, He, Jiajun, Hernandez-Lobato, Jose Miguel, Wang, Yan, Zhang, Ya-Qin, Tang, Jie
Vector-quantized variational autoencoder (VQ-VAE) is a discrete autoencoder that compresses images into discrete tokens. It is difficult to train due to discretization. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian VAE with a certain constraint into a VQ-VAE without training. GQ generates random Gaussian noise as a codebook and finds the closest noise to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train a Gaussian VAE for effective GQ, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian VAE discretization methods, such as TokenBridge. The vector-quantized variational autoencoder (Van Den Oord et al., 2017) is an autoencoder that compresses images into discrete tokens. It is fundamental to autoregressive generative models (Esser et al., 2021; Chang et al., 2022; Yu et al., 2023; Sun et al., 2024b). However, VQ-VAE is difficult to train: its encoding process is not differentiable, and challenges such as codebook collapse often emerge (Sønderby et al., 2017).
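The Gaussian Quant step described above (sample a random Gaussian codebook, then snap each posterior mean to its nearest codeword) can be sketched in a few lines of NumPy. The shapes, the Euclidean nearest-neighbor rule, and the shared seed standing in for a shared codebook are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def gaussian_quant(posterior_means, codebook_size, dim, seed=0):
    """Quantize Gaussian-VAE posterior means with a random Gaussian codebook.

    posterior_means: (n, dim) array of encoder posterior means.
    Returns (tokens, recon): discrete indices and the codewords they
    map to. Encoder and decoder share the codebook via the seed.
    Illustrative sketch, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    codebook = rng.standard_normal((codebook_size, dim))
    # Pairwise squared Euclidean distances, shape (n, codebook_size)
    d2 = ((posterior_means[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)          # index of the closest codeword
    return tokens, codebook[tokens]

means = np.random.default_rng(1).standard_normal((4, 8))
tokens, recon = gaussian_quant(means, codebook_size=1024, dim=8)
```

No training is involved: the codebook is pure noise, and the theoretical result quoted above says the quantization error stays small once log(codebook size) exceeds the bits-back coding rate.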
Low-Bitrate Video Compression through Semantic-Conditioned Diffusion
Wang, Lingdong, Su, Guan-Ming, Kothandaraman, Divya, Huang, Tsung-Wei, Hajiesmaili, Mohammad, Sitaraman, Ramesh K.
Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.
RAVQ-HoloNet: Rate-Adaptive Vector-Quantized Hologram Compression
Rafiei, Shima, Nabizadeh Shahr Babak, Zahra, Samavi, Shadrokh, Shirani, Shahram
Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In the low-bitrate regime, our method achieves a BD-Rate of -33.91% (a 33.91% bitrate saving) and a BD-PSNR gain of 1.02 dB over the best existing method, as demonstrated by the rate-distortion curves.
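The BD-Rate and BD-PSNR figures quoted above are Bjøntegaard deltas computed from two rate-distortion curves. A sketch of the standard BD-Rate computation (cubic fit of log-rate against PSNR, integrated over the overlapping quality range; variable names are ours):

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta-rate: average % bitrate change at equal quality.

    Fits cubic polynomials of log-rate as a function of PSNR for both
    codecs and compares their average values over the overlapping PSNR
    range. Negative result => the test codec saves bitrate.
    """
    lr_a = np.log(np.asarray(rates_anchor, dtype=float))
    lr_t = np.log(np.asarray(rates_test, dtype=float))
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Toy curves: the test codec needs half the rate at every PSNR point.
r = [100.0, 200.0, 400.0, 800.0]
p = [30.0, 33.0, 36.0, 39.0]
saving = bd_rate(r, p, [x / 2 for x in r], p)
```

BD-PSNR is computed analogously by swapping the roles of the axes: fit PSNR as a function of log-rate and average the vertical gap.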
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
Wu, Jiangkai, Ren, Zhiyuan, Liu, Liming, Zhang, Xinggong
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
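The core idea of Context-Aware Video Streaming — spend almost the entire bitrate budget on chat-important regions — can be illustrated with a toy allocator. The importance scores, the per-region floor, and the proportional rule are assumptions for illustration, not the paper's scheme:

```python
def allocate_bitrate(region_importance, total_kbps, floor_kbps=1.0):
    """Split a bitrate budget across video regions by importance.

    Every region gets a small floor rate so it stays decodable; the
    remaining budget is divided among regions in proportion to their
    importance scores. Assumes total_kbps >= floor_kbps * len(regions).
    Toy sketch of region-weighted allocation, not the paper's code.
    """
    n = len(region_importance)
    budget = total_kbps - floor_kbps * n       # what is left after floors
    total_w = sum(region_importance)
    return [floor_kbps + budget * w / total_w for w in region_importance]

# One chat-important region (e.g. the speaker's face) and two background tiles
rates = allocate_bitrate([0.9, 0.05, 0.05], total_kbps=100)
```

With these scores the first region receives the vast majority of the budget, which matches the paper's observation that MLLM accuracy depends almost entirely on the chat-important regions.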