Differentiable Entropy Regularization: A Complexity-Aware Approach for Neural Optimization

Shihab, Ibne Farabi, Akter, Sanjeda, Sharma, Anuj

arXiv.org Artificial Intelligence 

We introduce the first differentiable approximation of range-partition entropy, a complexity measure from computational geometry that directly bounds algorithmic runtime. Unlike architectural modifications, our method is a complementary regularizer that provides orthogonal efficiency gains when combined with existing optimizations. We establish theoretical guarantees in computational geometry, achieving 4-5× provable speedups on convex hull and triangulation with <0.2% error. On ImageNet-1K with ViT-Base, entropy regularization achieves 80.1% top-1 accuracy at 80% sparsity (1.60× standalone speedup), and when combined with FlashAttention yields a 2.07× speedup versus 1.63× for FlashAttention alone. On large language models (LLaMA-2 7B, Mistral-7B, Phi-2), we achieve 1.48-1.60× speedups. Unlike prior regularization methods that target output distributions, we directly minimize representation complexity, yielding both efficiency gains and improved robustness through semantically structured sparsity patterns (IoU 0.73 vs. 0.41 for magnitude pruning; CIFAR-100-C mCE 48.7 vs. 55.4). Benefits are strongest for geometry and vision transformers, with more modest but measurable gains on LLMs, demonstrating that complexity regularization offers a principled pathway to joint efficiency-robustness optimization.

Modern deep networks achieve impressive performance but face two critical challenges: they are fragile under distribution shift (Hendrycks & Dietterich, 2019), and they incur prohibitive computational costs (Strubell et al., 2019). The standard approach treats these problems independently, addressing robustness through data augmentation and efficiency through architectural changes. We ask: can a single principle address both? Our strongest results are in computational geometry and vision transformers; for large language models the overhead-benefit tradeoff is more nuanced, with improvements primarily in high-throughput deployment settings rather than research-scale fine-tuning.
The key insight comes from computational geometry. Algorithmically simple representations, those with low complexity under geometric partitions, both enable faster algorithms via instance-optimal procedures (Chan, 1996) and generalize better by avoiding spurious features (Geirhos et al., 2020). However, no existing method provides a differentiable measure of algorithmic complexity that can be optimized end-to-end during neural network training.

[Table comparing prior approaches, including label smoothing, adversarial training, the information bottleneck, fixed sparse attention (Longformer, BigBird), and kernel optimization, across robustness and differentiability criteria; column headers not recovered from the extraction.]

Our goal is to connect representation learning to algorithmic complexity.
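To make the "entropy as a differentiable loss term" idea concrete, the following is a minimal NumPy sketch of an entropy penalty on a layer's activations. This is a hypothetical surrogate for illustration only: it uses plain Shannon entropy over normalized activation magnitudes, not the paper's range-partition construction, and in practice the term would be implemented in an autodiff framework (e.g., as a PyTorch loss) so gradients flow end-to-end. The names `entropy_penalty` and `lam` are assumptions, not the authors' API.

```python
import numpy as np

def entropy_penalty(activations, eps=1e-12):
    """Shannon entropy of the normalized activation magnitudes.

    A hypothetical surrogate: low entropy means the activation mass is
    concentrated on a few units, i.e. a "simpler" (sparser) representation.
    The paper's actual regularizer approximates range-partition entropy;
    this sketch only illustrates the entropy-as-loss-term pattern.
    """
    mags = np.abs(activations)
    p = mags / (mags.sum() + eps)          # normalize to a distribution
    return float(-np.sum(p * np.log(p + eps)))

# A concentrated (sparse) vector scores lower than a uniform one, so
# adding this term to the training loss pushes representations toward
# structured sparsity.
sparse = np.array([10.0, 0.1, 0.1, 0.1])
uniform = np.array([1.0, 1.0, 1.0, 1.0])
assert entropy_penalty(sparse) < entropy_penalty(uniform)

# In a training loop (framework-agnostic pseudocode):
#   total_loss = task_loss + lam * entropy_penalty(hidden_activations)
```

The design choice here mirrors the paper's framing: rather than penalizing the *output* distribution (as label smoothing does), the penalty is applied directly to intermediate representations, so minimizing it reduces representation complexity.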