Reviews: Bayesian Compression for Deep Learning

Neural Information Processing Systems 

This paper approaches model compression using a group sparsity prior, allowing entire columns, rather than just individual weights, to be dropped out. The authors also use the variance of the posterior distribution over weights to automatically set the precision for fixed-point weight quantization. The underlying ideas seem good, and the experimental results seem promising. However, the paper wraps the core idea in a great deal of mathematical complexity. The math is presented in a way that I often found confusing, and in several places it seems either wrong or poorly motivated (e.g., KL divergences that come out negative, even though a KL divergence is non-negative by definition; equations whose left and right sides are not equal; and the primary motivation for model compression being given in terms of minimum description length).
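
To make concrete what I understood the two mechanisms to be, here is a minimal NumPy sketch: per-column pruning based on a signal-to-noise criterion, and a fixed-point bit width chosen so the quantization step is on the order of the smallest posterior standard deviation. The function names, the SNR threshold, and the exact formulas below are my own illustrative assumptions, not the paper's procedure.

    import numpy as np

    def prune_columns(w_mean, w_std, thresh=-3.0):
        """Drop entire columns whose mean log signal-to-noise ratio
        falls below a threshold -- the group-sparsity idea, with an
        illustrative (assumed) SNR criterion and threshold."""
        snr = np.log(np.abs(w_mean) / (w_std + 1e-8))
        keep = snr.mean(axis=0) > thresh   # one keep/drop decision per column
        return w_mean[:, keep], w_std[:, keep]

    def bits_for_layer(w_mean, w_std, eps=1e-8):
        """Choose a fixed-point bit width so the quantization step is
        no larger than the smallest posterior std in the layer
        (an assumed rule of thumb, not the paper's exact formula)."""
        dyn_range = np.max(np.abs(w_mean))  # magnitude to represent
        step = np.min(w_std) + eps          # tolerable rounding error
        return int(np.ceil(np.log2(dyn_range / step))) + 1  # +1 sign bit

If this is roughly the intended procedure, stating it this plainly in the paper would go a long way toward separating the simple core idea from the surrounding derivations.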