An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models

Neural Information Processing Systems 

Deep neural networks have long been criticized for being black boxes.