Review for NeurIPS paper: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Neural Information Processing Systems 

Weaknesses: It was not clear to me how the grouped 1x1 convolutions interact with the coupling layers. If the standard (half-and-half) partitioning is used for the coupling layers and the grouped 1x1 convolutions never mix channels outside of their group of 4, then half of the channels will never be transformed by any coupling layer. I assume the authors deal with this issue somehow (since the results are good), but I only briefly scanned the code and did not work through all of the index gymnastics. Readers could be confused by these missing details. Update: In their response, the authors said they will explain the grouped 1x1 convolutions in more detail in the revised version.
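
To make the concern concrete, here is a minimal sketch that tracks only which channel positions have ever received a contribution from a coupling transform. It is not taken from the paper or its code. It assumes that each flow block applies a grouped 1x1 convolution followed by a coupling layer, that the coupling layer leaves the first half of the channels unchanged and transforms the second half, and that groups have 4 channels. The helper names (`contiguous_groups`, `straddling_groups`, `untouched_channels`) are hypothetical, and the "straddling" layout is just one possible remedy, not necessarily what the authors do.

```python
import numpy as np


def contiguous_groups(num_channels, group_size=4):
    """Groups made of adjacent channels: [0..3], [4..7], ... (assumed layout)."""
    return [list(range(s, s + group_size))
            for s in range(0, num_channels, group_size)]


def straddling_groups(num_channels, group_size=4):
    """Hypothetical remedy: each group takes half of its channels from each
    half of the coupling partition."""
    half = num_channels // 2
    k = group_size // 2
    return [list(range(s, s + k)) + list(range(half + s, half + s + k))
            for s in range(0, half, k)]


def untouched_channels(num_channels, groups, num_blocks):
    """Return channel positions that never carry output of a coupling transform."""
    touched = np.zeros(num_channels, dtype=bool)
    half = num_channels // 2
    for _ in range(num_blocks):
        # Grouped 1x1 convolution: every output channel in a group is a
        # linear mix of all input channels in that same group.
        for g in groups:
            touched[g] = touched[g].any()
        # Coupling layer (assumed): first half passes through unchanged,
        # second half is transformed.
        touched[half:] = True
    return np.flatnonzero(~touched).tolist()


if __name__ == "__main__":
    c, blocks = 8, 12
    print(untouched_channels(c, contiguous_groups(c), blocks))
    print(untouched_channels(c, straddling_groups(c), blocks))
```

Under these assumptions, contiguous groups never cross the coupling boundary, so the first half of the channels stays untransformed regardless of depth, whereas groups that straddle the boundary reach every channel after two blocks.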