Cascaded Text Generation with Markov Transformers

Neural Information Processing Systems

The two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies. This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output. To parameterize this cascade, we introduce a Markov transformer, a variant of the popular fully autoregressive model that allows us to simultaneously decode with specific autoregressive context cutoffs. This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
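The cascade described above can be illustrated with a toy sketch: per-position candidate sets are first pruned with context-free (unigram) scores, then rescored in parallel with bounded-context (here, bigram) scores, mirroring how the paper's CRF cascade increases the Markov order at each round. This is not the authors' implementation; the `score` function is a random stand-in for Markov-transformer log-potentials, and the names `VOCAB`, `LENGTH`, and `KEEP` are illustrative assumptions.

```python
# Hedged sketch of cascaded pruning with scoring models of increasing
# context length. Not the paper's code: scores are random placeholders.
import random

VOCAB = ["a", "b", "c", "d"]
LENGTH = 4   # output length, assumed known up front as in non-autoregressive MT
KEEP = 2     # candidates kept per position after each cascade round

def score(key):
    """Stand-in for a Markov-transformer log-potential over a local context."""
    rng = random.Random(repr(key))   # deterministic per n-gram
    return rng.random()

# Round 0: unigram scores prune each position's candidate set independently
# (fully parallel across positions).
candidates = []
for pos in range(LENGTH):
    ranked = sorted(VOCAB, key=lambda w: -score((pos, w)))
    candidates.append(ranked[:KEEP])

# Round 1: bigram (first-order Markov) scores rescore adjacent pairs;
# each adjacent pair can be scored in parallel, and only tokens appearing
# in a surviving high-scoring pair are kept.
for pos in range(LENGTH - 1):
    pairs = [(u, v) for u in candidates[pos] for v in candidates[pos + 1]]
    pairs.sort(key=lambda p: -score((pos,) + p))
    survivors = pairs[:KEEP]
    candidates[pos] = sorted({u for u, _ in survivors})
    candidates[pos + 1] = sorted({v for _, v in survivors})

print(candidates)  # pruned lattice; a final Viterbi pass would select one path
```

In the actual method, further rounds would raise the context length (trigrams and beyond) on an ever-smaller lattice, which is what makes the sub-linear parallel decoding time possible.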


Review for NeurIPS paper: Cascaded Text Generation with Markov Transformers

Neural Information Processing Systems

Weaknesses: While I am advocating for this paper's acceptance, I'm curious whether the authors think this will truly be the dominant approach going forward in this area. I find this approach theoretically more appealing than the Levenshtein transformer, but I don't think the "global communication" the authors frame as a negative feature of that model is strictly a negative. Sure, the more local nature of this one gives a speedup. But successfully capturing long-range dependencies is one of the things transformer models like GPT-3 seem to be good at. This concern is masked by the paper evaluating only on MT; in MT, the input heavily constrains the shape of the output, so long-range output dependencies may not be quite as necessary.


Review for NeurIPS paper: Cascaded Text Generation with Markov Transformers

Neural Information Processing Systems

This paper proposes a semi-autoregressive neural text generation method with a cascaded transformer, where candidate outputs are pruned during decoding by scoring with CRF models of increasing context length. The method supports parallel computation at inference, yielding a 7x speedup over the autoregressive method with some loss in quality, which is better than most existing non-autoregressive methods. Reviewers all agree that the idea is novel and well developed. The experiments are extensive and show a significant benefit over existing autoregressive and non-autoregressive methods.

