Reviews: XLNet: Generalized Autoregressive Pretraining for Language Understanding

Neural Information Processing Systems 

Originality: The architecture is novel compared to recent lines of language-model work, which all used variations of BERT or GPT (SciBERT, MT-DNN, MASS, etc.). The example (the "New York is a city" one) makes sense, but since the permutation is sampled randomly when computing the objective, I still could not see why it works better than sequential order, given that humans speak and write in sequential order. Could you add more intuition to the paper? Or have you tried predicting n-grams and comparing that to permutation?

Quality: Very high, considering they ran extensive studies on multiple benchmarks; the ablation study is nicely done as well.
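To make the permutation-vs-sequential question concrete, here is a toy sketch (not the paper's implementation; the function name and example are illustrative) of how a random factorization order changes what each token is conditioned on under an autoregressive objective:

```python
import random

def factorization_contexts(tokens, order):
    # For each position, visited in the given factorization order,
    # record the tokens it is conditioned on when predicted.
    contexts = {}
    seen = []
    for pos in order:
        contexts[pos] = [tokens[p] for p in seen]
        seen.append(pos)
    return contexts

tokens = ["New", "York", "is", "a", "city"]

# Standard left-to-right factorization: "York" only ever sees "New".
left_to_right = factorization_contexts(tokens, [0, 1, 2, 3, 4])
print(left_to_right[1])  # ['New']

# A random permutation order: across many sampled permutations,
# "York" can also condition on tokens to its right ("is", "a", "city"),
# so the model sees bidirectional context while staying autoregressive.
order = list(range(len(tokens)))
random.shuffle(order)
permuted = factorization_contexts(tokens, order)
print(permuted)
```

Under a random permutation each position is still predicted autoregressively, but its conditioning set varies from sample to sample, which is the intuition I would like spelled out more explicitly in the paper.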