R-Drop: a simple trick to improve Dropout
In each training step, every data sample is passed through the model twice. Because dropout randomly samples a different sub-model on each forward pass, the two passes produce two output distributions, P_1(y|x) and P_2(y|x). R-Drop trains these two distributions to be consistent by minimizing the bidirectional KL divergence between them. The final loss combines the negative log-likelihood (cross-entropy) loss L_NLL with the bidirectional KL term L_KL, i.e. L = L_NLL + α · L_KL, where α weights the regularizer. The KL divergence is measured in both directions, KL(P_1 || P_2) and KL(P_2 || P_1), and the two are averaged.
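The loss described above can be sketched as follows. This is a minimal NumPy illustration, not the official implementation; the function names (`r_drop_loss`, `kl_div`) and the default α = 1.0 are assumptions for the example. `logits1` and `logits2` stand for the outputs of the two dropout passes over the same batch.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), averaged over the batch dimension.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def r_drop_loss(logits1, logits2, labels, alpha=1.0):
    # L = L_NLL + alpha * L_KL, where L_KL is the average of the
    # bidirectional KL between the two dropout passes.
    p1, p2 = softmax(logits1), softmax(logits2)
    idx = np.arange(len(labels))
    # Cross-entropy (NLL) is computed on both passes.
    nll = (-np.mean(np.log(p1[idx, labels] + 1e-12))
           - np.mean(np.log(p2[idx, labels] + 1e-12)))
    # Bidirectional KL, averaged over the two directions.
    kl = 0.5 * (kl_div(p1, p2) + kl_div(p2, p1))
    return nll + alpha * kl
```

In a real training loop the same batch is simply fed through the network twice with dropout enabled; the two sets of logits then go into a loss of this shape.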
Oct-27-2021, 22:55:29 GMT
- Technology