A Appendix


$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\big(\mathrm{sim}(\boldsymbol{z}_i \odot \boldsymbol{M}, \boldsymbol{z}_j \odot \boldsymbol{M})/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(\boldsymbol{z}_i \odot \boldsymbol{M}, \boldsymbol{z}_k \odot \boldsymbol{M})/\tau\big)}, \quad (10)$$

which is derived by considering Equation 4, where $\boldsymbol{M}$ is the dimensional mask and $\tau$ denotes the temperature. To simplify the derivation, we formulate MetaMask's training paradigm as follows: the encoder is trained on the masked contrastive objective, while $\boldsymbol{M}$ is meta-learned to reduce the gradient effects of confounded dimensions; code sketches are provided at the end of this appendix.

In order to prove Theorem 5.2, we compare the bounds of the supervised cross-entropy loss achieved with and without the dimensional mask.

A.2.1 Proof for the Equality Part

To prove the equality part, we analyze $\Phi_g$. To prove Equation 20, we demonstrate an evidence example in Figure 5. The reason behind such a phenomenon is that, following Theorem 5.1, the self-paced dimensional mask jointly enhances the gradients of the discriminative dimensions while suppressing those carrying the dimensional confounder.

In light of the proofs in Section A.2.1 and Section A.2.2, we confirm the validity of Theorem 5.2. Then, we bring Theorem 5.2 into Theorem 5.1 to derive the comparison of the lower bounds achieved by the masked and vanilla representations; a schematic version of this comparison is given below. Therefore, the lower bound obtained by the masked representation, i.e., MetaMask, is larger than that obtained by the vanilla representation. Concretely, we conclude that our approach can better bound the downstream classification risk.

However, our dimensional confounder is defined from the dimensional perspective as a negative factor that may lead to model degradation. For optimization, we train MetaMask with a fixed learning rate instead of the cosine annealing strategy; see the sketch at the end of this appendix.
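For intuition on the comparison of lower bounds, assume Theorem 5.1 takes the standard InfoNCE form $I(\boldsymbol{z}^1; \boldsymbol{z}^2) \ge \log N - \mathcal{L}_{\mathrm{con}}$ and that Theorem 5.2 establishes $\mathcal{L}^{\mathrm{mask}}_{\mathrm{con}} \le \mathcal{L}_{\mathrm{con}}$; the exact statements in the main text may differ. Under these assumptions, the comparison reads

$$\mathcal{L}^{\mathrm{mask}}_{\mathrm{con}} \le \mathcal{L}_{\mathrm{con}} \;\Longrightarrow\; \log N - \mathcal{L}^{\mathrm{mask}}_{\mathrm{con}} \;\ge\; \log N - \mathcal{L}_{\mathrm{con}},$$

so the masked representation admits the larger lower bound, which in turn yields the tighter bound on the downstream classification risk.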
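For concreteness, the following is a minimal sketch of the masked objective in Equation 10, assuming an NT-Xent-style loss and a sigmoid-activated mask vector; the names `masked_ntxent` and `mask_logits` and the batch layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def masked_ntxent(z1, z2, mask_logits, tau=0.5):
    """NT-Xent-style loss on dimensionally masked representations (cf. Equation 10).

    z1, z2:      (N, D) projections of two augmented views of the same batch.
    mask_logits: (D,) learnable logits of the dimensional mask M.
    tau:         temperature.
    """
    m = torch.sigmoid(mask_logits)       # dimensional mask M in (0, 1)^D
    z1 = F.normalize(z1 * m, dim=1)      # apply M elementwise, then l2-normalize
    z2 = F.normalize(z2 * m, dim=1)
    z = torch.cat([z1, z2], dim=0)       # (2N, D)
    sim = z @ z.t() / tau                # pairwise sim(., .) / tau
    sim.fill_diagonal_(float('-inf'))    # drop self-similarity from the denominator
    n = z1.size(0)
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(sim.device)
    return F.cross_entropy(sim, pos)     # -log softmax at the positive pair
```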
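The training paradigm alternates encoder and mask updates. Below is a first-order sketch of the bi-level scheme, reusing `masked_ntxent` from above and assuming the mask is updated against the loss of the just-updated encoder; the paper's exact meta-objective and any second-order terms are omitted.

```python
import torch

def training_step(encoder, mask_logits, opt_enc, opt_mask, view1, view2, tau=0.5):
    # Inner step: update the encoder on the masked contrastive loss,
    # holding the dimensional mask fixed.
    loss_enc = masked_ntxent(encoder(view1), encoder(view2),
                             mask_logits.detach(), tau)
    opt_enc.zero_grad(); loss_enc.backward(); opt_enc.step()

    # Outer step: update the mask against the loss of the updated encoder,
    # so dimensions that hurt the objective are down-weighted.
    with torch.no_grad():
        z1, z2 = encoder(view1), encoder(view2)
    loss_mask = masked_ntxent(z1, z2, mask_logits, tau)
    opt_mask.zero_grad(); loss_mask.backward(); opt_mask.step()
    return loss_enc.item(), loss_mask.item()
```

Here `mask_logits` is a leaf tensor, e.g. `mask_logits = torch.zeros(D, requires_grad=True)` with `opt_mask = torch.optim.Adam([mask_logits], lr=1e-3)`.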
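The fixed-learning-rate trick amounts to skipping the scheduler step; a minimal sketch follows, where the optimizer type, rate values, and the stand-in model are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                              # stand-in for the encoder
opt = torch.optim.SGD(model.parameters(), lr=0.03,
                      momentum=0.9, weight_decay=1e-4)   # illustrative values
num_epochs = 100

# Cosine annealing baseline, disabled under the fixed-LR trick:
# sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... run one training epoch with `opt` ...
    # sched.step()  # omitted: the learning rate stays fixed at 0.03
    pass
```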