 taylor expansion


Supplementary to "Approximation with CNNs in Sobolev Space: with Applications to Classification"

Neural Information Processing Systems

In the supplementary materials, we include detailed descriptions of convex surrogate losses, convolutional neural networks, and non-asymptotic error bounds for commonly used loss functions, together with proofs of the theorems (Theorems 2.1, 2.2, ...). A toy example on the numerical performance of CNN approximation is presented in Appendix D.

We next give a brief review of convex surrogate loss functions and discuss in detail the connection between the excess risk with respect to the $\phi$-loss and that of the 0-1 loss [28, 4]. Let $\phi: \mathbb{R} \to [0, \infty)$ be a given convex univariate function. Instead of minimizing the excess risk $R$ over $\mathcal{H}$, we consider minimizing the risk with respect to the loss $\phi$ (the $\phi$-risk) $R^{\phi}(f) := \mathbb{E}\{\phi(Y f(X))\}$ over a certain class of functions $\mathcal{F}$, where $\phi$ is some generic loss function. For the special case when $\mathcal{H} = \{h : h(x) = \mathrm{sign}(f(x)),\ f \in \mathcal{F}\}$ and $\phi(\cdot)$ is a step function, i.e., $\phi(x) = 1$ ...

As shown in [28] and [4], for a properly chosen $\phi$, $\hat{f}_n$ can indeed help reduce the 0-1 excess risk $R(\hat{h}_n) - R(h_0)$. More precisely, let $R^{\phi}_0 := \inf_{f \text{ measurable}} R^{\phi}(f)$; then, for a proper $\phi$, we have $\psi\big(R(\hat{h}_n) - R(h_0)\big) \le R^{\phi}(\hat{f}_n) - R^{\phi}(f_0)$, where $\psi: [-1, 1] \to [0, \infty)$ is a nonnegative continuous function that is invertible on $[0, 1]$ and achieves its minimum at 0 with $\psi(0) = 0$. A wide variety of popular classification methods are based on this tactic.

Guohao Shen and Yuling Jiao contributed equally to this work. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
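As a small illustration of the surrogate-loss idea described above, the following sketch computes an empirical $\phi$-risk and the empirical 0-1 risk on a handful of made-up scores. The function names, the choice of surrogates (hinge, exponential, base-2 logistic) and the data are assumptions for illustration only, not taken from the paper.

```python
import math

# Convex surrogate losses phi(z), evaluated at the margin z = y * f(x).
# Each of these dominates the 0-1 loss 1{z <= 0}.
def hinge(z):
    return max(0.0, 1.0 - z)

def exponential(z):
    return math.exp(-z)

def logistic2(z):
    # Logistic loss in base 2, so that phi(0) = 1.
    return math.log2(1.0 + math.exp(-z))

def phi_risk(phi, scores, labels):
    """Empirical phi-risk: mean of phi(y * f(x)) over the sample."""
    return sum(phi(y * s) for s, y in zip(scores, labels)) / len(labels)

def zero_one_risk(scores, labels):
    """Empirical 0-1 risk of h = sign(f); a zero margin counts as an error."""
    return sum(1.0 for s, y in zip(scores, labels) if y * s <= 0) / len(labels)

# Hypothetical scores f(x_i) and labels y_i in {-1, +1}.
scores = [2.0, -0.5, 0.3, -1.5, 0.8]
labels = [1, 1, -1, -1, 1]

for name, phi in [("hinge", hinge), ("exp", exponential), ("logistic2", logistic2)]:
    print(name, phi_risk(phi, scores, labels), zero_one_risk(scores, labels))
```

Because each surrogate here upper-bounds the 0-1 loss pointwise, its empirical risk upper-bounds the empirical 0-1 risk on any sample. The excess-risk comparison through $\psi$ in the text is a stronger, population-level statement that this sketch does not verify.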



Energy Score-Guided Neural Gaussian Mixture Model for Predictive Uncertainty Quantification

arXiv.org Machine Learning

Quantifying predictive uncertainty is essential for real-world machine learning applications, especially in scenarios requiring reliable and interpretable predictions. Many common parametric approaches rely on neural networks to estimate distribution parameters by optimizing the negative log-likelihood. However, these methods often encounter challenges such as training instability and mode collapse, leading to poor estimates of the mean and variance of the target output distribution. In this work, we propose the Neural Energy Gaussian Mixture Model (NE-GMM), a novel framework that integrates a Gaussian Mixture Model (GMM) with the Energy Score (ES) to enhance predictive uncertainty quantification. NE-GMM uses the flexibility of the GMM to capture complex multimodal distributions and the robustness of the ES to ensure well-calibrated predictions in diverse scenarios. We theoretically prove that the hybrid loss function is a strictly proper scoring rule, ensuring alignment with the true data distribution, and establish generalization error bounds, demonstrating that the model's empirical performance closely aligns with its expected performance on unseen data. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of NE-GMM in terms of both predictive accuracy and uncertainty quantification.
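To make the two ingredients named in the abstract concrete, here is a minimal sketch of the negative log-likelihood of a one-dimensional Gaussian mixture and a Monte Carlo estimate of the energy score, $\mathrm{ES}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ with $X, X' \sim F$. The abstract does not state the exact form of the hybrid loss, so the weighted sum `hybrid_loss` and its weight `lam` are hypothetical, as are all function names and parameter values.

```python
import math
import random

def gmm_nll(y, weights, means, stds):
    """Negative log-likelihood of y under a 1-D Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return -math.log(density)

def sample_gmm(rng, weights, means, stds, n):
    """Draw n samples: pick a component by weight, then sample its Gaussian."""
    comps = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[k], stds[k]) for k in comps]

def energy_score(y, samples):
    """Sample-based energy score (lower is better), O(n^2) in the sample size."""
    n = len(samples)
    fit = sum(abs(x - y) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return fit - 0.5 * spread

def hybrid_loss(y, weights, means, stds, samples, lam=1.0):
    # Hypothetical combination for illustration; not the paper's formula.
    return gmm_nll(y, weights, means, stds) + lam * energy_score(y, samples)

rng = random.Random(0)
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]
samples = sample_gmm(rng, weights, means, stds, 400)
print(hybrid_loss(2.0, weights, means, stds, samples))
print(hybrid_loss(10.0, weights, means, stds, samples))
```

An observation near one of the mixture modes receives a lower score than one far from both modes under either term, which is the behaviour a scoring rule in this orientation should show.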


QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion

Neural Information Processing Systems

The Vision Transformer (ViT) is widely used and performs well in vision tasks due to its ability to capture long-range dependencies. However, its time complexity and memory consumption increase quadratically with the number of input patches, which limits the use of ViT in real-world applications. Previous methods have employed linear attention to mitigate the complexity of the original self-attention mechanism at the expense of effectiveness. In this paper, we propose QT-ViT models that improve on previous linear self-attention using a quadratic Taylor expansion. Specifically, we substitute the softmax-based attention with a second-order Taylor expansion, and then accelerate the quadratic expansion by reducing the time complexity with a fast approximation algorithm. The proposed method capitalizes on the properties of the quadratic expansion to achieve superior performance while employing a linear approximation for fast inference. Compared to previous studies of linear attention, our approach does not require knowledge distillation or high-order attention residuals to facilitate the training process. Extensive experiments demonstrate the efficiency and effectiveness of the proposed QT-ViTs, which achieve state-of-the-art results. In particular, the proposed QT-ViTs consistently surpass the previous SOTA EfficientViTs across different model sizes, and achieve a new Pareto front in terms of accuracy and speed.
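The substitution described in the abstract can be sketched as follows. Replacing $\exp(q \cdot k)$ with its second-order Taylor expansion $1 + q \cdot k + \tfrac{1}{2}(q \cdot k)^2$ gives a kernel that factorizes as $\varphi(q) \cdot \varphi(k)$ with $\varphi(x) = [1,\ x,\ \mathrm{vec}(x x^\top)/\sqrt{2}]$, so the sums over keys can be computed once and reused for every query. This is only an illustration of that factorization under assumed names and toy data; it omits the usual $1/\sqrt{d}$ scaling and does not implement the paper's fast approximation algorithm, which further reduces the cost of the quadratic term.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def taylor2(x):
    """Second-order Taylor expansion of exp(x) around 0; always positive."""
    return 1.0 + x + 0.5 * x * x

def feature_map(x):
    """phi(x) such that dot(phi(q), phi(k)) == taylor2(dot(q, k))."""
    quad = [xi * xj / math.sqrt(2.0) for xi in x for xj in x]
    return [1.0] + list(x) + quad

def quadratic_attention(Q, K, V):
    """Reference form: builds all N x N weights explicitly."""
    out = []
    for q in Q:
        w = [taylor2(dot(q, k)) for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result, but key/value sums are accumulated once (linear in N)."""
    feats = [feature_map(k) for k in K]
    m, dv = len(feats[0]), len(V[0])
    # S[i][c] = sum_j phi(k_j)[i] * v_j[c];  z[i] = sum_j phi(k_j)[i]
    S = [[sum(f[i] * v[c] for f, v in zip(feats, V)) for c in range(dv)]
         for i in range(m)]
    z = [sum(f[i] for f in feats) for i in range(m)]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][c] for i in range(m)) / denom
                    for c in range(dv)])
    return out

# Toy inputs (hypothetical values).
Q = [[0.1, -0.2], [0.3, 0.4]]
K = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
print(quadratic_attention(Q, K, V))
print(linear_attention(Q, K, V))
```

The feature map has dimension $1 + d + d^2$, so this exact factorization is linear in the number of tokens but quadratic in the head dimension, which is the cost a faster approximation would aim to reduce.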