Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms

Neural Information Processing Systems

We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression and various implementations of gradient descent.
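A minimal sketch of one member of this class, kernel ridge regression with vector-valued outputs, assuming an RBF kernel and synthetic data (all names and constants below are illustrative, not from the paper):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Synthetic regression problem with 2-dimensional outputs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))
Y = np.stack([np.sin(X.sum(1)), np.cos(X.sum(1))], axis=1)

lam = 1e-2                      # Tikhonov (ridge) regularization strength
K = rbf_kernel(X, X)
# Ridge regression solves (K + n*lam*I) alpha = Y, column by column.
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), Y)

X_test = rng.uniform(-1, 1, size=(10, 3))
Y_hat = rbf_kernel(X_test, X) @ alpha   # vector-valued predictions
print(Y_hat.shape)                      # (10, 2)
```

Other members of the class replace the ridge filter 1/(sigma + lam) applied to the kernel's eigenvalues with a different spectral filter, for instance a hard truncation for principal component regression or a polynomial filter for early-stopped gradient descent.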


Debiased Bayesian inference for average treatment effects

Neural Information Processing Systems

Bayesian approaches have become increasingly popular in causal inference problems due to their conceptual simplicity, excellent performance and in-built uncertainty quantification ('posterior credible sets'). We investigate Bayesian inference for average treatment effects from observational data, which is a challenging problem due to the missing counterfactuals and selection bias. Working in the standard potential outcomes framework, we propose a data-driven modification to an arbitrary (nonparametric) prior based on the propensity score that corrects for the first-order posterior bias, thereby improving performance. We illustrate our method for Gaussian process (GP) priors using (semi-)synthetic data. Our experiments demonstrate significant improvement in both estimation accuracy and uncertainty quantification compared to the unmodified GP, rendering our approach highly competitive with the state-of-the-art.
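To illustrate the role the propensity score plays in a first-order bias correction, here is a sketch of the classical AIPW (doubly robust) estimator on toy data. This is a standard frequentist construction, not the paper's posterior modification, and every modeling choice below is an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy observational data: confounder X affects both treatment T and outcome Y.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-X[:, 0]))                          # true propensity score
T = rng.binomial(1, p)
Y = 2.0 * T + X[:, 0] + rng.normal(scale=0.5, size=n)   # true ATE = 2

# Nuisance estimates: propensity score and GP outcome regressions.
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.05, 0.95)   # trim extreme weights for stability
mu1 = GaussianProcessRegressor(normalize_y=True).fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = GaussianProcessRegressor(normalize_y=True).fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW: plug-in estimate plus a first-order, propensity-weighted correction.
plug_in = mu1 - mu0
correction = T * (Y - mu1) / ps - (1 - T) * (Y - mu0) / (1 - ps)
print("ATE estimate:", np.mean(plug_in + correction))
```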



Stand-Alone Self-Attention in Vision Models

Neural Information Processing Systems

Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer.
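A simplified sketch of a stand-alone self-attention layer used as a drop-in replacement for a convolution. Assumptions: global single-head attention over all spatial positions; the actual model uses local windowed attention with relative position embeddings and multiple heads:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Self-attention over spatial positions of a feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.q = nn.Conv2d(in_ch, out_ch, 1)   # 1x1 convs compute Q, K, V
        self.k = nn.Conv2d(in_ch, out_ch, 1)
        self.v = nn.Conv2d(in_ch, out_ch, 1)
        self.scale = out_ch ** -0.5

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        k = self.k(x).flatten(2)                          # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return out

layer = SelfAttention2d(16, 16)
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```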



Error Correction Output Codes for Robust Neural Networks against Weight-errors: A Neural Tangent Kernel Point of View

Neural Information Processing Systems

Error-correcting output code (ECOC) is a classic method that combines binary classifiers to tackle the multi-class classification problem in decision trees and neural networks. Among ECOCs, the one-hot code has become the default choice in modern deep neural networks (DNNs) due to its simplicity in decision making. However, it suffers from a significant limitation in its ability to achieve high robust accuracy, particularly in the presence of weight errors. While recent studies have experimentally demonstrated that non-one-hot ECOCs with multi-bit error-correction ability could be a better solution, there is a notable absence of theoretical foundations that elucidate the relationship between codeword design, weight-error magnitude, and network characteristics, so as to provide robustness guarantees. This work bridges that gap through the lens of the neural tangent kernel (NTK).
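To make the codeword picture concrete, here is a small sketch contrasting the one-hot code with a higher-distance ECOC under a single flipped classifier output (the codewords below are illustrative, not from the paper):

```python
import numpy as np

# Each row is a class codeword; each column is one binary classifier.
# Larger minimum Hamming distance between rows means more flipped bits
# (e.g., from weight errors) can be corrected.
one_hot = np.eye(4)                      # minimum distance 2
ecoc = np.array([[0, 0, 0, 0, 0],        # minimum distance 3
                 [0, 1, 1, 1, 0],
                 [1, 0, 1, 1, 1],
                 [1, 1, 0, 0, 1]])

def decode(code, bit_outputs):
    # Nearest-codeword decoding under Hamming distance.
    d = np.abs(code - bit_outputs).sum(axis=1)
    return int(np.argmin(d))

# True class 2 with its first bit flipped:
print(decode(ecoc, np.array([0, 0, 1, 1, 1])))     # 2: error corrected
print(decode(one_hot, np.array([1, 0, 1, 0])))     # 0: tie between classes
                                                   # 0 and 2, flip not corrected
```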


Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

Neural Information Processing Systems

Disentangled representation learning strives to extract the intrinsic factors within observed data. Factoring these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can themselves serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image into a set of concept tokens and treat them as the condition of a latent diffusion model for image reconstruction, where cross-attention over the concept tokens is used to bridge the encoder and the U-Net of the diffusion model. We show that the diffusion process inherently possesses time-varying information bottlenecks.
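A rough sketch of the kind of cross-attention bridge the abstract describes, where flattened U-Net features attend to a small set of concept tokens; the dimensions, head count, and class names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConceptCrossAttention(nn.Module):
    """U-Net features (queries) attend to concept tokens (keys/values)."""
    def __init__(self, feat_dim, token_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)

    def forward(self, feats, concept_tokens):
        # feats: (B, HW, feat_dim) flattened spatial features
        # concept_tokens: (B, K, token_dim) produced by the image encoder
        out, _ = self.attn(feats, concept_tokens, concept_tokens)
        return out

feats = torch.randn(2, 64, 128)    # an 8x8 feature map with 128 channels
tokens = torch.randn(2, 10, 32)    # 10 concept tokens of dimension 32
print(ConceptCrossAttention(128, 32)(feats, tokens).shape)  # (2, 64, 128)
```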


Optimal Pricing in Repeated Posted-Price Auctions with Different Patience of the Seller and the Buyer

Neural Information Processing Systems

We study revenue-optimizing pricing algorithms for repeated posted-price auctions in which a seller interacts with a single strategic buyer who holds a fixed private valuation. When the participants discount their cumulative utilities at different rates, we show that the constant pricing strategy that offers the Myerson price is no longer optimal. In the case of a more patient seller, we propose a novel multidimensional optimization functional, a generalization of the one used to determine Myerson's price. This functional allows us to find the optimal algorithm and to boost the revenue of the optimal static pricing via an efficient low-dimensional approximation. Numerical experiments are provided to support our results.
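For context, the Myerson price referenced above maximizes the one-shot expected revenue p * (1 - F(p)) for a buyer whose valuation has CDF F, since the buyer accepts a posted price p exactly when her valuation is at least p. A quick numerical sketch, with the uniform valuation distribution as an illustrative assumption:

```python
import numpy as np

# Valuations uniform on [0, 1], so F(p) = p and revenue is p * (1 - p).
F = lambda p: np.clip(p, 0.0, 1.0)

grid = np.linspace(0.0, 1.0, 10_001)
revenue = grid * (1.0 - F(grid))
p_star = grid[np.argmax(revenue)]
print(p_star, revenue.max())   # ~0.5 and ~0.25 for Uniform[0, 1]
```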


Future-proof your career by mastering AI skills for just $20

Popular Science

If you're starting to feel a little behind in your career because you aren't completely proficient with AI, you don't need to worry. Even beginners can quickly master valuable AI skills without any tech background in the ChatGPT & Automation E-Degree program, and it's on sale right now for just $19.97. This program offers 12 captivating modules that allow you to immerse yourself in more than 25 hours of engaging coursework. It will transform your perception of the digital world. You'll master ChatGPT and over 20 AI tools that are indispensable in facing the dynamic challenges in today's coding, business, and marketing industries.

