Neural Information Processing Systems
PointAD: Comprehending 3D Anomalies from Points and Pixels for Zero-shot 3D Anomaly Detection
Zero-shot (ZS) 3D anomaly detection is a crucial yet unexplored field that addresses scenarios where target 3D training samples are unavailable due to practical concerns like privacy protection. This paper introduces PointAD, a novel approach that transfers the strong generalization capabilities of CLIP for recognizing 3D anomalies on unseen objects. PointAD provides a unified framework to comprehend 3D anomalies from both points and pixels.
3ab6be46e1d6b21d59a3c3a0b9d0f6ef-AuthorFeedback.pdf
R1: "I am not entirely convinced that an amortized explanation model is a reasonable thing. R2: "I would appreciate some clarification about what is gained by learning Â and not just reporting Ω directly." We thank R1, R2 and R3 for their insightful feedback. However, Ω can only be computed given ground truth labels. R1: "(The objective) does not attribute importance to features that change how the model goes wrong (...)" R1: "Why is the rank-based method necessary?" R2: "Additionally, can the authors clarify what is being averaged in the definition of the causal objective?" The causal objective is averaged over all N samples in the dataset. Every data point has an Ω. R2: "If the goal is to determine what might happen to our predictions if we change a particular feature slightly Our goal is not to estimate what would happen if a particular feature's value changed, but to provide a causal explanation R2: "Some additional clarity on why the authors are using a KL discrepancy is merited. R3: "Masking one by one; this is essentially equivalent to assuming that feature contributions are additive." We do not define a feature's importance as its additive contribution to the model output, but as its marginal reduction This subtle change in definition allows us to efficiently compute feature importance one by one. R3: "Replacing a masked value by a point-wise estimation can be very bad, especially when the classifiers output Why would the average value (or, even worse, zero) be meaningful?" We will clarify this point in the next revision. R3: "It would also be interesting to compare the proposed method with causal inference technique for SEMs." Recent work [29] has explored the use of SEMs for model attribution in deep learning. CXPlain can explain any machine-learning model, and (ii) attribution time was considerably slower than CXPlain. R3: "It seems to me that the chosen performance measure may correlate much more with the Granger-causal loss
Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory
Counterfactual explanations provide ways of achieving a favorable model outcome with minimum input perturbation. However, counterfactual explanations can also be leveraged to reconstruct the model by strategically training a surrogate model to give similar predictions as the original (target) model. In this work, we analyze how model reconstruction using counterfactuals can be improved by further leveraging the fact that the counterfactuals also lie quite close to the decision boundary. Our main contribution is to derive novel theoretical relationships between the error in model reconstruction and the number of counterfactual queries required using polytope theory. Our theoretical analysis leads us to propose a strategy for model reconstruction that we call Counterfactual Clamping Attack (CCA) which trains a surrogate model using a unique loss function that treats counterfactuals differently than ordinary instances. Our approach also alleviates the related problem of decision boundary shift that arises in existing model reconstruction approaches when counterfactuals are treated as ordinary instances. Experimental results demonstrate that our strategy improves fidelity between the target and surrogate model predictions on several datasets.
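As a rough illustration of a loss that "treats counterfactuals differently than ordinary instances", the sketch below assumes a binary classifier whose surrogate outputs a probability, and assumes that a counterfactual should only be penalized when the surrogate places it on the unfavorable side of the 0.5 boundary. The function names `bce` and `clamped_loss`, and the exact form of the clamping, are illustrative assumptions rather than the paper's definition.

```python
import math


def bce(target, pred, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def clamped_loss(pred, label, is_counterfactual, boundary=0.5):
    """Hypothetical per-example loss for training a surrogate model.

    Ordinary instances use plain cross-entropy against their label.
    Counterfactuals (assumed to lie just on the favorable side of the
    target's decision boundary) incur no loss once the surrogate also
    places them at or above the boundary; otherwise they are pulled
    toward the boundary value.
    """
    if is_counterfactual:
        if pred >= boundary:
            return 0.0  # clamped: already on the favorable side
        return bce(boundary, pred)
    return bce(label, pred)


print(clamped_loss(0.9, 1.0, is_counterfactual=True))   # no penalty
print(clamped_loss(0.2, 1.0, is_counterfactual=True))   # penalized
print(clamped_loss(0.9, 1.0, is_counterfactual=False))  # ordinary cross-entropy
```

Under this assumption a counterfactual never pushes the surrogate's boundary further than necessary, which is one way to read the abstract's remark about avoiding decision boundary shift.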
Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
Liwei Wu, Shuqing Li, Cho-Jui Hsieh, James L. Sharpnack
In deep neural nets, lower-level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. Alternatively, we propose stochastic shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large-scale neural networks. We develop two versions of SSE: SSE-Graph, which uses knowledge graphs of embeddings, and SSE-SE, which uses no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems to the transformer and BERT in natural language processing. We find that when used along with widely used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.
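A minimal sketch of the SSE-SE idea, assuming that "using no prior information" means each embedding index in a mini-batch is, with some small probability `p`, swapped for a uniformly chosen different index before the embedding lookup. The function name `sse_se_indices` and its parameters are hypothetical; the actual transition probabilities used in the paper may differ.

```python
import random


def sse_se_indices(indices, num_embeddings, p, rng):
    """Return a copy of `indices` in which each entry is, with probability p,
    replaced by an index drawn uniformly from the other embeddings.

    Requires num_embeddings >= 2.
    """
    swapped = []
    for idx in indices:
        if rng.random() < p:
            # Draw uniformly from the num_embeddings - 1 indices other than idx.
            j = rng.randrange(num_embeddings - 1)
            swapped.append(j if j < idx else j + 1)
        else:
            swapped.append(idx)
    return swapped


rng = random.Random(0)
batch = [3, 7, 7, 1, 0]
# The perturbed indices would be fed to the embedding lookup during training only.
print(sse_se_indices(batch, num_embeddings=10, p=0.1, rng=rng))
```

Because only the indices are perturbed, a step like this can sit in front of an unchanged SGD training loop, which matches the abstract's claim that only minor modifications are needed.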
One Epoch
Figure 1: Projecting 50-dimensional embeddings obtained by training a simple neural network without SSE (Left), with SSE-Graph (Center), and with SSE-SE (Right) into 3D space using PCA. Table 1: The experimental results of BERT-NER with and without SSE-SE when we only allow fine-tuning BERT for one epoch and two epochs. The same set of hyper-parameters is used. We thank the reviewers for their insightful feedback. In the following, we address their concerns and questions.
A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding
Inria
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems, and whose architectures are derived from an optimization algorithm. We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions. This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end. The priors used in this presentation include variants of total variation, Laplacian regularization, bilateral filtering, sparse coding on learned dictionaries, and non-local self similarities. Our models are fully interpretable as well as parameter and data efficient. Our experiments demonstrate their effectiveness on a large diversity of tasks ranging from image denoising and compressed sensing for fMRI to dense stereo matching.
Sample Complexity of Posted Pricing for a Single Item
Billy Jin, Thomas Kesselheim, Will Ma
Selling a single item to n self-interested buyers is a fundamental problem in economics, where the two objectives typically considered are welfare maximization and revenue maximization. Since the optimal mechanisms are often impractical and do not work for sequential buyers, posted pricing mechanisms, where fixed prices are set for the item for different buyers, have emerged as a practical and effective alternative. This paper investigates how many samples are needed from buyers' value distributions to find near-optimal posted prices, considering both independent and correlated buyer distributions, and welfare versus revenue maximization. We obtain matching upper and lower bounds (up to logarithmic factors) on the sample complexity for all these settings.
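To make the notion of learning a posted price from samples concrete, the sketch below covers only the simplest case: a single buyer, revenue maximization, and a price chosen to maximize revenue on the empirical distribution of sampled values. It is an illustration of the setting, not the paper's algorithm; the names `empirical_revenue` and `best_posted_price` are hypothetical.

```python
def empirical_revenue(price, samples):
    """Average revenue of posting `price` to a buyer whose value is drawn
    uniformly from `samples` (the buyer purchases when value >= price)."""
    sold = sum(1 for v in samples if v >= price)
    return price * sold / len(samples)


def best_posted_price(samples):
    """Return a price maximizing empirical revenue.

    For the empirical distribution, some sampled value is always optimal,
    so only those candidates are checked.
    """
    candidates = sorted(set(samples))
    return max(candidates, key=lambda p: empirical_revenue(p, samples))


values = [2, 3, 3, 4]
price = best_posted_price(values)
print(price, empirical_revenue(price, values))  # 3 2.25
```

The sample complexity question is then how many sampled values are needed before a price chosen this way is near-optimal for the true distribution.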
FedLPA: One-shot Federated Learning with Layer-Wise Posterior Aggregation
Efficiently aggregating trained neural networks from local clients into a global model on a server is a widely researched topic in federated learning. Recently, motivated by reducing privacy concerns, mitigating potential attacks, and lowering communication overhead, one-shot federated learning (i.e., limiting client-server communication to a single round) has gained popularity among researchers. However, one-shot aggregation performance is sensitive to non-identical training data distributions, which exhibit high statistical heterogeneity in some real-world scenarios. To address this issue, we propose a novel one-shot aggregation method with layer-wise posterior aggregation, named FedLPA. FedLPA aggregates local models to obtain a more accurate global model without requiring extra auxiliary datasets or exposing any private label information, e.g., label distributions. To effectively capture the statistics maintained in the biased local datasets in the practical non-IID scenario, we efficiently infer the posteriors of each layer in each local model using layer-wise Laplace approximation and aggregate them to train the global parameters. Extensive experimental results demonstrate that FedLPA significantly improves learning performance over state-of-the-art methods across several metrics.
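One simple way to picture posterior aggregation is the product of Gaussians: if each client's Laplace approximation for a layer is summarized by a mean and a diagonal precision, the combined posterior has the summed precision and a precision-weighted mean. The sketch below assumes exactly that diagonal form; the function name `aggregate_layer` is hypothetical, and the paper's actual covariance structure and its step of training the global parameters are not reproduced here.

```python
def aggregate_layer(means, precisions):
    """Combine per-client Gaussian posteriors for one layer.

    means[k][i] and precisions[k][i] are client k's posterior mean and
    (diagonal) precision for parameter i. All precisions must be positive.
    Returns the mean and precision of the product of the Gaussians.
    """
    dim = len(means[0])
    agg_mean, agg_prec = [], []
    for i in range(dim):
        total = sum(p[i] for p in precisions)
        weighted = sum(p[i] * m[i] for m, p in zip(means, precisions))
        agg_prec.append(total)
        agg_mean.append(weighted / total)
    return agg_mean, agg_prec


# Two clients, one layer with two parameters.
client_means = [[0.0, 2.0], [4.0, 2.0]]
client_precisions = [[1.0, 1.0], [3.0, 1.0]]
print(aggregate_layer(client_means, client_precisions))  # ([3.0, 2.0], [4.0, 2.0])
```

Compared with plain parameter averaging, a client that is more certain about a parameter (higher precision) pulls the aggregate further toward its own estimate.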
2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf
Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom.
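The greedy pruning idea can be sketched independently of any particular model: repeatedly mask the head whose removal hurts a validation score the least, and stop once the score falls too far below the full model. In the sketch below, `score_fn` stands in for evaluating the model with only a given set of heads active, and the per-head contributions in the toy example are made up for illustration; neither is taken from the paper.

```python
def greedy_prune(num_heads, score_fn, max_drop):
    """Greedily mask attention heads.

    score_fn(active) returns a validation score (higher is better) for the
    model with only the heads in `active` enabled. Heads are removed one at
    a time while the score stays within `max_drop` of the unpruned model.
    Returns the sorted list of heads that remain active.
    """
    active = set(range(num_heads))
    baseline = score_fn(frozenset(active))
    while len(active) > 1:
        best_head, best_score = None, None
        for h in sorted(active):
            s = score_fn(frozenset(active - {h}))
            if best_score is None or s > best_score:
                best_head, best_score = h, s
        if baseline - best_score > max_drop:
            break  # removing any further head costs too much
        active.remove(best_head)
    return sorted(active)


# Toy stand-in for model evaluation: each head adds a fixed amount to the score.
toy_contribution = [0.5, 0.01, 0.3, 0.02]


def toy_score(active):
    return sum(toy_contribution[h] for h in active)


print(greedy_prune(4, toy_score, max_drop=0.05))  # [0, 2]
```

Each round costs one evaluation per remaining head, so in practice a cheaper per-head importance estimate would usually replace the exhaustive inner loop.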