Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection (Bo Han)

Neural Information Processing Systems

Out-of-distribution (OOD) detection is crucial for deploying reliable machine learning models in open-world applications. Recent advances in CLIP-based OOD detection have shown promising results by regularizing prompt tuning with OOD features extracted from ID data. However, the irrelevant context mined from ID data can be spurious due to inaccurate foreground-background decomposition, which limits OOD detection performance. In this work, we propose a novel framework, Self-Calibrated Tuning (SCT), to mitigate this problem and enable effective OOD detection with only the given few-shot ID data. Specifically, SCT introduces modulating factors on the two components of the original learning objective. During training, it adaptively shifts the optimization focus between the two tasks according to the prediction uncertainty of each sample, thereby calibrating the influence of the OOD regularization; the scheme is compatible with many prompt-tuning-based OOD detection methods. Extensive experiments and analyses characterize and demonstrate the effectiveness of the proposed SCT. The code is publicly available at: https://github.com/tmlr-group/SCT.
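
A minimal sketch of the idea of an uncertainty-modulated joint objective, assuming a PyTorch-style setup in which an ID classification loss and an OOD regularization term are combined; the function names and the specific form of the modulating factors are illustrative assumptions, not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def self_calibrated_loss(logits_id, labels, logits_ood, lam=1.0):
        """Illustrative uncertainty-modulated combination of an ID classification
        loss and an OOD regularization loss (not the paper's exact objective).
        Assumes logits_ood is aligned one-to-one with the ID batch, e.g. OOD
        features mined from the same images."""
        # Per-sample ID classification loss (prompt-tuning cross-entropy).
        ce = F.cross_entropy(logits_id, labels, reduction="none")

        # Prediction uncertainty of each ID sample, measured here by the
        # normalized entropy of the softmax prediction.
        probs = logits_id.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        uncertainty = entropy / torch.log(torch.tensor(float(logits_id.size(-1))))

        # OOD regularization: push surrogate-OOD logits toward a uniform
        # (maximum-entropy) prediction, again computed per sample.
        log_probs_ood = logits_ood.log_softmax(dim=-1)
        uniform = torch.full_like(log_probs_ood, 1.0 / logits_ood.size(-1))
        reg = F.kl_div(log_probs_ood, uniform, reduction="none").sum(dim=-1)

        # Modulating factors (illustrative choice): confident samples emphasize
        # the ID task, uncertain samples down-weight the OOD regularization.
        w_id = 1.0 + uncertainty
        w_ood = 1.0 - uncertainty

        return (w_id * ce).mean() + lam * (w_ood * reg).mean()
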



BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts (Simon Guo)

Neural Information Processing Systems

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch at large scale is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE: the dense models' feed-forward network (FFN) weights initialize the MoE's experts, while the remaining parameters are merged. However, this limits the reuse of dense model parameters to only the FFN layers, thereby constraining the benefits of "upcycling" these models into MoEs.
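
A rough sketch of the FFN-only upcycling scheme described above, assuming each dense model exposes its parameters as a state dict in which FFN weights are identifiable by a name pattern; the helper name and the simple-averaging merge are assumptions for illustration, not the paper's exact procedure.

    import torch

    def upcycle_dense_to_moe(dense_state_dicts, ffn_key="mlp"):
        """Illustrative FFN-only upcycling: each dense model's FFN weights become
        one expert; all remaining parameters are merged by simple averaging."""
        experts = []   # one FFN-only state dict per expert
        shared = {}    # averaged non-FFN parameters

        for sd in dense_state_dicts:
            experts.append({k: v.clone() for k, v in sd.items() if ffn_key in k})

        non_ffn_keys = [k for k in dense_state_dicts[0] if ffn_key not in k]
        for k in non_ffn_keys:
            shared[k] = torch.stack([sd[k] for sd in dense_state_dicts]).mean(dim=0)

        return experts, shared
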



Bias and variance of the Bayesian-mean decoder

Neural Information Processing Systems

Perception, in theoretical neuroscience, has been modeled as the encoding of external stimuli into internal signals, which are then decoded. The Bayesian mean is an important decoder, as it is optimal for purposes of both estimation and discrimination. We present widely applicable approximations to the bias and to the variance of the Bayesian mean, obtained under the minimal and biologically relevant assumption that the encoding results from a series of independent, though not necessarily identically distributed, signals. Simulations substantiate the accuracy of our approximations in the small-noise regime. The bias of the Bayesian mean comprises two components: one driven by the prior, and one driven by the precision of the encoding.
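
For reference, the posterior-mean (Bayesian-mean) decoder referred to above can be stated as follows; the notation is ours, not the paper's. Given internal signals r encoding a stimulus x with prior \pi(x) and likelihood p(r \mid x),

    \hat{x}_{\mathrm{BM}}(r) = \mathbb{E}[x \mid r]
        = \frac{\int x \, p(r \mid x)\, \pi(x)\, \mathrm{d}x}{\int p(r \mid x)\, \pi(x)\, \mathrm{d}x},
    \qquad
    \mathrm{bias}(x) = \mathbb{E}\big[\hat{x}_{\mathrm{BM}}(r) \mid x\big] - x .
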



Sparse and Continuous Attention Mechanisms (André F. T. Martins, Marcos Treviso, Vlad Niculae, et al.)

Neural Information Processing Systems

Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, there has been recent work on sparse alternatives to softmax (e.g., sparsemax).
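
As background on the finite-domain sparse alternatives mentioned above, here is a small NumPy implementation of sparsemax (Martins & Astudillo, 2016), the Euclidean projection of the scores onto the probability simplex; this is an illustration of the prior technique, not code from the paper.

    import numpy as np

    def sparsemax(z):
        """Sparsemax of a 1-D score vector: Euclidean projection onto the simplex.
        Unlike softmax, it can assign exactly zero probability to low scores."""
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]                       # scores in descending order
        k = np.arange(1, len(z) + 1)
        cumsum = np.cumsum(z_sorted)
        support = 1.0 + k * z_sorted > cumsum             # coordinates that survive
        k_z = k[support][-1]                              # size of the support
        tau = (cumsum[support][-1] - 1.0) / k_z           # threshold
        return np.maximum(z - tau, 0.0)

    # Example: sparsemax zeroes out the weakest score, softmax would not.
    print(sparsemax([2.0, 1.5, -1.0]))   # [0.75, 0.25, 0.  ]
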


f0b76267fbe12b936bd65e203dc675c1-AuthorFeedback.pdf

Neural Information Processing Systems

Note that the VQA results in Table 2 with continuous attention use fewer basis functions than discrete regions. Good idea, we will add this to the camera-ready version. "Is this a necessary or a sufficient condition?" Sufficient; we will clarify and follow the suggestions (move the beta-escort definition to the main text and fix typos). We will add a citation. We chose ridge regression because it yields a closed-form solution expressed linearly in terms of the basis functions (Eq. We have not tried linear interpolation; however, for a high-level vision system, combining our method with BUTD is an interesting idea. Text is naturally composed of discrete tokens.
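
The closed-form solution alluded to above is the standard ridge-regression estimator, which is linear in the basis functions; the notation here is ours rather than the authors'. With basis evaluations \Phi \in \mathbb{R}^{n \times d}, targets y \in \mathbb{R}^{n}, and regularization strength \lambda > 0,

    \hat{\beta} = (\Phi^{\top}\Phi + \lambda I)^{-1} \Phi^{\top} y,
    \qquad
    \hat{f}(t) = \sum_{j=1}^{d} \hat{\beta}_{j}\, \phi_{j}(t).
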



Supplement: Matrix Completion with Quantified Uncertainty through Low Rank Gaussian Copula

Neural Information Processing Systems

For the first equality, we use Eq. In practice, the result is most useful for small d, such as d = 0. Let us first state a generalization of our Theorem 2. Theorem 4. Suppose x ~ LRGC(W, \sigma^2). The proof applies to each missing dimension j \in M. Let us further define s. For a detailed treatment of sub-Gaussian random distributions, see [10]. A random variable x is sub-Gaussian if (\mathbb{E}|x|^p)^{1/p} \le K\sqrt{p} for all p \ge 1 with some K > 0; the sub-Gaussian norm of x is defined as ||x||_{\psi_2} = \sup_{p \ge 1} p^{-1/2} (\mathbb{E}|x|^p)^{1/p}. Our Lemma 2 is Lemma 17 in [1], which is also a simplified version of Theorem 1 in [4]. To compute (2) and (3), we use the law of total expectation, similarly to Section 1.1, by first treating z. The computation is similar in all cases; we take the first case as an example.
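
For completeness, the law of total expectation invoked above, written in generic notation (ours, not the supplement's): for any integrable function g of x and any latent variable z,

    \mathbb{E}[\, g(x) \,] = \mathbb{E}_{z}\!\big[\, \mathbb{E}[\, g(x) \mid z \,] \,\big].
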