Goto

Collaborating Authors

 Bayesian Inference


WisdomBot: Tuning Large Language Models with Artificial Intelligence Knowledge

arXiv.org Artificial Intelligence

Large language models (LLMs) have emerged as powerful tools in natural language processing (NLP), showing a promising future of artificial generated intelligence (AGI). Despite their notable performance in the general domain, LLMs have remained suboptimal in the field of education, owing to the unique challenges presented by this domain, such as the need for more specialized knowledge, the requirement for personalized learning experiences, and the necessity for concise explanations of complex concepts. To address these issues, this paper presents a novel LLM for education named WisdomBot, which combines the power of LLMs with educational theories, enabling their seamless integration into educational contexts. To be specific, we harness self-instructed knowledge concepts and instructions under the guidance of Bloom's Taxonomy as training data. To further enhance the accuracy and professionalism of model's response on factual questions, we introduce two key enhancements during inference, i.e., local knowledge base retrieval augmentation and search engine retrieval augmentation during inference. We substantiate the effectiveness of our approach by applying it to several Chinese LLMs, thereby showcasing that the fine-tuned models can generate more reliable and professional responses.


A Probabilistic Model for Self-Supervised Learning

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) aims to find meaningful representations from unlabeled data by encoding semantic similarities through data augmentations. Despite its current popularity, theoretical insights about SSL are still scarce. For example, it is not yet known whether commonly used SSL loss functions can be related to a statistical model, much in the same as OLS, generalized linear models or PCA naturally emerge as maximum likelihood estimates of an underlying generative process. In this short paper, we consider a latent variable statistical model for SSL that exhibits an interesting property: Depending on the informativeness of the data augmentations, the MLE of the model either reduces to PCA, or approaches a simple non-contrastive loss. We analyze the model and also empirically illustrate our findings.


Enhancing Robust Fairness via Confusional Spectral Regularization

arXiv.org Artificial Intelligence

Recent research has highlighted a critical issue known as "robust fairness", where robust accuracy varies significantly across different classes, undermining the reliability of deep neural networks (DNNs). A common approach to address this has been to dynamically reweight classes during training, giving more weight to those with lower empirical robust performance. However, we find there is a divergence of class-wise robust performance between training set and testing set, which limits the effectiveness of these explicit reweighting methods, indicating the need for a principled alternative. In this work, we derive a robust generalization bound for the worst-class robust error within the PAC-Bayesian framework, accounting for unknown data distributions. Our analysis shows that the worst-class robust error is influenced by two main factors: the spectral norm of the empirical robust confusion matrix and the information embedded in the model and training set. While the latter has been extensively studied, we propose a novel regularization technique targeting the spectral norm of the robust confusion matrix to improve worst-class robust accuracy and enhance robust fairness. Deep neural networks, spanning a diverse array of domains and applications, have shown impressive abilities to learn from training data and generalize effectively to new, unseen data. However, recent studies have uncovered a notable weakness in these DNNs - their vulnerability to subtle, often undetectable "adversarial attacks" (Biggio et al., 2013; Szegedy et al., 2013). It has been discovered that even slight perturbations to the input, typically imperceptible to humans, can drastically mislead the networks, resulting in significant prediction errors (Goodfellow et al., 2015; Wu et al., 2020a).


Information-theoretic Bayesian Optimization: Survey and Tutorial

arXiv.org Machine Learning

Several scenarios require the optimization of non-convex black-box functions, that are noisy expensive to evaluate functions with unknown analytical expression, whose gradients are hence not accessible. For example, the hyper-parameter tuning problem of machine learning models. Bayesian optimization is a class of methods with state-of-the-art performance delivering a solution to this problem in real scenarios. It uses an iterative process that employs a probabilistic surrogate model, typically a Gaussian process, of the objective function to be optimized computing a posterior predictive distribution of the black-box function. Based on the information given by this posterior predictive distribution, Bayesian optimization includes the computation of an acquisition function that represents, for every input space point, the utility of evaluating that point in the next iteraiton if the objective of the process is to retrieve a global extremum. This paper is a survey of the information theoretical acquisition functions, whose performance typically outperforms the rest of acquisition functions. The main concepts of the field of information theory are also described in detail to make the reader aware of why information theory acquisition functions deliver great results in Bayesian optimization and how can we approximate them when they are intractable. We also cover how information theory acquisition functions can be adapted to complex optimization scenarios such as the multi-objective, constrained, non-myopic, multi-fidelity, parallel and asynchronous settings and provide further lines of research.


Evaluating multiple models using labeled and unlabeled data

arXiv.org Artificial Intelligence

It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1 relative to using labeled data alone and 2.4 relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models. Rigorous evaluation is essential to the safe deployment of machine learning classifiers. The standard approach is to measure classifier performance using a large labeled dataset. In practice, however, labeled data is often scarce (Culotta & McCallum, 2005; Dutta & Das, 2023). Exacerbating the challenge of evaluation, the number of off-the-shelf classifiers has increased dramatically through the widespread usage of model hubs. The modern machine learning practitioner thus has a myriad of trained models, but little labeled data with which to evaluate them. In many domains, unlabeled data is much more abundant than labeled data (Bepler et al., 2019; Sagawa et al., 2021; Movva et al., 2024).


Reviews: Approximate Bayesian Inference for a Mechanistic Model of Vesicle Release at a Ribbon Synapse

Neural Information Processing Systems

The author responses answered my questions as well as points raised by other reviewers, providing additional clarification.] This paper formulates a fully probabilistic model of the vesicle-release dynamics at the sub-cellular biophysical level in the ribbon synapse. The paper then develops a likelihood-free inference method, tests it on a synthetic dataset, and finally infers the parameters of vesicle release in the ribbon synapse from real data. Originality: The paper presents a novel combination of biophysical modeling of ribbon synapse and a likelihood-free inference of the parameters. To my knowledge, the fully stochastic modeling of the vesicle-release dynamics is itself new.


Reviews: Approximate Bayesian Inference for a Mechanistic Model of Vesicle Release at a Ribbon Synapse

Neural Information Processing Systems

This is an interesting paper on a mechanistic model of the ribbon synapse along with an ABC inference approach. Neither component is particularly novel, but the paper is thorough and compelling. The audience will likely be computationally-savvy experimental neuroscientists and those interested in applications of ABC; the former may be harder to find at NeurIPS, though they do exist. I encourage the authors to make the suggested revisions before the camera ready deadline.


Diffusion-aware Censored Gaussian Processes for Demand Modelling

arXiv.org Machine Learning

Inferring the true demand for a product or a service from aggregate data is often challenging due to the limited available supply, thus resulting in observations that are censored and correspond to the realized demand, thereby not accounting for the unsatisfied demand. Censored regression models are able to account for the effect of censoring due to the limited supply, but they don't consider the effect of substitutions, which may cause the demand for similar alternative products or services to increase. This paper proposes Diffusion-aware Censored Demand Models, which combine a Tobit likelihood with a graph diffusion process in order to model the latent process of transfer of unsatisfied demand between similar products or services. We instantiate this new class of models under the framework of GPs and, based on both simulated and real-world data for modeling sales, bike-sharing demand, and EV charging demand, demonstrate its ability to better recover the true demand and produce more accurate out-of-sample predictions.


Uncertainty Quantification With Noise Injection in Neural Networks: A Bayesian Perspective

arXiv.org Machine Learning

Model uncertainty quantification involves measuring and evaluating the uncertainty linked to a model's predictions, helping assess their reliability and confidence. Noise injection is a technique used to enhance the robustness of neural networks by introducing randomness. In this paper, we establish a connection between noise injection and uncertainty quantification from a Bayesian standpoint. We theoretically demonstrate that injecting noise into the weights of a neural network is equivalent to Bayesian inference on a deep Gaussian process. Consequently, we introduce a Monte Carlo Noise Injection (MCNI) method, which involves injecting noise into the parameters during training and performing multiple forward propagations during inference to estimate the uncertainty of the prediction. Through simulation and experiments on regression and classification tasks, our method demonstrates superior performance compared to the baseline model.


Reviews: Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much

Neural Information Processing Systems

I think this paper addresses an important issue and makes valuable contributions, and thus should be published. I have a few concerns, hence my lower rating for the last question above (which I think could be addressed relatively easily, however). I think this is fundamentally *OK* and even perhaps a positive thing. However, I think a bit more discussion needs to be given to how the arguments might be made more formal. For example, in Section 2.1, I think the proof is intended to hold only in the limit of M going to infinity. Please give a stament of what should hold in what limit-- this wasn't clear to me.