Appendix to: Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization

Neural Information Processing Systems

With this paper in particular, we improve the performance of Bayesian optimization on problems with mixed types of inputs. Given the ubiquity of such problems in many practical applications, we believe that our method could lead to positive broader impacts by solving these problems more effectively and efficiently while reducing the costs incurred in solving them. Concrete, high-stakes examples where our method could potentially be applied (some of which have already been demonstrated by the benchmark problems considered in the paper) include, but are not limited to, applications in communications, chemical synthesis, drug discovery, engineering optimization, tuning of recommender systems, and automation of machine learning systems. On the flip side, while the proposed method is ethically neutral, there is potential for misuse given that the exact objective of optimization is ultimately decided by the end users; we believe that practitioners and researchers should be aware of this possibility and aim to mitigate any potential negative impacts to the furthest extent.

Let $\Theta$ be a compact metric space, and consider the set of functionals $\Phi = \{\varphi \ \text{s.t.}\ \varphi : \Theta \to \mathcal{P}(\mathcal{Z})\}$. Since $\mathcal{Z}$ is finite, each element $\varphi \in \Phi$ can be expressed as a mapping from $\Theta$ to $\mathbb{R}^{|\mathcal{Z}|}$.

Lemma 1. Suppose $\alpha(x, z)$ is continuous in $x$ for every $z \in \mathcal{Z}$ and that $\varphi : \theta \mapsto p(\cdot \mid \theta) \in \mathcal{P}(\mathcal{Z})$ is continuous. Then the probabilistic objective $\hat{\alpha}(x, \theta) = \mathbb{E}_{z \sim p(z \mid \theta)}[\alpha(x, z)] : \mathcal{X} \times \Theta \to \mathbb{R}$ is continuous (using that $\varphi$ is continuous and $\alpha$ is bounded). Since both $\mathcal{X}$ and $\Theta$ are compact, $\hat{\alpha}$ attains its maximum, i.e., there exists a maximizer $(x^*, \theta^*)$.

Corollary 2. Suppose the optimizer of $g$ is unique, i.e., that the set of maximizers is a singleton.

Corollary 3. Consider the following mappings: Binary: $\varphi : [0, 1] \to \mathcal{P}(\{0, 1\})$ with $\theta \mapsto \mathrm{Bernoulli}(\theta)$ (analogous mappings apply for ordinal and categorical parameters). These mappings satisfy the conditions of Lemma 1.
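To make the binary mapping in Corollary 3 concrete, the worked example below (our own elaboration, using the notation reconstructed above, not text quoted from the paper) writes out the probabilistic objective in closed form:

```latex
% Binary probabilistic reparameterization: z ~ Bernoulli(theta).
\[
  \hat{\alpha}(x,\theta)
  \;=\; \mathbb{E}_{z \sim \mathrm{Bernoulli}(\theta)}\bigl[\alpha(x,z)\bigr]
  \;=\; \theta\,\alpha(x,1) + (1-\theta)\,\alpha(x,0),
  \qquad \theta \in [0,1].
\]
% \hat{\alpha} is affine (hence continuous) in \theta, so Lemma 1 applies; its
% maximum over the compact set [0,1] is attained at an endpoint \theta \in \{0,1\},
% recovering the maximizer of \alpha over z \in \{0,1\}.
```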


Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization

Neural Information Processing Systems

Optimizing expensive-to-evaluate black-box functions of discrete (and potentially continuous) design parameters is a ubiquitous problem in scientific and engineering applications. Bayesian optimization (BO) is a popular, sample-efficient method that leverages a probabilistic surrogate model and an acquisition function (AF) to select promising designs to evaluate. However, maximizing the AF over mixed or high-cardinality discrete search spaces is challenging: standard gradient-based methods cannot be used directly, and evaluating the AF at every point in the search space would be computationally prohibitive. To address this issue, we propose using probabilistic reparameterization (PR). Instead of directly optimizing the AF over the search space containing discrete parameters, we maximize the expectation of the AF over a probability distribution defined by continuous parameters. We prove that under suitable reparameterizations, the BO policy that maximizes the probabilistic objective is the same as the one that maximizes the AF, and therefore PR enjoys the same regret bounds as the original BO policy using the underlying AF.
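To illustrate the idea, the following minimal Python sketch maximizes the expectation of a toy AF over a Bernoulli reparameterization of a single binary parameter, using a score-function (REINFORCE-style) gradient. The function `af`, the step size, and the sample budget are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

def af(x_cont, z_bin):
    # Hypothetical stand-in acquisition function over one continuous and one
    # binary input; in practice this would be e.g. expected improvement under
    # a Gaussian process surrogate.
    return -(x_cont - 0.3) ** 2 + 0.5 * z_bin * np.cos(4.0 * x_cont)

def pr_value_and_grad(x_cont, theta, rng, n_samples=256):
    # Monte Carlo estimate of E_{z ~ Bernoulli(theta)}[af(x, z)] and its
    # score-function (REINFORCE) gradient with respect to theta.
    z = rng.random(n_samples) < theta
    vals = af(x_cont, z.astype(float))
    # d/dtheta log p(z | theta) = z / theta - (1 - z) / (1 - theta)
    score = z / theta - (~z) / (1.0 - theta)
    return vals.mean(), (vals * score).mean()

rng = np.random.default_rng(0)
theta = 0.5
for _ in range(200):  # projected stochastic gradient ascent on theta
    _, g = pr_value_and_grad(0.3, theta, rng)
    theta = float(np.clip(theta + 0.05 * g, 1e-3, 1.0 - 1e-3))
print(f"theta after ascent: {theta:.3f}")  # drifts toward 1 for this toy af
```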


To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty

Neural Information Processing Systems

We explore uncertainty quantification in large language models (LLMs), with the goal of identifying when the uncertainty in responses to a given query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from a lack of knowledge about the ground truth (such as facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows us to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model, obtained simply by a special iterative prompting scheme based on the previous responses. Such quantification, for instance, allows us to detect hallucinations (cases where epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response), where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments that demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.
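A rough sketch of the flavor of such a computation: sample a first-round answer, insert it back into the prompt, sample a second-round answer, and measure how strongly the two rounds are coupled. The plug-in mutual-information score below is our simplified stand-in for this intuition, not the paper's exact metric:

```python
from collections import Counter
import math

def mi_epistemic_score(chains):
    # chains: list of (y1, y2) pairs, where y1 is a first-round answer and y2
    # is a second-round answer sampled after y1 was inserted into the prompt.
    # Plug-in mutual information between rounds: near 0 when the rounds are
    # independent (purely aleatoric spread), large when the model is easily
    # swayed by its own prior output (a symptom of epistemic uncertainty).
    n = len(chains)
    p1 = Counter(y1 for y1, _ in chains)
    p2 = Counter(y2 for _, y2 in chains)
    p12 = Counter(chains)
    return sum((c / n) * math.log(c * n / (p1[y1] * p2[y2]))
               for (y1, y2), c in p12.items())

# Toy usage with hypothetical sampled answers:
chains = [("Paris", "Paris"), ("Paris", "Paris"),
          ("Lyon", "Lyon"), ("Paris", "Paris")]
print(mi_epistemic_score(chains))  # > 0: the second round tracks the first
```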


Supplementary Materials - VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

Neural Information Processing Systems

Self-supervised learning trains an encoder to extract informative representations from the unlabeled data. Semi-supervised learning uses the trained encoder to learn a predictive model on both labeled and unlabeled data.

Figure 3: The proposed data corruption procedure.

In the experiments section of the main manuscript, we evaluate VIME and its benchmarks on 11 datasets (6 genomics, 2 clinical, and 3 public datasets). Here, we provide the basic data statistics for the 11 datasets in Table 1.
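For reference, the corruption procedure of Figure 3 can be sketched as follows: a minimal NumPy version, assuming the standard VIME recipe of masking entries and refilling them from each feature's empirical marginal (variable names are ours):

```python
import numpy as np

def vime_corrupt(X, p_mask=0.3, rng=None):
    # Draw a binary mask, then replace masked entries with values sampled from
    # each feature's empirical marginal (implemented by permuting each column).
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    mask = rng.binomial(1, p_mask, size=(n, d))
    X_bar = np.stack([rng.permutation(X[:, j]) for j in range(d)], axis=1)
    X_tilde = mask * X_bar + (1 - mask) * X
    # mask is the target of the mask-estimation pretext task; X_tilde is the
    # input from which the encoder must also reconstruct the original X.
    return X_tilde, mask

X = np.random.default_rng(1).normal(size=(100, 8))
X_tilde, mask = vime_corrupt(X)
```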


A Probability Contrastive Learning Framework for 3D Molecular Representation Learning

Neural Information Processing Systems

Contrastive Learning (CL) plays a crucial role in molecular representation learning, enabling unsupervised learning from large-scale unlabeled molecule datasets. It has inspired various applications in molecular property prediction and drug design. However, existing molecular representation learning methods often introduce potential false positive and false negative pairs through conventional graph augmentations such as node masking and subgraph removal, which can lead to suboptimal performance when standard contrastive learning techniques are applied to molecular datasets. To address this issue, we propose a novel probability-based contrastive learning (CL) framework. Unlike conventional methods, our approach introduces a learnable weight distribution via Bayesian modeling to automatically identify and mitigate false positive and false negative pairs. This method is particularly effective because it adjusts dynamically to the data, improving the accuracy of the learned representations. Our model is trained by a stochastic expectation-maximization process, which iteratively refines the probability estimates of the sample weights and updates the model parameters. Experimental results indicate that our method outperforms existing approaches on 13 out of 15 molecular property prediction benchmarks in the MoleculeNet dataset and on 8 out of 12 tasks in the QM9 benchmark, achieving new state-of-the-art results on average.
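To illustrate how per-pair weights can enter such an objective, here is a hedged PyTorch sketch of a weighted InfoNCE-style loss. In the paper the weights follow a learnable Bayesian distribution updated by stochastic EM; here they are passed in as fixed tensors, and `w_pos`/`w_neg` are our placeholder names:

```python
import torch
import torch.nn.functional as F

def weighted_info_nce(z1, z2, w_pos, w_neg, tau=0.2):
    # z1, z2: [n, d] embeddings of two views; row i of z1 and z2 form a
    # nominal positive pair. w_pos[i] in [0, 1] downweights suspected false
    # positives; w_neg[i, j] downweights suspected false negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # [n, n] scaled cosine similarities
    pos = sim.diag()                        # positives on the diagonal
    neg_mask = 1.0 - torch.eye(z1.size(0), device=z1.device)
    # Suspected false negatives contribute less to the denominator.
    denom = pos.exp() + (w_neg * neg_mask * sim.exp()).sum(dim=1)
    # Suspected false positives contribute less to the overall loss.
    return -(w_pos * (pos - denom.log())).mean()

n, d = 32, 128
z1, z2 = torch.randn(n, d), torch.randn(n, d)
w_pos = torch.full((n,), 0.9)     # placeholder weights; in the paper these
w_neg = torch.full((n, n), 0.8)   # would come from the E-step estimates
print(weighted_info_nce(z1, z2, w_pos, w_neg))
```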


The Price of Implicit Bias in Adversarially Robust Generalization

Neural Information Processing Systems

We study the implicit bias of optimization in robust empirical risk minimization (robust ERM) and its connection with robust generalization. In classification settings under adversarial perturbations with linear models, we study what type of regularization should ideally be applied for a given perturbation set to improve (robust) generalization. We then show that the implicit bias of optimization in robust ERM can significantly affect the robustness of the model, and we identify two ways this can happen: through the optimization algorithm or through the architecture. We verify our predictions in simulations with synthetic data and experimentally study the importance of implicit bias in robust ERM with deep neural networks.
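As a concrete example of how a perturbation set induces a specific regularizer in the linear setting, the sketch below uses the standard closed form for an l-infinity-bounded adversary with a linear model (a textbook fact we include for illustration, not code from the paper):

```python
import numpy as np

def robust_logistic_loss(w, X, y, eps):
    # Linear model, l_inf-bounded adversary: the inner maximization
    #   max_{||delta||_inf <= eps} log(1 + exp(-y * <w, x + delta>))
    # is attained at delta = -eps * y * sign(w), so the worst-case margin is
    # y * <w, x> - eps * ||w||_1. The l_inf perturbation set thus couples to
    # the l_1 norm of the weights, the kind of geometry/regularization
    # interplay studied here.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return np.log1p(np.exp(-margins)).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w = rng.normal(size=10)
y = np.sign(X @ w)
print(robust_logistic_loss(w, X, y, eps=0.0))  # standard logistic loss
print(robust_logistic_loss(w, X, y, eps=0.1))  # strictly larger robust loss
```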



Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

Neural Information Processing Systems

Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels. Moreover, we compute the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyze the strikingly rich phenomenology that arises. We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior and can diverge at the interpolation boundary, even in the absence of label noise. The divergence is caused by the interaction between sampling and initialization and can therefore be eliminated by marginalizing over samples (i.e., bagging) or over the initialization (i.e., ensemble learning).
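One way to picture such a decomposition is an ANOVA-style Monte Carlo estimate over a grid of seeds. The sketch below is our own illustration of that idea, not the paper's estimator (the paper computes the decomposition analytically for random feature kernel regression):

```python
import numpy as np

def variance_components(preds):
    # preds[i, j, k]: prediction at one fixed test point from a model trained
    # with data-sample seed i, initialization seed j, and label-noise seed k.
    total = preds.var()
    # Main effects: variance surviving after averaging out the other factors.
    v_sample = preds.mean(axis=(1, 2)).var()
    v_init = preds.mean(axis=(0, 2)).var()
    v_label = preds.mean(axis=(0, 1)).var()
    # What remains is attributed to interactions between the sources, e.g. the
    # sampling-by-initialization term that the abstract says drives the
    # divergence at the interpolation boundary.
    v_interaction = total - (v_sample + v_init + v_label)
    return dict(total=total, sample=v_sample, init=v_init,
                label=v_label, interactions=v_interaction)

preds = np.random.default_rng(0).normal(size=(8, 8, 8))  # toy stand-in grid
print(variance_components(preds))
```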


Challenges of Generating Structurally Diverse Graphs

Neural Information Processing Systems

For many graph-related problems, it can be essential to have a set of structurally diverse graphs. For instance, such graphs can be used for testing graph algorithms or their neural approximations. However, to the best of our knowledge, the problem of generating structurally diverse graphs has not been explored in the literature. In this paper, we fill this gap. First, we discuss how to define diversity for a set of graphs, why this task is non-trivial, and how one can choose a proper diversity measure. Then, for a given diversity measure, we propose and compare several algorithms optimizing it: we consider approaches based on standard random graph models, local graph optimization, genetic algorithms, and neural generative models. We show that it is possible to significantly improve diversity over basic random graph generators. Additionally, our analysis of generated graphs allows us to better understand the properties of graph distances: depending on which diversity measure is used for optimization, the obtained graphs may possess very different structural properties, which gives further insight into the graph distance underlying the diversity measure.
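As a minimal illustration of the random-graph baseline, the sketch below scores a pool of Erdős–Rényi graphs under an average-pairwise-distance diversity measure. The degree-histogram distance is a cheap stand-in we chose for illustration, not one of the paper's distances:

```python
import itertools
import networkx as nx
import numpy as np

def degree_hist_dist(g, h, bins=16):
    # Cheap stand-in graph distance: L1 distance between degree histograms.
    def hist(graph):
        deg = np.array([d for _, d in graph.degree()], dtype=float)
        return np.histogram(deg, bins=bins, range=(0, 32), density=True)[0]
    return float(np.abs(hist(g) - hist(h)).sum())

def diversity(graphs, dist):
    # Average pairwise distance within the set; other choices (e.g. the
    # minimum pairwise distance) emphasize different notions of spread.
    pairs = list(itertools.combinations(graphs, 2))
    return float(np.mean([dist(g, h) for g, h in pairs]))

# Baseline: how diverse is a pool of Erdos-Renyi graphs with varied density?
rng = np.random.default_rng(0)
pool = [nx.erdos_renyi_graph(32, float(p), seed=int(s))
        for p, s in zip(rng.uniform(0.05, 0.6, 20), rng.integers(0, 10**6, 20))]
print(diversity(pool, degree_hist_dist))
```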


Supplemental Materials: Data Augmentation MCMC for Bayesian Inference from Privatized Data

S-1 Statement on Societal Impacts

Neural Information Processing Systems

We do not foresee direct negative societal impact from the current work. Admittedly, our method is based on imputing the confidential database which privacy mechanisms seek to protect. We can assure the reader that such imputations are based on formally differentially private data products and hence do not violate differential privacy. Also, one may argue that our work is catalytic to enhancing the 'disclosure risk' of individuals: an adversary might be able to make accurate posterior inferences about an individual if the adversary has highly informative and correct prior and modeling information to begin with. Granted, no existing privacy framework can guard against this.