synthetic experiment
Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix
Warrior, Kane, Chakrabarty, Dalia
Gaussian Process (GP) models are widely used as probabilistic models for nonlinear functions because they combine flexible function modelling with uncertainty quantification (Rasmussen and Williams, 2006; Williams, 1998; MacKay, 1992; Neal, 1995). Their predictive performance depends heavily on how kernel hyperparameters are learnt (Sundararajan and Keerthi, 2001). This becomes especially important in higher-dimensional multivariate settings, where many input-specific hyperparameters may be present and where only some inputs may contribute meaningful predictive structure (MacKay, 1992; Neal, 1995; Rasmussen and Williams, 2006; Linkletter et al., 2006; Paananen et al., 2019). In standard Bayesian formulations of GP learning, prior specification is usually imposed directly on kernel hyperparameters such as lengthscales, amplitude parameters, and noise terms (Rasmussen and Williams, 2006; Williams, 1998). This is natural from a modelling point of view, but it does not always give useful control over the covariance structure that those hyperparameters induce over the observed design points (Barnard et al., 2000; Gelman, 2006; Daniels and Kass, 1999; Huang and Wand, 2013). However, it is this induced covariance matrix that directly governs likelihood evaluation, numerical stability, and predictive behaviour (Rasmussen and Williams, 2006; Stein, 1999).
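To make the role of the induced covariance matrix concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a squared-exponential kernel on one-dimensional inputs, and all function names (`rbf_kernel`, `cholesky`, `log_marginal_likelihood`, `min_pivot`) are hypothetical. It shows that the lengthscale enters the log marginal likelihood only through the covariance matrix over the design points, and it uses the smallest Cholesky pivot as a rough proxy for numerical stability.

```python
import math


def rbf_kernel(xs, lengthscale, amplitude=1.0):
    """Squared-exponential covariance matrix over 1-D design points (assumed kernel)."""
    return [[amplitude ** 2 * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)
             for b in xs] for a in xs]


def cholesky(mat):
    """Lower-triangular Cholesky factor; raises ValueError if a pivot is not positive."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                pivot = mat[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not numerically positive definite")
                low[i][j] = math.sqrt(pivot)
            else:
                low[i][j] = (mat[i][j] - s) / low[j][j]
    return low


def noisy_covariance(xs, lengthscale, noise_var):
    """Covariance induced over the design points, plus observation noise on the diagonal."""
    cov = rbf_kernel(xs, lengthscale)
    for i in range(len(xs)):
        cov[i][i] += noise_var
    return cov


def log_marginal_likelihood(xs, ys, lengthscale, noise_var):
    """Zero-mean GP log marginal likelihood, computed from the induced covariance."""
    n = len(xs)
    low = cholesky(noisy_covariance(xs, lengthscale, noise_var))
    # Forward substitution: solve low @ z = ys, so that z.z = ys^T K^{-1} ys.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2.0 * math.pi)


def min_pivot(xs, lengthscale, noise_var):
    """Smallest Cholesky pivot; small values indicate a near-singular covariance."""
    low = cholesky(noisy_covariance(xs, lengthscale, noise_var))
    return min(low[i][i] for i in range(len(xs)))


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.sin(2.0 * math.pi * x) for x in xs]
for ell in (0.1, 1.0, 10.0):
    print(ell, log_marginal_likelihood(xs, ys, ell, 1e-6), min_pivot(xs, ell, 1e-6))
```

In this sketch a long lengthscale drives the covariance matrix towards a rank-one matrix, so the smallest pivot shrinks towards the noise level even though the prior on the lengthscale itself may look unremarkable.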
Appendix - An Image is Worth More Than a Thousand Words: Towards Disentanglement in The Wild
We use the images at 256 × 256 resolution. We follow [21] and use all the images for training. The images used for the qualitative visualizations contain random images from the web and samples from CelebA-HQ.
AFHQ [8]: 15,000 high-quality images categorized into three domains: cat, dog, and wildlife. We use the images at 128 × 128 resolution, holding out 500 images from each domain for testing.
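The per-domain hold-out described for AFHQ could be sketched as follows. This is an illustrative assumption only: the source does not say how the 500 test images per domain are selected, so the seeded random selection, the function name `split_holdout`, and the file names below are all hypothetical.

```python
import random


def split_holdout(paths_by_domain, n_test=500, seed=0):
    """Hold out n_test items per domain; random selection is an assumption."""
    rng = random.Random(seed)
    train, test = {}, {}
    for domain, paths in paths_by_domain.items():
        if len(paths) < n_test:
            raise ValueError(f"domain {domain!r} has fewer than {n_test} images")
        shuffled = sorted(paths)  # sort first so the split is reproducible
        rng.shuffle(shuffled)
        test[domain] = shuffled[:n_test]
        train[domain] = shuffled[n_test:]
    return train, test


# Hypothetical file names standing in for the three AFHQ domains.
example = {
    domain: [f"{domain}/{i:05d}.jpg" for i in range(5000)]
    for domain in ("cat", "dog", "wildlife")
}
train_split, test_split = split_holdout(example)
print({d: (len(train_split[d]), len(test_split[d])) for d in example})
```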
Appendix
In this section we motivate the design choices and inductive biases that we encode into our neural encoder network e, which is the network used to model the relative accuracies of the weak supervision sources $\lambda$. Recall that we model the probability of a particular sample $x \in X$ having the class label $y \in Y = \{1, \dots, C\}$ as

$$P_\theta(y \mid \lambda) = \mathrm{softmax}(s)_y \, P(y), \qquad s = \theta(\lambda, x)^T \lambda \in \mathbb{R}^C. \tag{4}$$

Connection to prior PGM models. We now motivate this choice by deriving a less expressive variant of it from the standard Markov Random Field (MRF) used in the related work. If we view the attention scores $\theta(\lambda, x) \in \mathbb{R}^m$, which assign sample-dependent accuracies to each labeling function, as sample-independent parameters $\theta_1$ and thereby drop the features from the equation, as is done in the related work [30, 32, 19, 11], we can rewrite Eq. 4 as $P_\theta(y \mid \lambda) \propto \exp\left(\theta_1^T \mathbf{1}\{\lambda = y\}\right) P(y)$. We can recognize $P_\theta$ as a distribution from the exponential family, and more specifically as a pairwise MRF, or factor graph, with canonical parameters $\theta = (\theta_1, \theta_2)$ and corresponding sufficient statistics, or factors, $\phi(\lambda, y) = (\phi_1(\lambda, y), \phi_2(\lambda))$, as well as the log partition function $Z_\theta$. The accuracy factors and parameters $\phi_1, \theta_1$ are the core component of this model and sometimes take the form $\phi_1(\lambda, y) = \lambda y$ in binary models, as in [30, 19, 11]. The label-independent factors $\phi_2(\lambda)$ have, as can be seen from the derivation above, no direct influence on the latent label posterior, but are often used to model labeling propensities $\mathbf{1}\{\lambda \neq 0\}$ and correlation dependencies $\mathbf{1}\{\lambda_i = \lambda_j\}$, which can be important for PGM parameter learning but are susceptible to misspecification [39, 11, 8].
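The sample-independent special case can be illustrated numerically. The sketch below is not the authors' code: it assumes labeling-function outputs in {0, 1, ..., C} with 0 meaning abstain, treats the accuracies as fixed parameters $\theta_1$, and normalizes the product of the exponentiated score and the class prior over the classes. The function name `label_posterior` is hypothetical.

```python
import math


def label_posterior(votes, accuracies, class_prior):
    """Posterior over classes 1..C, proportional to exp(theta_1^T 1{lambda = y}) * P(y).

    votes: labeling-function outputs in {0, 1, ..., C}; 0 = abstain (assumption).
    accuracies: one sample-independent accuracy parameter per labeling function.
    class_prior: P(y) for y = 1..C.
    """
    num_classes = len(class_prior)
    # Score for class y: sum of accuracies of the labeling functions that voted y.
    scores = [
        sum(theta for vote, theta in zip(votes, accuracies) if vote == y)
        for y in range(1, num_classes + 1)
    ]
    max_score = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - max_score) * p for s, p in zip(scores, class_prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three labeling functions, two classes: two votes for class 1, one for class 2.
print(label_posterior([1, 1, 2], [1.0, 1.0, 1.0], [0.5, 0.5]))
```

Label-independent factors such as propensities never enter this computation, which matches the observation that they have no direct influence on the latent label posterior; when every labeling function abstains, the sketch returns the class prior.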
Simultaneous Missing Value Imputation and Structure Learning with Groups
Understanding the structural relationships among different variables provides critical insights in many real-world applications, such as medicine, economics and education [42, 62]. Thus, learning graphs from observed data, known as structure learning, has recently made remarkable progress [10, 61, 63, 64]. For many applications, variables in the data can be gathered into semantically meaningful groups, where useful insights are at group level. For example, in finance, one may be interested in how a financial situation influences different industries (i.e.
Synthetic experiments (R2, R4)
Teacher learning curve for Frozen lake: the student return induced by the teaching policy at the end of the curriculum improves as CISR trains more students. For CISR, we evaluate a teacher policy trained with 30 students on new test students, while Bandit learns by explore-exploit for each student, as [27] cannot learn from previous students. Thank you for your helpful comments! Using multiple students enables CISR's key novelty - allowing the teacher to learn This makes CISR applicable, e.g., in a flavor of sim-to-real transfer where a curriculum policy is learned in Thus, we have at least 270 possible curricula. That CISR determines a good one after only 10 students attests to its learning ability.