Reviews: Learning Treewidth-Bounded Bayesian Networks with Thousands of Variables

Neural Information Processing Systems

The proposed method is very similar to previous work by Nie et al.: both use k-trees to search for low-treewidth Bayesian networks, both start with a randomly chosen initial clique, and both propose using an A* method for finding the best tree. The differences are that Nie et al. score k-trees using a mutual information score and use BDeu for choosing the final consistent Bayesian network, while this paper proposes using BIC and incrementally building the Bayesian network along with the k-tree, using the BN to score the k-tree. This paper also includes the additional restriction that the complete variable (partial) order is chosen randomly, while in Nie et al. The main justification for these differences is the ability to scale to large treewidths. However, in the experiments, the previous S2 algorithm can also scale to large treewidths.


Reviews: Kernel Bayesian Inference with Posterior Regularization

Neural Information Processing Systems

This paper provides an interesting connection between kernel Bayesian inference and vector-valued regression. Based on this, a new regularization method is provided to compute an approximation of the kernel embedding of the posterior distribution. Simulation results look promising, suggesting that the new method improves on many existing methods. However, as a non-expert, from reading the current introduction, I'm still confused about the motivation for using kernel Bayesian inference: in order to approximate the kernel embedding of the posterior, a sample of i.i.d. draws (x_i, y_i) from the joint distribution of the parameter/hidden variable (X in the paper) and data (Y in the paper) is assumed to be available. First, obtaining samples x_i from the posterior is a highly non-trivial problem.


Reviews: Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Neural Information Processing Systems

Overall, I found the paper interesting; the paper offers new theory as well as numerical results comparable to the state of the art on decently difficult datasets. Perhaps due to space constraints, an important part of the paper (section 3.2), the inference algorithm, is poorly explained. In particular, I initially thought that the use of particles meant that the approximating distribution was a sum of Dirac delta functions, but that cannot be the case since, even with many particles, the 'posterior' would degenerate into the MAP (note that in similar work, authors either use particles when p(x) involves discrete x variables, as in Kulkarni et al., or 'smooth' the particles to approximate a continuous distribution, as in Gershman et al.). Instead, it looks like the algorithm works directly on samples of the distributions q_0, q_1, ... (hence the vague 'for whatever distribution q that {x_i}_{i=1}^n currently represents'). It is tempting to consider q_i to be a kernel density estimate (mixture of normals with fixed width), and see if we can approximate equation 9 for that representation to be stable.
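
To make the particle question concrete, here is a minimal sketch of the commonly stated SVGD particle update, in which each particle is moved by a kernel-weighted average of the score plus a kernel-gradient term that pushes particles apart. Everything specific in it is an assumption made for illustration and is not taken from the paper under review: the 1-D Gaussian target N(2, 1), the fixed RBF bandwidth, the step size, and the names `svgd_step` and `grad_log_p`.

```python
import numpy as np

def svgd_step(x, grad_log_p, bandwidth=0.5, step=0.1):
    """One SVGD update for 1-D particles x of shape (n,)."""
    diff = x[:, None] - x[None, :]                  # diff[j, i] = x_j - x_i
    k = np.exp(-diff ** 2 / (2 * bandwidth ** 2))   # RBF kernel k(x_j, x_i)
    grad_k = -diff / bandwidth ** 2 * k             # d k(x_j, x_i) / d x_j
    # Driving term (kernel-weighted scores) plus repulsive term (kernel gradient),
    # averaged over the particles x_j for every particle x_i.
    phi = (k * grad_log_p(x)[:, None] + grad_k).mean(axis=0)
    return x + step * phi

def grad_log_p(x):
    """Score of the assumed target N(2, 1)."""
    return -(x - 2.0)

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=50)   # arbitrary initial particles
for _ in range(2000):
    particles = svgd_step(particles, grad_log_p)

print("mean:", particles.mean(), "std:", particles.std())
```

In this sketch the `grad_k` term is the only thing that keeps particles apart; the particles themselves are plain samples, with no density placed around each one.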


Reviews: A Bayesian method for reducing bias in neural representational similarity analysis

Neural Information Processing Systems

The paper explains well how computing RSA using estimates of regression weights can result in a biased similarity matrix. However, in many cases in neuroscience, the RSA is computed directly on the patterns of activity, and not on the estimates of the regression weights beta. This diminishes the relevance of this paper to the neuroscience field. The authors very briefly address this alternate way of computing RSA in lines 123-128. It is unclear how this alternative RSA computation is biased if it does not depend on a proxy for beta estimates, and this needs to be addressed further.
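
The bias from estimated regression weights can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's model: the noise is i.i.d. Gaussian with unit variance (real fMRI noise is not), the true weights are all zero, and the design matrix `X` with overlapping regressors is hypothetical. Under these assumptions the ordinary least-squares estimates have covariance (X^T X)^{-1} across conditions, so the similarity matrix shows structure inherited from the design alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_cond, n_voxels = 200, 4, 5000

# Hypothetical design: each regressor overlaps with its neighbour.
base = rng.normal(size=(n_time, n_cond))
X = base + 0.5 * np.roll(base, 1, axis=1)

# Pure-noise data: the true regression weights are zero for every voxel.
Y = rng.normal(size=(n_time, n_voxels))

# Ordinary least-squares estimates, shape (n_cond, n_voxels).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Covariance of the estimates across voxels vs. the design-induced prediction.
empirical_cov = beta_hat @ beta_hat.T / n_voxels
expected_cov = np.linalg.inv(X.T @ X)      # sigma^2 (X^T X)^{-1} with sigma = 1

# Correlation-based similarity matrix between conditions.
similarity = np.corrcoef(beta_hat)
print(np.round(similarity, 2))
```

The off-diagonal entries of `similarity` are far from zero even though no condition carries any signal.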


Reviews: Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making

Neural Information Processing Systems

The authors motivate the proposed model with the setting in which items have "true" but unobserved labels/ratings and the observed labels/ratings given by evaluators are potentially incorrect. This differs from the very common problem in recommendation systems or collaborative filtering where evaluators provide their subjective ratings but there is not assumed to be any "true" rating (e.g., users of Netflix giving 1-5 star ratings to movies). This seems like a common but underexplored setting that is worthy of further study within machine learning. The authors are also right to highlight interpretability as a desired aspect of any machine learning solution that may yield post-hoc insights into common human biases and thus suggest corrective measures. This paper does a good job of motivating the proposed model and situating it within the crowdsourcing and human annotation literature.


Reviews: Near-Optimal Smoothing of Structured Conditional Probability Matrices

Neural Information Processing Systems

If my understanding is correct, Theorem 1 of the authors does not quite apply to their algorithm ADD-1/2-Smoothed Low-Rank. Instead, it applies to the non-computable algorithm where they assume that they have a minimizer of the objective function in Theorem 3. It is not clear if the alternating optimization algorithm proposed in the paper is guaranteed to converge to a minimizer of the objective in Theorem 3. If this is true, the authors should mention this before stating Theorem 1 to avoid misleading the reader. The "discounting" seems important from the Experiments section, but it is not described in the main paper. If it is so important, the authors should make room for it in the main paper. The main results (Theorems 1 and 2) are not so surprising given that this is almost a parametric estimation problem with mk parameters (so the rates should be km/n).


Reviews: PAC-Bayesian Theory Meets Bayesian Inference

Neural Information Processing Systems

The paper is well written and theoretically strong. It's been conjectured in the past that there should be links between PAC-Bayes theory and Bayesian inference, but to my knowledge this is the first theoretically complete demonstration of such links. Some comments: - In Eq. (8) (and above) the notion of a prior with bounded likelihood is introduced. Am I right in thinking that this is a data-dependent prior, since whether the likelihood is bounded for a given prior can only be known after observing the data? If this is not the case, can you explain how such a prior is possible?


Reviews: Rényi Divergence Variational Inference

Neural Information Processing Systems

This is a very good and technically sound paper, containing a significant amount of material. The theoretical investigation of the properties of alpha-divergence minimization is thorough, clear and detailed. The paper provides significant theoretical insight and understanding into alpha-divergence minimization and optimization-based approximate inference in general. My biggest concern about the alpha-divergence framework is whether its theoretical richness and elegance actually translate to practical methods. In other words, I'm not sure that its practical aspects are appealing enough to convince practitioners of variational inference to switch to alpha-divergence minimization.


Reviews: Reward Augmented Maximum Likelihood for Neural Structured Prediction

Neural Information Processing Systems

The paper is a superbly written account of a simple idea that appears to work very well. The approach can straightforwardly be applied to existing max-likelihood (ML) trained models so as to take the task reward into account, in principle, during training, and it is computationally much more efficient than alternative non-ML-based approaches. This work risks being underappreciated as proposing merely a simple addition of artificial structured-label noise, but I think the specific link with structured output task reward is sufficiently original, and the paper also uncovers important theoretical insight by revealing the formal relationship between the proposed reward augmented ML and RL-based regularized expected reward objectives. So while it works surprisingly well, you haven't yet clearly demonstrated empirically that using a truly *task-reward derived* payoff distribution is beneficial. One way to convincingly demonstrate that would be if you did your envisioned BLEU importance reweighted sampling, and were able to show that it improves the BLEU test score over your current simpler edit-distance based label noise.
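
For readers unfamiliar with the payoff distribution discussed above, the following is a minimal sketch of an exponentiated-payoff distribution, q(y | y*) proportional to exp(r(y, y*) / tau), with the reward r taken to be the negative edit distance. The candidate set, the temperature `tau`, and the function names are hypothetical, and the candidates are enumerated only to keep the sketch small; it is not the authors' sampling procedure.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between sequences a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def payoff_distribution(target, candidates, tau=1.0):
    """Normalized exp(-edit_distance / tau) over a finite candidate set."""
    scores = [-edit_distance(c, target) / tau for c in candidates]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

target = "the cat sat".split()
candidates = [
    "the cat sat".split(),
    "the cat sits".split(),
    "a cat sat down".split(),
    "dogs bark loudly".split(),
]
probs = payoff_distribution(target, candidates, tau=0.7)
for cand, p in zip(candidates, probs):
    print(" ".join(cand), round(p, 3))
```

Swapping `edit_distance` for a BLEU-based reward would give the task-reward-derived variant that the review asks to see evaluated.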


Reviews: Learning under uncertainty: a comparison between R-W and Bayesian approach

Neural Information Processing Systems

This is an interesting modeling and model comparison paper, providing insights into the processing of uncertainty during learning and decision making. The paper combines advances that could be interesting to both experimental and modeling audiences. However, its clarity should be improved and parameter estimation details explained much better for the paper to be acceptable to NIPS. More specifically: - Why should highly volatile environments have high learning rates (line 2 of page 2)? Couldn't it plausibly lead to excessive weight instability?
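
The learning-rate question can be probed with a small simulation of the Rescorla-Wagner (delta-rule) update V <- V + alpha * (r - V). This is a sketch under assumptions that do not come from the paper: rewards are Bernoulli, the volatile environment switches between reward probabilities 0.9 and 0.1 every 20 trials, and the two learning rates (0.02 and 0.5) are arbitrary illustrative values.

```python
import numpy as np

def rw_tracking_error(alpha, reward_probs, rng):
    """Mean squared error between the R-W estimate and the true reward probability."""
    v = 0.5                                   # initial value estimate
    sq_err = 0.0
    for p in reward_probs:
        sq_err += (v - p) ** 2                # error before seeing this trial's reward
        r = float(rng.random() < p)           # Bernoulli reward
        v += alpha * (r - v)                  # delta-rule update
    return sq_err / len(reward_probs)

n_trials = 2000
stable = np.full(n_trials, 0.9)
volatile = np.where((np.arange(n_trials) // 20) % 2 == 0, 0.9, 0.1)

errors = {}
for env_name, probs in (("stable", stable), ("volatile", volatile)):
    for alpha in (0.02, 0.5):
        rng = np.random.default_rng(0)
        errors[(env_name, alpha)] = rw_tracking_error(alpha, probs, rng)

for key, value in errors.items():
    print(key, round(value, 4))
```

In this toy setting the high learning rate tracks the switching probabilities better, while the low learning rate gives a less noisy estimate when nothing changes, which is the trade-off behind the instability concern raised above.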