
An Introduction to Working with BERT in Practice


Luckily, smaller pretrained BERT and XLNet models are increasingly available for free, and they can serve as starting points for fine-tuning. In practice, this means you download a pretrained BERT or XLNet model, incorporate it into your network, and fine-tune it on a much smaller, more manageable dataset. In this article, we'll see how that works. First, let's incorporate an existing BERT model into our own model. For this to work, we need a dedicated BERT layer: a landing hub for BERT models.

Constructing Gaussian Processes for Probabilistic Graphical Models

AAAI Conferences

Probabilistic graphical models have been successfully applied in many fields, e.g., medical diagnosis and bio-statistics. Several specific extensions have been developed to handle, e.g., time-series data or Gaussian distributed random variables. In the extension that handles both Gaussian variables and time-series data, the downsides are that the models still have a discrete time scale, evidence needs to be propagated through the graph, and the conditional relationships between the variables are bound to be linear. This paper converts two probabilistic graphical models (the Markov chain and the hidden Markov model) into Gaussian processes by constructing covariance and mean functions that encode the characteristics of the probabilistic graphical models. Our Gaussian-process-based formalism has the advantage of supporting a continuous time scale, direct inference from any time point to any other without propagation of evidence, and the flexibility to modify the covariance function if needed.
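To make the idea of encoding a Markov chain in a covariance function concrete, here is a minimal sketch. It is not the paper's construction. It uses a well-known special case as an assumption: a stationary, zero-mean, linear-Gaussian Markov chain x[t+1] = a*x[t] + noise with 0 < a < 1, whose covariance matches an exponential (Ornstein-Uhlenbeck) kernel. All function and parameter names are hypothetical.

```python
import math


def ar1_covariance(a, q, i, j):
    """Covariance between steps i and j of the stationary chain
    x[t+1] = a * x[t] + noise, where noise has variance q and 0 < a < 1."""
    stationary_variance = q / (1.0 - a * a)
    return stationary_variance * a ** abs(i - j)


def matched_ou_parameters(a, q, dt=1.0):
    """Kernel parameters that reproduce the chain on a grid with spacing dt."""
    variance = q / (1.0 - a * a)
    lengthscale = -dt / math.log(a)  # chosen so exp(-dt / lengthscale) == a
    return variance, lengthscale


def ou_kernel(s, t, variance, lengthscale):
    """Exponential covariance function, defined for any real times s and t."""
    return variance * math.exp(-abs(s - t) / lengthscale)


def predict_from_single_observation(t_obs, x_obs, t_new, variance, lengthscale):
    """Zero-mean GP conditioning on one observation: predictive mean and variance
    at t_new, computed directly without stepping through intermediate times."""
    k_cross = ou_kernel(t_new, t_obs, variance, lengthscale)
    k_obs = ou_kernel(t_obs, t_obs, variance, lengthscale)
    k_new = ou_kernel(t_new, t_new, variance, lengthscale)
    mean = k_cross / k_obs * x_obs
    var = k_new - k_cross * k_cross / k_obs
    return mean, var
```

On integer times the kernel agrees with the chain's covariance, and one step ahead the prediction reduces to the chain's own transition (mean a*x, variance q). The same function also accepts a non-integer t_new, which illustrates the continuous time scale mentioned in the abstract.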

Cooperative Graphical Models

Neural Information Processing Systems

We study a rich family of distributions that capture variable interactions significantly more expressive than those representable with low-treewidth or pairwise graphical models, or log-supermodular models. We call these cooperative graphical models. Yet, this family retains structure, which we carefully exploit for efficient inference techniques. Our algorithms combine the polyhedral structure of submodular functions in new ways with variational inference methods to obtain both lower and upper bounds on the partition function. While our fully convex upper bound is minimized as an SDP or via tree-reweighted belief propagation, our lower bound is tightened via belief propagation or mean-field algorithms.
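As background for the lower bounds mentioned above, the following sketch shows a mean-field lower bound on the log partition function. It is illustrative only and does not implement the paper's algorithms: it assumes a tiny binary pairwise model rather than a cooperative model with submodular terms, and every name in it is hypothetical. The exact value is computed by brute-force enumeration so that the bound can be checked.

```python
import itertools
import math


def score(x, theta, w):
    """Unnormalized log-probability: sum_i theta[i]*x[i] + sum_{i<j} w[i][j]*x[i]*x[j].
    w is a symmetric matrix with a zero diagonal."""
    n = len(x)
    total = sum(theta[i] * x[i] for i in range(n))
    total += sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return total


def exact_log_partition(theta, w):
    """Brute-force log Z over all binary assignments (only feasible for tiny n)."""
    n = len(theta)
    states = itertools.product((0, 1), repeat=n)
    return math.log(sum(math.exp(score(x, theta, w)) for x in states))


def bernoulli_entropy(mu):
    if mu <= 0.0 or mu >= 1.0:
        return 0.0
    return -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)


def mean_field_bound(mu, theta, w):
    """Lower bound on log Z for the product distribution with marginals mu:
    expected score (multilinear in mu) plus entropy."""
    return score(mu, theta, w) + sum(bernoulli_entropy(m) for m in mu)


def mean_field_ascent(theta, w, sweeps=50):
    """Coordinate ascent: each update maximizes the bound over a single marginal."""
    n = len(theta)
    mu = [0.5] * n
    for _ in range(sweeps):
        for i in range(n):
            field = theta[i] + sum(w[i][j] * mu[j] for j in range(n) if j != i)
            mu[i] = 1.0 / (1.0 + math.exp(-field))
    return mu
```

For any choice of marginals, mean_field_bound stays at or below exact_log_partition, and the coordinate updates never decrease it. Tightening means pushing this value up toward log Z.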

Uprooting and Rerooting Higher-Order Graphical Models

Neural Information Processing Systems

The idea of uprooting and rerooting graphical models was introduced specifically for binary pairwise models by Weller (2016) as a way to transform a model to any of a whole equivalence class of related models, such that inference on any one model yields inference results for all others. This is very helpful since inference, or relevant bounds, may be much easier to obtain or more accurate for some model in the class. Here we introduce methods to extend the approach to models with higher-order potentials and develop theoretical insights. In particular, we show that the triplet-consistent polytope TRI is unique in being 'universally rooted'. We demonstrate empirically that rerooting can significantly improve accuracy of methods of inference for higher-order models at negligible computational cost.

Bayesian Joint Estimation of Multiple Graphical Models

Neural Information Processing Systems

In this paper, we propose a novel Bayesian group regularization method based on the spike and slab Lasso priors for jointly estimating multiple graphical models. The proposed method can be used to estimate the common sparsity structure underlying the graphical models while capturing potential heterogeneity of the precision matrices corresponding to those models. Our theoretical results show that the proposed method enjoys the optimal rate of convergence in $\ell_\infty$ norm for estimation consistency and has a strong structure recovery guarantee even when the signal strengths over different graphs are heterogeneous. Through simulation studies and an application to the capital bike-sharing network data, we demonstrate the competitive performance of our method compared to existing alternatives. Papers published at the Neural Information Processing Systems Conference.