



Neural Generalized Mixed-Effects Models

Slavutsky, Yuli, Salazar, Sebastian, Blei, David M.

arXiv.org Machine Learning

Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear function of observed covariates and a latent group-specific random effect. Since exact marginalization over the random effects is typically intractable, model parameters are estimated by maximizing an approximate marginal likelihood. In this paper, we replace the linear function with neural networks. The result is a more flexible model, the neural generalized mixed-effects model (NGMM), which captures complex relationships between covariates and responses. To fit NGMM to data, we introduce an efficient optimization procedure that maximizes the approximate marginal likelihood and is differentiable with respect to network parameters. We show that the approximation error of our objective decays at a Gaussian-tail rate in a user-chosen parameter. On synthetic data, NGMM improves over GLMMs when covariate-response relationships are nonlinear, and on real-world datasets it outperforms prior methods. Finally, we analyze a large dataset of student proficiency to demonstrate how NGMM can be extended to more complex latent-variable models.
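To make the construction concrete, here is a minimal sketch of the idea. It is not the paper's implementation: the Bernoulli likelihood, the shared random-effect scale, and plain Monte Carlo marginalization are illustrative assumptions; the paper introduces its own differentiable approximation of the marginal likelihood.

```python
import math
import torch
import torch.nn as nn

class NGMM(nn.Module):
    """Sketch: a neural network replaces the GLMM's linear predictor,
    and group-level random effects are marginalized by Monte Carlo."""

    def __init__(self, d_in, n_groups, hidden=32):
        super().__init__()
        self.n_groups = n_groups
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.log_sigma = nn.Parameter(torch.zeros(()))  # random-effect scale

    def neg_approx_marginal_ll(self, x, y, group, n_mc=64):
        """x: (N, d_in); y: (N,) float in {0, 1} (Bernoulli responses are
        an illustrative assumption); group: (N,) long group indices."""
        fx = self.net(x).squeeze(-1)                               # (N,)
        # n_mc draws of one random effect per group: b_g ~ N(0, sigma^2).
        b = self.log_sigma.exp() * torch.randn(n_mc, self.n_groups)
        eta = fx + b[:, group]                                     # (n_mc, N)
        ll = -nn.functional.binary_cross_entropy_with_logits(
            eta, y.expand(n_mc, -1), reduction="none")             # (n_mc, N)
        # Sum log-likelihoods within each group, then average the
        # per-group likelihoods over MC samples in log space.
        group_ll = torch.zeros(n_mc, self.n_groups).index_add_(1, group, ll)
        log_marginal = torch.logsumexp(group_ll, dim=0) - math.log(n_mc)
        return -log_marginal.sum()
```

Because the Monte Carlo average is taken in log space per group, the returned objective is differentiable with respect to the network parameters and the random-effect scale, so it can be minimized with any standard optimizer.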





4fc81f4cd2715d995018e0799262176b-Supplemental-Conference.pdf

Neural Information Processing Systems

Two other important techniques are mixed precision training [36] and in-place activated BatchNorm [53]. Mixed precision training involves training using both 32-bit and 16-bit IEEE floating point numbers, depending on the numerical sensitivity of different layers [36].
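As a concrete illustration, here is a minimal mixed precision training loop in PyTorch (an assumed setup with a toy model and dummy data; `torch.cuda.amp` is one common implementation of the scheme described in [36], not necessarily the code used there). `autocast` runs numerically sensitive ops in 32-bit and the rest in 16-bit, while `GradScaler` scales the loss so that 16-bit gradients do not underflow:

```python
import torch

# Toy model and dummy data on a CUDA device (assumed available).
model = torch.nn.Linear(128, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

for _ in range(100):
    opt.zero_grad()
    with torch.cuda.amp.autocast():  # fp16 where safe, fp32 where sensitive
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()    # scaled loss -> fp16 grads that don't underflow
    scaler.step(opt)                 # unscales the grads, then steps the optimizer
    scaler.update()                  # adapts the scale factor over time
```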


Factor Graph Neural Net -- Supplementary File. A Proof of Propositions

Neural Information Processing Systems

First we provide Lemma 8, which will be used in the proofs of Propositions 2 and 4.

Lemma 8. Given $n$ non-negative feature vectors $f_i = [f_{i0}, f_{i1}, \dots, f_{im}]$, where $i = 1, \dots, n$, there exist $n$ matrices $Q_i$ of shape $nm \times m$ and $n$ vectors $\hat{f}_i = Q_i f_i^T$, s.t.

Proposition 2. A factor graph $G = (V, C, E)$ with variable log-potentials $\theta_i(x_i)$ and factor log-potentials $\phi_c(x_c)$ can be converted to a factor graph $G'$ with the same variable potentials and the decomposed log-potentials $\phi_{ic}(x_i, z_c)$ using a one-layer FGNN.

Without loss of generality, we assume that $\log \phi_c(x_c) > 1$. Then for each $i$ the term $\theta_{ic}(x_i, z_c)$ in (9) has $k^{n+1}$ entries, and each entry is either a scaled entry of the vector $g_c$ or an arbitrary negative number less than $\max_{x_c} \theta_c(x_c)$. Thus, if we organize $\theta_{ic}(x_i, z_c)$ as a length-$k^{n+1}$ vector $f_{ic}$, we can define a $k^{n+1} \times k^n$ matrix $Q_{ci}$ where, if and only if the $l$-th entry of $f_{ic}$ is set to the $m$-th entry of $g_c$ multiplied by $1/|s(c)|$, the entry of $Q_{ci}$ in the $l$-th row and $m$-th column is set to $1/|s(c)|$; all the other entries of $Q_{ci}$ are set to some negative number smaller than $\max_{x_c} \theta_c(x_c)$.


C qNEHVI under Different Computational Approaches

C.1 Derivation of IEP Formulation of qNEHVI

From (4), the expected noisy joint hypervolume improvement is given by
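The inclusion-exclusion principle (IEP) that this derivation builds on is the standard identity for the volume of a union. As a hedged sketch (the paper's exact qNEHVI expression wraps this in an expectation over the posterior; taking $A_i$ to be the region newly dominated by candidate $i$ is our illustrative reading, not the paper's notation):

```latex
\[
\operatorname{vol}\Bigl(\bigcup_{i=1}^{q} A_i\Bigr)
  \;=\; \sum_{\emptyset \neq S \subseteq \{1,\dots,q\}}
        (-1)^{|S|+1}\,
        \operatorname{vol}\Bigl(\bigcap_{i \in S} A_i\Bigr)
\]
```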

Neural Information Processing Systems

Bayesian optimization specifically aims to increase sample efficiency for hard optimization problems, and consequently can help achieve better solutions without incurring large societal costs. In the 2-objective case, instead of padding the box decomposition, the Pareto frontier under each posterior sample can be padded by repeating a point on the Pareto frontier, so that the padded Pareto frontier under every posterior sample has exactly $\max_t |P_t|$ points. Since the sequential NEHVI is equivalent to qNEHVI with $q = 1$, we prove Theorem 1 for the general $q \geq 1$ case. Recall from Section C.2 that, using the method of common random numbers to fix the base samples, the IEP and CBD formulations are equivalent. Note that the box decomposition of the non-dominated space $\{S_1, \dots, S_{K_t}\}$ and the number of rectangles in the box decomposition depend on $\zeta_t$.
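The padding step described above is simple to implement. Below is a minimal sketch (the function name `pad_pareto_frontiers` and the tensor layout are illustrative assumptions, not the paper's code): repeating an existing Pareto point adds nothing to the dominated region, so the hypervolume computed from the padded frontier is unchanged while every frontier gains the common length $\max_t |P_t|$.

```python
import torch

def pad_pareto_frontiers(frontiers):
    """Pad each (n_t, M) Pareto frontier to the common length max_t |P_t|
    by repeating one of its points; the dominated hypervolume is unchanged."""
    max_len = max(f.shape[0] for f in frontiers)
    padded = []
    for f in frontiers:
        if f.shape[0] < max_len:
            # Repeat the last point until the frontier reaches the common length.
            f = torch.cat([f, f[-1:].expand(max_len - f.shape[0], -1)], dim=0)
        padded.append(f)
    return torch.stack(padded)  # (n_posterior_samples, max_len, M)
```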


07bc722f08f096e6ea7ee99349ff0a86-Paper-Conference.pdf

Neural Information Processing Systems

In this paper, we study dataset distillation (DD) from a novel perspective and introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing DD baseline.