Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

James Oldfield

Neural Information Processing Systems

The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts (µMoE) layer to address this, focusing on vision models.
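For context, the dense computation that such layers decompose can be sketched as a soft mixture of expert linear maps, mixed by a learned gate. This is a minimal illustrative sketch of a generic MoE forward pass, not the paper's µMoE factorization (which avoids materializing every expert's output); all shapes and names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, expert_w):
    """Soft mixture-of-experts layer for a single input vector.

    x: (d_in,) input; gate_w: (n_experts, d_in) gating weights;
    expert_w: (n_experts, d_out, d_in) one linear map per expert.
    """
    logits = gate_w @ x                       # one score per expert
    a = np.exp(logits - logits.max())
    a /= a.sum()                              # softmax gate weights
    # Every expert applies its own linear map; the gate mixes the outputs.
    expert_out = np.einsum('eoi,i->eo', expert_w, x)
    return np.einsum('e,eo->o', a, expert_out)

x = rng.standard_normal(8)
gate_w = rng.standard_normal((4, 8))
expert_w = rng.standard_normal((4, 16, 8))
y = moe_forward(x, gate_w, expert_w)          # (16,) mixed output
```

The cost of the `expert_out` contraction grows linearly with the expert count, which is exactly the scaling bottleneck the abstract refers to.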



Guiding a Diffusion Model with a Bad Version of Itself

Neural Information Processing Systems

The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64×64 and 1.25 for 512×512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
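The guiding mechanism described above amounts to extrapolating the main model's denoising output away from a weaker guiding model's output. This is a minimal sketch of that extrapolation step only; the function and variable names are assumptions, and the weight value is purely illustrative.

```python
import numpy as np

def guided_denoise(d_main, d_guide, w):
    """Generic guidance: extrapolate the main model's denoising output
    away from a guiding model's output by guidance weight w.

    With d_guide from an unconditional model this is classifier-free
    guidance; the paper's proposal is to instead take d_guide from a
    smaller, less-trained version of the main model itself.
    """
    return d_guide + w * (d_main - d_guide)

d_main = np.array([1.0, 2.0])    # toy denoiser outputs, not real samples
d_guide = np.array([0.5, 1.0])
out = guided_denoise(d_main, d_guide, 2.0)   # -> [1.5, 3.0]
```

Note that `w = 1` recovers the main model unchanged, while `w > 1` pushes the output away from the weaker model's prediction.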





Distributionally Robust Linear Quadratic Control

Neural Information Processing Systems

Linear-Quadratic-Gaussian (LQG) control is a fundamental control paradigm that has been studied and applied in various fields such as engineering, computer science, economics, and neuroscience. It involves controlling a system with linear dynamics and imperfect observations, subject to additive noise, with the goal of minimizing a quadratic cost function depending on the state and control variables. In this work, we consider a generalization of the discrete-time, finite-horizon LQG problem, where the noise distributions are unknown and belong to Wasserstein ambiguity sets centered at nominal (Gaussian) distributions. The objective is to minimize a worst-case cost across all distributions in the ambiguity set, including non-Gaussian distributions. Despite the added complexity, we prove that a control policy that is linear in the observations is optimal, as in the classic LQG problem. We propose a numerical solution method that efficiently characterizes this optimal control policy. Our method uses the Frank-Wolfe algorithm to identify the least-favorable distributions within the Wasserstein ambiguity sets and computes the controller's optimal policy using Kalman filter estimation under these distributions.
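The Frank-Wolfe algorithm mentioned above repeatedly minimizes a linearization of the objective over the feasible set and steps toward that minimizer. This is a generic sketch on a toy constraint set (the probability simplex), not the paper's ambiguity-set subproblem; the objective and step rule are standard textbook choices assumed for illustration.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, n_iter=500):
    """Frank-Wolfe on the probability simplex: each step minimizes the
    linearized objective over the simplex (always a vertex) and moves
    toward that vertex with the classic step size 2/(t+2)."""
    x = x0.copy()
    for t in range(n_iter):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0             # linear minimizer over the simplex
        x += 2.0 / (t + 2.0) * (s - x)    # convex step keeps x feasible
    return x

# Toy objective f(x) = ||x - c||^2 with c inside the simplex, so the
# constrained minimizer is c itself.
c = np.array([0.2, 0.3, 0.5])
x = frank_wolfe_simplex(lambda x: 2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
```

Because every iterate is a convex combination of feasible points, the method never needs a projection step, which is what makes it attractive when the feasible set (here, a Wasserstein ball) is easy to linearly optimize over but hard to project onto.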




Supplementary Material for "DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks"

Neural Information Processing Systems

A trivial method for satisfying FTU fairness is to remove the protected attribute from downstream learners. We first provide a motivating example explaining why this is sub-optimal. We then follow this with an experiment on the Adult dataset.

A.1 Example

Defining fairness is task- and data-dependent. For example, let us assume two datasets are generated by the graphical models in Figure 1. Data generated by the top graph is considered fair: Education affects past experience (Resume), which together affect future job prospects (Job).
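The trivial FTU baseline described above can be sketched in a few lines: simply drop the protected column before training any downstream learner. The toy feature matrix and the column index are assumptions for illustration; the text's point is that this is sub-optimal, since proxies of the protected attribute can remain in the other features.

```python
import numpy as np

# Toy feature matrix: columns = [protected_attr, education, resume]
# (hypothetical data, loosely following the Figure 1 example).
X = np.array([[1.0, 12.0, 0.7],
              [0.0, 16.0, 0.9]])
PROTECTED_COL = 0   # assumed index of the protected attribute

# "Fairness through unawareness": delete the protected column so the
# downstream learner never sees it directly.
X_ftu = np.delete(X, PROTECTED_COL, axis=1)
```

The remaining columns (`education`, `resume`) may still correlate with the dropped attribute, which is exactly why the supplementary material argues this baseline falls short.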