Transcendental Regularization of Finite Mixtures: Theoretical Guarantees and Practical Limitations
Finite mixture models are widely used for unsupervised learning, but maximum likelihood estimation via EM suffers from degeneracy as components collapse. We introduce transcendental regularization, a penalized likelihood framework with analytic barrier functions that prevent degeneracy while maintaining asymptotic efficiency. The resulting Transcendental Algorithm for Mixtures of Distributions (TAMD) offers strong theoretical guarantees: identifiability, consistency, and robustness. Empirically, TAMD successfully stabilizes estimation and prevents collapse, yet achieves only modest improvements in classification accuracy, highlighting fundamental limits of mixture models for unsupervised learning in high dimensions. Our work provides both a novel theoretical framework and an honest assessment of practical limitations, implemented in an open-source R package.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)
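The barrier idea in the abstract above can be sketched in a few lines. The following is an illustration only, not TAMD itself (whose transcendental barriers are defined in the paper): EM for a two-component 1-D Gaussian mixture with an assumed penalty lam*(v + 1/v) on each component variance v. The 1/v term diverges as v -> 0, so the penalized M-step, which has a closed form, keeps variances strictly positive and prevents collapse.

```python
import numpy as np

def em_penalized_gmm(x, n_iter=200, lam=1.0, seed=0):
    """EM for a 1-D, two-component Gaussian mixture with a barrier
    penalty lam*(v + 1/v) on each component variance v (a hypothetical
    barrier chosen for illustration, not the paper's)."""
    rng = np.random.default_rng(seed)
    w = np.full(2, 0.5)
    mu = rng.choice(x, size=2, replace=False)
    var = np.full(2, x.var())
    for _ in range(n_iter):
        # E-step: posterior responsibilities r[i, k]
        d = x[:, None] - mu[None, :]
        logp = -0.5 * d**2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(w)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weights and means are the usual EM updates
        Nk = r.sum(axis=0)
        w = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        # Penalized variance update: maximizing
        #   -0.5*Nk*log(v) - Sk/(2v) - lam*(v + 1/v)
        # over v gives the quadratic 2*lam*v^2 + Nk*v - (Sk + 2*lam) = 0,
        # whose positive root recovers the MLE Sk/Nk as lam -> 0.
        Sk = (r * (x[:, None] - mu[None, :])**2).sum(axis=0)
        var = (-Nk + np.sqrt(Nk**2 + 8 * lam * (Sk + 2 * lam))) / (4 * lam)
    return w, mu, var
```

Because the barrier enters the M-step in closed form, the update costs no more than standard EM, yet the estimated variances stay bounded away from zero.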
Universal Boosting Variational Inference
Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. But the guarantees have strong conditions that do not often hold in practice, resulting in degenerate component optimization problems; and we show that the ad-hoc regularization used to prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. We show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations.
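UBVI's geometry rests on the Hellinger metric between densities. As a small reference point (this is background, not UBVI itself), the squared Hellinger distance between two univariate Gaussians has a closed form, and unlike the KL divergence it is symmetric and bounded in [0, 1]:

```python
import numpy as np

def hellinger2_gauss(mu1, s1, mu2, s2):
    """Squared Hellinger distance H^2 = 1 - integral of sqrt(p*q)
    between N(mu1, s1^2) and N(mu2, s2^2), in closed form."""
    v = s1**2 + s2**2
    return 1.0 - np.sqrt(2.0 * s1 * s2 / v) * np.exp(-(mu1 - mu2)**2 / (4.0 * v))

# identical Gaussians are at distance 0; well-separated ones approach 1
print(hellinger2_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
print(hellinger2_gauss(0.0, 1.0, 3.0, 1.0))  # ~ 0.675
```

Boundedness is one reason a Hellinger objective can behave better than KL when the mixture family is misspecified.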
Globally Convergent Policy Search for Output Estimation
We introduce the first direct policy search algorithm which provably converges to the globally optimal dynamic filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain an internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of informativity, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a regularizer which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a "reconditioning step" - converges to the globally optimal cost at a $O(1/T)$ rate.
Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics
Wei, Aaron, Jalali, Milad, Sutherland, Danica J.
Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power. We address this long-standing limitation by extending the theory of generalized U-statistics and applying it to the usual MMD estimator, resulting in a new characterization of the asymptotic distributions of the MMD estimator with unequal sample sizes (particularly outside the proportional regimes required by previous partial results). This generalization also provides a new criterion for optimizing the power of an MMD test with unequal sample sizes. Our approach preserves all available data, enhancing test accuracy and applicability in realistic settings. Along the way, we give much cleaner characterizations of the variance of MMD estimators, revealing something that might be surprising to those in the area: while zero MMD implies a degenerate estimator, it is sometimes possible to have a degenerate estimator with nonzero MMD as well; we give a construction and a proof that this does not happen in common situations.
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- North America > Canada > Quebec (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
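The standard unbiased MMD^2 U-statistic already accepts unequal sample sizes m != n; the paper's contribution concerns its asymptotic distribution and test power in that regime. A minimal sketch of the estimator with a Gaussian kernel (illustrative only, not the authors' code):

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=0.5):
    """Unbiased estimator of MMD^2 between samples X (m x d) and
    Y (n x d) with the Gaussian kernel k(a,b) = exp(-gamma*||a-b||^2).
    Works for unequal m and n, so no data need be discarded."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
        return np.exp(-gamma * sq)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # exclude i == j
    yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return xx + yy - 2.0 * Kxy.mean()
```

With samples from the same distribution the estimate hovers around zero (it can be slightly negative, being unbiased); with a clear mean shift it is strongly positive.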
Parallelizing Tree Search with Twice Sequential Monte Carlo
Oren, Yaniv, de Vries, Joery A., van der Vaart, Pascal R., Spaan, Matthijs T. J., Böhmer, Wendelin
Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.89)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.88)
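The degeneracy that limits SMC's scaling with search depth can be seen in a toy experiment: without resampling, sequential importance weights concentrate on ever fewer particles as depth grows, which the effective sample size (ESS) makes visible (resampling fixes this at the cost of the path degeneracy the abstract mentions). This sketch shows only the underlying phenomenon; it is not TSMCTS.

```python
import numpy as np

def effective_sample_size(logw):
    """ESS = 1 / sum(w_norm^2): equals N for uniform weights and
    approaches 1 when one particle carries nearly all the weight."""
    w = np.exp(logw - logw.max())  # stabilize before exponentiating
    w /= w.sum()
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(0)
N, depth = 1000, 50
logw = np.zeros(N)                   # N particles, initially uniform
ess0 = effective_sample_size(logw)   # = N
for _ in range(depth):
    # each step multiplies each weight by a random per-particle likelihood
    logw += rng.normal(0.0, 1.0, N)
essT = effective_sample_size(logw)   # collapses to a handful of particles
```

After 50 steps, a thousand particles are effectively reduced to a few, which is why deeper search demands variance reduction of the kind TSMCTS targets.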
Supplement: Scalable and Stable Surrogates for Flexible Classifiers with Fairness Constraints
All relaxations are optimized via our Lagrangian framework. All code was implemented in PyTorch and optimized using L-BFGS. On the right, the difference framework is used to achieve equality of opportunity on COMPAS. We set the initial learning rate to 0.1. Here we define equality of opportunity on false negative rates, i.e., predicting that someone ... Setting s = b, however, causes the linear relaxation to degenerate. For our deep learning experiments, we used the approach of Sec.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Orange County > Irvine (0.14)
A key feature of the method is that the measured data in a given direction is not directly used to estimate the denoised
We thank the reviewers for their positive comments on clarity, novelty, and convincing experiments. It was unexpected to us too that such a simple method would work so well. Sparsity in a learned basis is an important approach distinct from our own, and we will mention it. We train one regressor per held-out volume (Sec. 2.2). Marchenko-Pastur (MP) methods are patch-based algorithms, assembling voxels from a patch across volumes into a matrix.
Supplement to: Embedding Principle of Loss Landscape of Deep Neural Networks
However, this transform does not inform about the degeneracy of critical points/manifolds. Clearly, this transform is also a critical transform. For the 1D fitting experiments (Figs. 1, 3(a), 4), we use tanh as the activation function and the mean squared error loss. We use full-batch gradient descent with learning rate 0.005, the default full-batch Adam optimizer with learning rate 0.02, and the default full-batch Adam optimizer with learning rate 0.00003 for the respective experiments. Their output functions are shown in the figure. Remark that, although Figs. 1 and 5 are case studies each based on a random trial, similar phenomena are observed across trials.
- Asia > China > Shanghai > Shanghai (0.05)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Europe > France (0.04)