United States
SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement
Large scale training requires massive parallelism to finish the training within a reasonable amount of time. To support massive parallelism, large batch training is the key enabler but often at the cost of generalization performance. Existing works explore adaptive batching or hand-tuned static large batching, in order to strike a balance between the computational efficiency and the performance. However, these methods can provide only coarse-grained adaption (e.g., at a epoch level) due to the intrinsic expensive calculation or hand tuning requirements. In this paper, we propose a fully automated and lightweight adaptive batching methodology to enable fine-grained batch size adaption (e.g., at a mini-batch level) that can achieve stateof-the-art performance with record breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology by providing a plugin tool that supports mainstream machine learning frameworks. Extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that the proposed methodology outperforms state-of-the-art methodologies using adaptive batching approaches or hand-tuned static strategies in both performance and batch size. Particularly, we achieve a new state-of-the-art batch size of 78k in BERT-Large pretraining with SQuAD score 90.69 compared to 90.58 reported in previous state-of-the-art with 59k batch size.
Nested Variational Inference Hao Wu Jan-Willem van de Meent
We develop nested variational inference (NVI), a family of methods that learn proposals for nested importance samplers by minimizing an forward or reverse KL divergence at each level of nesting. NVI is applicable to many commonly-used importance sampling strategies and provides a mechanism for learning intermediate densities, which can serve as heuristics to guide the sampler. Our experiments apply NVI to (a) sample from a multimodal distribution using a learned annealing path (b) learn heuristics that approximate the likelihood of future observations in a hidden Markov model and (c) to perform amortized inference in hierarchical deep generative models. We observe that optimizing nested objectives leads to improved sample quality in terms of log average weight and effective sample size.
Generalizing Bayesian Optimization with Decision-theoretic Entropies Willie Neiswanger
Bayesian optimization (BO) is a popular method for efficiently inferring optima of an expensive black-box function via a sequence of queries. Existing informationtheoretic BO procedures aim to make queries that most reduce the uncertainty about optima, where the uncertainty is captured by Shannon entropy. However, an optimal measure of uncertainty would, ideally, factor in how we intend to use the inferred quantity in some downstream procedure. In this paper, we instead consider a generalization of Shannon entropy from work in statistical decision theory [13, 39], which contains a broad class of uncertainty measures parameterized by a problem-specific loss function corresponding to a downstream task. We first show that special cases of this entropy lead to popular acquisition functions used in BO procedures such as knowledge gradient, expected improvement, and entropy search. We then show how alternative choices for the loss yield a flexible family of acquisition functions that can be customized for use in novel optimization settings.
Invariant and Transportable Representations for Anti-Causal Domain Shifts and Victor Veitch Department of Computer Science, University of Chicago Department of Statistics, University of Chicago
Real-world classification problems must contend with domain shift, the (potential) mismatch between the domain where a model is deployed and the domain(s) where the training data was gathered. Methods to handle such problems must specify what structure is common between the domains and what varies. A natural assumption is that causal (structural) relationships are invariant in all domains. Then, it is tempting to learn a predictor for label Y that depends only on its causal parents. However, many real-world problems are "anti-causal" in the sense that Y is a cause of the covariates X--in this case, Y has no causal parents and the naive causal invariance is useless.
Everything Unveiled at Google I/O 2025
See all the highlights from Google's annual 2025 Developers Conference in Mountain View, California. Check out the latest updates from Android XR to Gemini Live, and more. Topics Android Artificial Intelligence Google Google Gemini Latest Videos Everything Announced at AMD's 2025 Computex Keynote in 19 Minutes Watch highlights from AMD's Computex press conference. 1 hour ago By Mashable Video'Caught Stealing' trailer sees Zoรซ Kravitz and Austin Butler's cat-sitting gone awry Darren Aronofsky's swaggering new film looks like a rollicking time. Loading... Subscribe These newsletters may contain advertising, deals, or affiliate links. By clicking Subscribe, you confirm you are 16 and agree to ourTerms of Use and Privacy Policy.
Android XR Glasses Unveiled at Google I/O 2025
Topics Android Artificial Intelligence Google Google Gemini Latest Videos Everything Announced at AMD's 2025 Computex Keynote in 19 Minutes Watch highlights from AMD's Computex press conference. 1 hour ago By Mashable Video'Caught Stealing' trailer sees Zoรซ Kravitz and Austin Butler's cat-sitting gone awry Darren Aronofsky's swaggering new film looks like a rollicking time. Loading... Subscribe These newsletters may contain advertising, deals, or affiliate links. By clicking Subscribe, you confirm you are 16 and agree to ourTerms of Use and Privacy Policy. See you at your inbox! Mashable is a registered trademark of Ziff Davis and may not be used by third parties without express written permission.
SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained model debugging and analysis Roberto Novoa
However, there are only a few datasets that include concept-level meta-labels and most of these meta-labels are relevant for natural images that do not require domain expertise. Previous densely annotated datasets in medicine focused on meta-labels that are relevant to a single disease such as osteoarthritis or melanoma. In dermatology, skin disease is described using an established clinical lexicon that allows clinicians to describe physical exam findings to one another. To provide a medical dataset densely annotated by domain experts with annotations useful across multiple disease processes, we developed SkinCon: a skin disease dataset densely annotated by dermatologists. SkinCon includes 3230 images from the Fitzpatrick 17k skin disease dataset densely annotated with 48 clinical concepts, 22 of which have at least 50 images representing the concept. The concepts used were chosen by two dermatologists considering the clinical descriptor terms used to describe skin lesions.
Amid technical glitches, California's e-bike incentive program promises to be ready for new applicants
A surge of applicants vying for a chance to be chosen for a voucher worth up to 2,000 for the California E-Bike Incentive Program triggered an error in the program's website, blocking everyone from applying. Officials say they've fixed the glitch for the next round of applications next week. The California E-Bike Incentive Program, launched by the California Air Resources Board, was established to help lower cost barriers to alternative methods of transportation such as e-bikes, with the goal of getting cars off the road and reduce greenhouse gas emissions. Eligible residents must be 18 years or older with an annual household income less than 300% of the Federal Poverty Level. The vouchers can be used toward the purchase of an electric bike.
Joint inference and input optimization in equilibrium networks Shaojie Bai Carnegie Mellon University Carnegie Mellon University J. Zico Kolter
Many tasks in deep learning involve optimizing over the inputs to a network to minimize or maximize some objective; examples include optimization over latent spaces in a generative model to match a target image, or adversarially perturbing an input to worsen classifier performance. Performing such optimization, however, is traditionally quite costly, as it involves a complete forward and backward pass through the network for each gradient step. In a separate line of work, a recent thread of research has developed the deep equilibrium (DEQ) model, a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer. In this paper, we show that there is a natural synergy between these two settings. Although, naively using DEQs for these optimization problems is expensive (owing to the time needed to compute a fixed point for each gradient step), we can leverage the fact that gradientbased optimization can itself be cast as a fixed point iteration to substantially improve the overall speed. That is, we simultaneously both solve for the DEQ fixed point and optimize over network inputs, all within a single "augmented" DEQ model that jointly encodes both the original network and the optimization process. Indeed, the procedure is fast enough that it allows us to efficiently train DEQ models for tasks traditionally relying on an "inner" optimization loop. We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.