SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement
Large scale training requires massive parallelism to finish the training within a reasonable amount of time. To support massive parallelism, large batch training is the key enabler but often at the cost of generalization performance. Existing works explore adaptive batching or hand-tuned static large batching, in order to strike a balance between the computational efficiency and the performance. However, these methods can provide only coarse-grained adaptation (e.g., at an epoch level) due to the intrinsic expensive calculation or hand tuning requirements. In this paper, we propose a fully automated and lightweight adaptive batching methodology to enable fine-grained batch size adaptation (e.g., at a mini-batch level) that can achieve state-of-the-art performance with record-breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology by providing a plugin tool that supports mainstream machine learning frameworks. Extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that the proposed methodology outperforms state-of-the-art methodologies using adaptive batching approaches or hand-tuned static strategies in both performance and batch size. Particularly, we achieve a new state-of-the-art batch size of 78k in BERT-Large pretraining with SQuAD score 90.69 compared to 90.58 reported in previous state-of-the-art with 59k batch size.
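As a rough illustration of what a gradient similarity measurement could look like, the sketch below splits a mini-batch of per-sample gradients into two halves, averages each half, and reports the cosine similarity of the two averages. The half-split scheme and the function names (`cosine_similarity`, `mean_gradient`, `half_batch_similarity`) are assumptions made for this example; the abstract does not specify how the similarity is computed or how it drives the batch size.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention for this sketch: a zero gradient has no direction
    return dot / (norm1 * norm2)


def mean_gradient(grads):
    """Element-wise average of a non-empty list of equally sized gradients."""
    n = len(grads)
    dim = len(grads[0])
    return [sum(g[i] for g in grads) / n for i in range(dim)]


def half_batch_similarity(per_sample_grads):
    """Hypothetical noise proxy: similarity between the two half-batch gradients.

    Values near 1 suggest the two halves agree (low noise at this batch size);
    values near 0 or below suggest the averaged gradients are dominated by noise.
    """
    if len(per_sample_grads) < 2:
        raise ValueError("need at least two per-sample gradients")
    mid = len(per_sample_grads) // 2
    first = mean_gradient(per_sample_grads[:mid])
    second = mean_gradient(per_sample_grads[mid:])
    return cosine_similarity(first, second)


if __name__ == "__main__":
    agreeing = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
    conflicting = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
    print(half_batch_similarity(agreeing))
    print(half_batch_similarity(conflicting))
```

A controller that adjusts the batch size from this signal would sit on top of such a measurement; its update rule is deliberately left out here because the abstract does not describe it.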
LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models You Chen Department of Computer Science Tsinghua University
Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain. However, legal applications demand high standards of accuracy, reliability, and fairness. Applying existing LLMs to legal systems without careful evaluation of their potential and limitations could pose significant risks in legal practice. To this end, we introduce a standardized comprehensive Chinese legal benchmark LexEval. This benchmark is notable in the following three aspects: (1) Ability Modeling: We propose a new taxonomy of legal cognitive abilities to organize different tasks.
A Tight Lower Bound and Efficient Reduction for Swap Regret
Swap regret, a generic performance measure of online decision-making algorithms, plays an important role in the theory of repeated games, along with a close connection to correlated equilibria in strategic games. This paper shows an $\Omega(\sqrt{TN \log N})$ lower bound for swap regret, where $T$ and $N$ denote the numbers of time steps and available actions, respectively. Our lower bound is tight up to a constant, and resolves an open problem mentioned, e.g., in the book by Nisan et al. [28]. Besides, we present a computationally efficient reduction method that converts no-external-regret algorithms to no-swap-regret algorithms. This method can be applied not only to the full-information setting but also to the bandit setting and provides a better regret bound than previous results.
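To make the distinction between external and swap regret concrete, the following sketch computes both for a recorded sequence of deterministic actions and per-round loss vectors. It uses the standard definitions (external regret compares against the best single fixed action; swap regret compares against the best action-to-action remapping), restricted to pure actions for simplicity. The function names and the toy data are illustrative only and are not taken from the paper.

```python
def external_regret(actions, losses):
    """Incurred loss minus the loss of the best single fixed action.

    actions: list of chosen action indices, one per round.
    losses:  list of loss vectors, losses[t][j] = loss of action j in round t.
    """
    n = len(losses[0])
    incurred = sum(losses[t][a] for t, a in enumerate(actions))
    best_fixed = min(sum(row[j] for row in losses) for j in range(n))
    return incurred - best_fixed


def swap_regret(actions, losses):
    """Incurred loss minus the loss under the best swap function i -> phi(i).

    The best swap function can be chosen independently for each action i,
    so the maximum decomposes into one minimisation per played action.
    """
    n = len(losses[0])
    total = 0.0
    for i in range(n):
        rounds = [t for t, a in enumerate(actions) if a == i]
        incurred = sum(losses[t][i] for t in rounds)
        best_swap = min(sum(losses[t][j] for t in rounds) for j in range(n))
        total += incurred - best_swap
    return total


if __name__ == "__main__":
    # Toy example: the learner always picks the action that loses this round.
    actions = [0, 1, 0, 1]
    losses = [[1, 0], [0, 1], [1, 0], [0, 1]]
    print(external_regret(actions, losses))  # 2
    print(swap_regret(actions, losses))      # 4.0
```

In the toy data, no single fixed action does better than a total loss of 2, while swapping every 0 for 1 and every 1 for 0 avoids all loss, which is why swap regret is the larger of the two quantities.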
Nested Variational Inference Hao Wu Jan-Willem van de Meent
We develop nested variational inference (NVI), a family of methods that learn proposals for nested importance samplers by minimizing a forward or reverse KL divergence at each level of nesting. NVI is applicable to many commonly-used importance sampling strategies and provides a mechanism for learning intermediate densities, which can serve as heuristics to guide the sampler. Our experiments apply NVI to (a) sample from a multimodal distribution using a learned annealing path, (b) learn heuristics that approximate the likelihood of future observations in a hidden Markov model, and (c) perform amortized inference in hierarchical deep generative models. We observe that optimizing nested objectives leads to improved sample quality in terms of log average weight and effective sample size.
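The two sample-quality metrics named above can be computed directly from the log importance weights of a sampler. The sketch below uses the usual definitions, log average weight $\log\big(\tfrac{1}{K}\sum_k w_k\big)$ and effective sample size $(\sum_k w_k)^2 / \sum_k w_k^2$, evaluated in log space for numerical stability. The helper names are chosen for this example, and the exact estimators used in the paper's experiments may differ.

```python
import math


def log_sum_exp(log_values):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def log_average_weight(log_weights):
    """log of the mean importance weight, i.e. log((1/K) * sum_k w_k)."""
    return log_sum_exp(log_weights) - math.log(len(log_weights))


def effective_sample_size(log_weights):
    """ESS = (sum_k w_k)^2 / sum_k w_k^2, between 1 and K."""
    log_sum = log_sum_exp(log_weights)
    log_sum_sq = log_sum_exp([2.0 * lw for lw in log_weights])
    return math.exp(2.0 * log_sum - log_sum_sq)


if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0, 0.0]        # all samples weighted equally
    degenerate = [0.0, -1000.0, -1000.0]  # one sample carries all the weight
    print(effective_sample_size(uniform), log_average_weight(uniform))
    print(effective_sample_size(degenerate))
```

Equal weights give an effective sample size equal to the number of samples, while a single dominant weight drives it toward 1, which is the degeneracy that better proposals are meant to reduce.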
This beast of a robot vacuum is heavily discounted at Amazon -- save 700 on the Roborock Qrevo Master
SAVE 700: As of May 22, the Roborock Qrevo Master is on sale for 899.99 at Amazon. As of May 22, the Roborock Qrevo Master robot vacuum and mop is on sale for 44% off, now down to 899.99. And with this vacuum, you're getting a whole lot to be excited about. The Qrevo Master handles both vacuuming and mopping, with minimal effort required on your end. Its self-emptying dock means up to seven weeks of hands-free cleaning, and with 10,000Pa suction and the Carpet Boost System, it's seriously effective, removing up to 99% of hair from carpets.
Russia-Ukraine war: List of key events, day 1,183
Russia's Defence Ministry said air defences shot down 105 Ukrainian drones over Russian regions, including 35 over the Moscow region, after the ministry said a day earlier that it had downed more than 300 Ukrainian drones. Kherson Governor Oleksandr Prokudin said one person was killed in a Russian artillery attack on the region. He said over the past day, 35 areas in Kherson, including Kherson city, came under artillery shelling and air attacks, wounding 11 people. Ukrainian President Zelenskyy said the "most intense situation" is in the Donetsk region, and the army is continuing "active operations in the Kursk and Belgorod regions".
Generalizing Bayesian Optimization with Decision-theoretic Entropies Willie Neiswanger
Bayesian optimization (BO) is a popular method for efficiently inferring optima of an expensive black-box function via a sequence of queries. Existing information-theoretic BO procedures aim to make queries that most reduce the uncertainty about optima, where the uncertainty is captured by Shannon entropy. However, an optimal measure of uncertainty would, ideally, factor in how we intend to use the inferred quantity in some downstream procedure. In this paper, we instead consider a generalization of Shannon entropy from work in statistical decision theory [13, 39], which contains a broad class of uncertainty measures parameterized by a problem-specific loss function corresponding to a downstream task. We first show that special cases of this entropy lead to popular acquisition functions used in BO procedures such as knowledge gradient, expected improvement, and entropy search. We then show how alternative choices for the loss yield a flexible family of acquisition functions that can be customized for use in novel optimization settings.
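For readers unfamiliar with the acquisition functions mentioned, here is a minimal sketch of expected improvement in its well-known closed form for a Gaussian posterior, written for maximization: $\mathrm{EI} = (\mu - f^{*})\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f^{*})/\sigma$. This is the textbook formula, not the decision-theoretic derivation from the paper, and the function and parameter names (`expected_improvement`, `best_so_far`) are assumptions for this example.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far at a candidate point (maximization).

    mean, std: posterior mean and standard deviation of the function value
               at the candidate, assumed Gaussian.
    """
    if std <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


if __name__ == "__main__":
    # Same posterior mean as the incumbent: more uncertainty, more expected gain.
    print(expected_improvement(mean=1.0, std=0.5, best_so_far=1.0))
    print(expected_improvement(mean=1.0, std=2.0, best_so_far=1.0))
```

The acquisition value grows with posterior uncertainty even when the mean matches the incumbent, which is how the function trades off exploration against exploitation.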
Causal Discovery from Event Sequences by Local Cause-Effect Attribution
Sequences of events, such as crashes in the stock market or outages in a network, contain strong temporal dependencies, whose understanding is crucial to react to and influence future events. In this paper, we study the problem of discovering the underlying causal structure from event sequences. To this end, we introduce a new causal model, where individual events of the cause trigger events of the effect with dynamic delays. We show that in contrast to existing methods based on Granger causality, our model is identifiable for both instant and delayed effects. We base our approach on the Algorithmic Markov Condition, by which we identify the true causal network as the one that minimizes the Kolmogorov complexity. As the Kolmogorov complexity is not computable, we instantiate our model using Minimum Description Length and show that the resulting score identifies the causal direction.