rkl
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
Wu, Taiqiang, Tao, Chaofan, Wang, Jiahao, Zhao, Zhe, Wong, Ngai
Kullback-Leibler (KL) divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither the mode-seeking nor the mean-seeking property manifests in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. In practice, however, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that in the early epochs RKL focuses on the tail of the distributions, while FKL focuses on the head. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
- North America > United States > California > San Francisco County > San Francisco (0.05)
- North America > United States > New York (0.05)
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.05)
- (12 more...)
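The abstract above contrasts forward and reverse KL for distilling LLMs and combines them with adaptive weights. As a rough illustration only, here is a minimal PyTorch sketch of token-level FKL and RKL losses plus a generic adaptive convex combination; the adaptive_weight heuristic is a hypothetical placeholder, not the AKL weighting defined in the paper, and logits of shape (batch, seq_len, vocab) are assumed.

```python
# Minimal sketch (not the paper's implementation) of forward/reverse KL
# between teacher and student next-token distributions, plus a generic
# adaptive convex combination. Logits are assumed to have shape
# (batch, seq_len, vocab); adaptive_weight is a hypothetical stand-in
# for whatever weighting AKL actually uses.
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    # FKL = KL(p_teacher || p_student), averaged over batch and positions
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(-1).mean()

def reverse_kl(teacher_logits, student_logits):
    # RKL = KL(p_student || p_teacher)
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(-1).mean()

def adaptive_weight(teacher_logits, student_logits):
    # Hypothetical heuristic: weight FKL by how much of the teacher-student
    # gap lies in the head of the teacher distribution (NOT the paper's rule).
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    head = p > p.mean(dim=-1, keepdim=True)
    head_gap = ((p - q).abs() * head).sum(-1)
    tail_gap = ((p - q).abs() * (~head)).sum(-1)
    return (head_gap / (head_gap + tail_gap + 1e-8)).mean()

def adaptive_kl_loss(teacher_logits, student_logits):
    w = adaptive_weight(teacher_logits, student_logits)
    fkl = forward_kl(teacher_logits, student_logits)
    rkl = reverse_kl(teacher_logits, student_logits)
    return w * fkl + (1.0 - w) * rkl
```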
AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning
Kanaujia, Vikas, Scheurer, Mathias S., Arora, Vipul
Deep generative models complement Markov chain Monte Carlo methods for efficiently sampling from high-dimensional distributions. Among these methods, explicit generators, such as Normalising Flows (NFs), in combination with the Metropolis-Hastings algorithm have been extensively applied to obtain unbiased samples from target distributions. We systematically study central problems in conditional NFs, such as high variance, mode collapse, and data efficiency. We propose adversarial training for NFs to ameliorate these problems. Experiments are conducted with low-dimensional synthetic datasets and XY spin models in two spatial dimensions.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > India > Uttar Pradesh > Kanpur (0.04)
- Asia > China (0.04)
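The entry above mentions pairing explicit generators such as normalising flows with the Metropolis-Hastings algorithm to obtain unbiased samples from a target. Below is a generic sketch of independence Metropolis-Hastings, assuming a proposal that exposes sample() and log_prob(); a standard Gaussian stands in for a trained flow and the target density is a placeholder, so this is not the paper's setup.

```python
# Generic sketch of independence Metropolis-Hastings with an explicit
# proposal (e.g. a trained normalising flow). A standard Gaussian stands
# in for the flow; any object exposing sample() and log_prob() would do.
import numpy as np

class GaussianProposal:
    def sample(self):
        return np.random.normal(size=2)

    def log_prob(self, x):
        return -0.5 * np.sum(x ** 2)  # log-density up to an additive constant

def target_log_prob(x):
    # Placeholder target: an isotropic Gaussian with variance 2.
    return -0.25 * np.sum(x ** 2)

def independence_mh(proposal, log_p, n_steps=10_000):
    x = proposal.sample()
    chain = []
    for _ in range(n_steps):
        x_new = proposal.sample()
        # Acceptance probability min(1, p(x') q(x) / (p(x) q(x')))
        log_alpha = (log_p(x_new) - log_p(x)
                     + proposal.log_prob(x) - proposal.log_prob(x_new))
        if np.log(np.random.rand()) < log_alpha:
            x = x_new
        chain.append(x)
    return np.array(chain)

samples = independence_mh(GaussianProposal(), target_log_prob)
```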
Training Normalizing Flows with the Precision-Recall Divergence
Verine, Alexandre, Negrevergne, Benjamin, Pydi, Muni Sreenivas, Chevaleyre, Yann
Generative models can have distinct modes of failure, such as mode dropping and low-quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures the quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distributions are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of discrepancy measure. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising f-divergences from a family we call the PR-divergences. Conversely, any f-divergence can be written as a linear combination of PR-divergences and therefore corresponds to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any f-divergence and, in particular, achieve a given precision-recall trade-off.
- Europe > Austria > Vienna (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
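Since the abstract above relates precision-recall trade-offs to minimising members of the f-divergence family, a small numeric sketch of the generic f-divergence D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) over discrete distributions may help; the forward- and reverse-KL generators below are standard, while the PR-divergence generators themselves are defined in the paper and not reproduced here.

```python
# Sketch: generic f-divergence D_f(P || Q) = sum_x q(x) * f(p(x)/q(x))
# for discrete distributions with full support. Forward and reverse KL
# correspond to the generators f(u) = u*log(u) and f(u) = -log(u); the
# PR-divergence generators from the paper are not reproduced here.
import numpy as np

def f_divergence(p, q, f):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

forward_kl_gen = lambda u: u * np.log(u)   # gives KL(P || Q)
reverse_kl_gen = lambda u: -np.log(u)      # gives KL(Q || P)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(f_divergence(p, q, forward_kl_gen))  # forward KL
print(f_divergence(p, q, reverse_kl_gen))  # reverse KL
```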
RKL: a general, invariant Bayes solution for Neyman-Scott
Neyman-Scott is a classic example of an estimation problem with a partially-consistent posterior, for which standard estimation methods tend to produce inconsistent results. Past attempts to create consistent estimators for Neyman-Scott have led to ad-hoc solutions, to estimators that do not satisfy representation invariance, to restrictions over the choice of prior, and more. We present a simple construction for a general-purpose Bayes estimator, invariant to representation, which satisfies consistency on Neyman-Scott over any nondegenerate prior. We argue that the good attributes of the estimator are due to its intrinsic properties, and generalise beyond Neyman-Scott as well.
Keywords: Neyman-Scott, consistent estimation, minEKL, Kullback-Leibler, Bayes estimation, invariance
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York (0.04)
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- (2 more...)