Concrete distribution
KDGAN: Knowledge Distillation with Generative Adversarial Networks
Wang, Xiaojie, Zhang, Rui, Sun, Yu, Qi, Jianzhong
Knowledge distillation (KD) aims to train a lightweight classifier that provides accurate inference under constrained resources in multi-label learning. Instead of directly consuming feature-label pairs, the classifier is trained by a teacher, i.e., a high-capacity model whose training may be resource-hungry. The accuracy of a classifier trained this way is usually suboptimal because it is difficult to learn the true data distribution from the teacher. An alternative is to adversarially train the classifier against a discriminator in a two-player game akin to generative adversarial networks (GAN), which ensures that the classifier learns the true data distribution at the equilibrium of this game. However, such a two-player game may take an excessively long time to reach equilibrium due to high-variance gradient updates.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Data Science > Data Mining (0.94)
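For context on the distillation setup that KDGAN builds on, the sketch below shows the standard soft-target objective of vanilla KD (Hinton et al.), in which the student mimics the teacher's temperature-softened outputs. This is not the KDGAN algorithm itself, and it is written for the single-label case for simplicity; the function name and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Vanilla knowledge-distillation objective (soft targets + hard labels).

    T: softmax temperature; alpha: weight on the distillation term.
    """
    # KL divergence between temperature-softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard-label term
    # Standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```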
Accuracy-Preserving Calibration via Statistical Modeling on Probability Simplex
Esaki, Yasushi, Nakamura, Akihiro, Kawano, Keisuke, Tokuhisa, Ryoko, Kutsuna, Takuro
Classification models based on deep neural networks (DNNs) must be calibrated so that the reliability of their predictions can be measured. Some recent calibration methods employ a probabilistic model on the probability simplex. However, these methods cannot preserve the accuracy of pre-trained models, even those with high classification accuracy. We propose an accuracy-preserving calibration method that uses the Concrete distribution as the probabilistic model on the probability simplex. We theoretically prove that a DNN model trained with the cross-entropy loss is optimal as the parameter of the Concrete distribution. We also propose an efficient method that synthetically generates samples for training probabilistic models on the probability simplex. We demonstrate that the proposed method outperforms previous methods on accuracy-preserving calibration benchmarks.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Switzerland (0.04)
- Europe > Spain (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
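To make the modeling choice above concrete: calibration methods of this family fit a density on the simplex around the network's outputs. The sketch below evaluates the log-density of the Concrete distribution as given in Maddison et al. (2017), with the location parameters taken to be a pre-trained network's logits; the function name is an illustrative assumption, and the paper's actual training procedure is not reproduced here.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def concrete_log_density(x, log_alpha, lam):
    """Log-density of Concrete(alpha, lambda) at a point x on the simplex.

    Follows the density in Maddison et al. (2017). Here log_alpha plays
    the role of the pre-trained DNN's logits and lam is the temperature.
    """
    K = len(x)
    log_x = np.log(x)
    return (gammaln(K)                      # log((K-1)!)
            + (K - 1) * np.log(lam)
            + np.sum(log_alpha - (lam + 1.0) * log_x)
            - K * logsumexp(log_alpha - lam * log_x))
```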
Properties of the Concrete distribution
Like the better-known Dirichlet distribution, it includes the uniform distribution on the simplex, but it is otherwise distinct from the Dirichlet distribution. It has a temperature parameter, named by analogy with the Boltzmann (or Gibbs) distribution in thermodynamics. In the zero-temperature limit, samples become concentrated at the corners of the simplex and the distribution approximates a categorical distribution: the limit of a continuous distribution approximates a discrete one, hence the portmanteau "Concrete" (CONtinuous relaxation of disCRETE). There are additionally K parameters that are easily interpreted as unnormalized probabilities for the K categories in this limit; a single constraint reduces these to K − 1 independent parameters. The Concrete distribution has applications in machine learning for stochastic neural networks with discrete random variables, such as random sampling from discrete distributions and estimation of parameter gradients through backpropagation. One way of constructing the Concrete distribution is by combining Gumbel distributions through the softmax function, hence its alternative name, the Gumbel-softmax distribution. The differentiability of the softmax function, unlike the argmax function, allows for backpropagation.
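A minimal NumPy sketch of that construction follows: Gumbel noise perturbs the log-parameters, and a temperature-scaled softmax maps the result onto the simplex. The function name and example values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_concrete(logits, temperature):
    """Draw one sample from a Concrete (Gumbel-softmax) distribution.

    logits: unnormalized log-probabilities for the K categories.
    temperature: > 0; lower values push samples toward simplex corners.
    """
    # Gumbel(0, 1) noise via the inverse-CDF trick: -log(-log(U)).
    gumbel = -np.log(-np.log(rng.uniform(size=len(logits))))
    # Perturb the logits, rescale by the temperature, apply softmax.
    scaled = (logits + gumbel) / temperature
    scaled -= scaled.max()          # numerical stability
    expd = np.exp(scaled)
    return expd / expd.sum()        # a point on the probability simplex

logits = np.log([0.7, 0.2, 0.1])
print(sample_concrete(logits, temperature=0.1))  # nearly one-hot
print(sample_concrete(logits, temperature=5.0))  # well inside the simplex
```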
In-Distribution Interpretability for Challenging Modalities
Heiß, Cosmas, Levie, Ron, Resnick, Cinjon, Kutyniok, Gitta, Bruna, Joan
It is widely recognized that the predictions of deep neural networks are difficult to parse relative to those of simpler approaches. However, methods for investigating the mode of operation of such models have advanced rapidly in the past few years. Recent work introduced an intuitive framework that utilizes generative models to improve the meaningfulness of such explanations. In this work, we demonstrate the flexibility of this method by interpreting diverse and challenging modalities: music and physical simulations of urban environments.
- North America > United States > New York (0.04)
- Europe > Norway > Northern Norway > Troms > Tromsø (0.04)
DBSN: Measuring Uncertainty through Bayesian Learning of Deep Neural Network Structures
Deng, Zhijie, Luo, Yucen, Zhu, Jun, Zhang, Bo
Bayesian neural networks (BNNs) introduce uncertainty estimation to deep networks by performing Bayesian inference on network weights. However, such models pose inference challenges, and BNNs with weight uncertainty rarely achieve performance superior to that of standard models. In this paper, we investigate a new line of Bayesian deep learning by performing Bayesian reasoning on the structure of deep neural networks. Drawing inspiration from neural architecture search, we define the network structure as gating weights on the redundant operations between computational nodes, and apply stochastic variational inference techniques to learn the structure distributions of networks. Empirically, the proposed method substantially surpasses advanced deep neural networks across a range of classification and segmentation tasks. More importantly, our approach preserves the benefits of Bayesian principles, producing better uncertainty estimates than strong baselines, including MC dropout and variational BNN algorithms (e.g., noisy EK-FAC).
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.64)
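The structure parameterization described in the DBSN abstract, gating weights over redundant candidate operations, is commonly made differentiable with a relaxed categorical (Concrete/Gumbel-softmax) gate. The sketch below shows that general pattern only; the class name, the candidate operations, and the point-estimate gate logits are illustrative assumptions, not DBSN's actual variational parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEdge(nn.Module):
    """One edge of a cell: a relaxed categorical gate over candidate ops.

    Illustrative only; DBSN learns a distribution over these gates with
    stochastic variational inference rather than a single point estimate.
    """
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
        ])
        # Logits over the candidate operations on this edge.
        self.gate_logits = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, temperature=1.0):
        # Sample relaxed one-hot gate weights; the reparameterized sample
        # lets gradients flow back to gate_logits through backpropagation.
        w = F.gumbel_softmax(self.gate_logits, tau=temperature)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```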
Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment
Yasuda, Yusuke, Wang, Xin, Yamagishi, Junichi
Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based methods. Recently, hard-attention-based methods have been proposed to prevent fatal alignment errors, but their methods for sampling discrete alignments are poorly investigated. This research investigates various combinations of sampling methods and probability distributions for alignment transition modeling in a hard-alignment-based sequence-to-sequence TTS method called SSNT-TTS. We formulate the common sampling methods for discrete variables, including greedy search, beam search, and random sampling from a Bernoulli distribution, in a more general way. Furthermore, we introduce the binary Concrete distribution to model discrete variables more appropriately. The results of a listening test show that deterministic search is preferable to stochastic search, and that the binary Concrete distribution is robust with stochastic search for natural alignment transitions.
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.75)
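The binary Concrete distribution mentioned above is the two-category special case: a relaxed Bernoulli whose samples live in (0, 1) and sharpen toward hard 0/1 decisions as the temperature drops, which suits a per-step "shift vs. stay" alignment transition. A minimal PyTorch sketch follows; the function name and example values are illustrative assumptions, not taken from SSNT-TTS.

```python
import torch

def sample_binary_concrete(logit, temperature):
    """Relaxed Bernoulli (binary Concrete) sample in (0, 1).

    logit: log(p / (1 - p)) of the underlying transition probability;
    as temperature -> 0 the sample approaches a hard 0/1 decision.
    """
    u = torch.rand_like(logit)
    # Logistic(0, 1) noise via log(u) - log(1 - u).
    noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logit + noise) / temperature)

# E.g. a relaxed "shift vs. stay" alignment decision for one decoder step.
logit = torch.tensor([0.8])
print(sample_binary_concrete(logit, temperature=0.3))
```

PyTorch also ships this sampler as torch.distributions.RelaxedBernoulli, which implements the same reparameterized draw.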