 distilling


Moonshine: Distilling with Cheap Convolutions

Neural Information Processing Systems

Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.
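The abstract does not say which transformation turns the teacher into the student. As a rough illustration of where the memory saving could come from, the sketch below assumes the "cheap convolution" is a grouped convolution and counts weights for one layer. The helper `conv_params` and the channel sizes are hypothetical and are not taken from the paper.

```python
def conv_params(c_in, c_out, kernel=3, groups=1):
    """Number of weights in a 2D convolution layer (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel * kernel


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(64, 64)           # teacher-style full 3x3 convolution
grouped = conv_params(64, 64, groups=8)  # assumed "cheap" replacement

print(standard, grouped, standard / grouped)
```

Under this assumption the layer keeps its input and output shape, so the surrounding architecture and hyperparameters can stay unchanged while the weight count drops by the number of groups.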


Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

arXiv.org Artificial Intelligence

Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and, through in-context learning, distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 on four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species from texts.


Distilling the Knowledge in Data Pruning

arXiv.org Artificial Intelligence

With the increasing size of datasets used for training neural networks, data pruning becomes an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
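The combination of ground-truth labels with a teacher's soft predictions can be sketched as a weighted loss. This is a minimal illustration in plain Python, not the authors' code: the function names, the `alpha` weight (standing in for the "knowledge distillation weight" the abstract mentions), and the `temperature` parameter are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """(1 - alpha) * hard-label loss + alpha * soft-label (teacher) loss.

    alpha and temperature are assumed hyperparameters for this sketch.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1.0 - alpha) * hard + alpha * soft
```

With `alpha=0` the loss reduces to ordinary supervised training on the pruned subset; raising `alpha` shifts weight toward the teacher's predictions. Some formulations also rescale the soft term by the squared temperature, which this sketch omits.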


Distilling What We Know

Communications of the ACM

The sheer size and complexity of today's generative pretrained transformer (GPT) models is nothing less than astounding. OpenAI's GPT-3, for example, possesses somewhere in the neighborhood of 175 billion parameters, and there is speculation GPT-4 could have as many as 10 trillion parameters. All of this introduces enormous overhead in terms of required cloud resources, including compute cycles and energy consumption. At the moment, the computer power required to train state-of-the-art artificial intelligence (AI) models is rising at a rate of 15x every two years. The cost of training a large GPT model can run into the millions of dollars.


How knowledge distillation compresses neural networks

#artificialintelligence

If you've ever used a neural network to solve a complex problem, you know they can be enormous in size, containing millions of parameters. For instance, the famous BERT model has about 110 million. To illustrate the point, the recent State of AI Report 2020 by Nathan Benaich and Ian Hogarth summarizes the number of parameters of the most common architectures in natural language processing (NLP). In Kaggle competitions, the winning models are often ensembles, composed of several predictors. Although they can beat simple models by a large margin in terms of accuracy, their enormous computational costs make them utterly unusable in practice. Is there any way to leverage these powerful but massive models to train state-of-the-art models without scaling the hardware?


Distilling a Neural Network into a soft decision tree

#artificialintelligence

As part of our commitment to continuous, cutting-edge research at Razorthink Inc, we are launching a series of review papers that survey the best research done worldwide in deep learning, machine learning, data science, and artificial intelligence in general. Each week, we will pick one research paper, break it down to make it easier to understand, and take you through the research approach, the major takeaways, and finally its applicability to real use cases. Our first pick in the series is "Distilling a Neural Network into a soft decision tree" (download link at the bottom), written by Nicholas Frosst and Geoffrey Hinton (Google Brain Team). Deep neural networks have proven very effective at classification and prediction tasks on complex data. Most importantly, they are highly useful when the input data has a complex relationship with the target variable and the dimensionality of the input data is very high.


Distilling a Neural Network Into a Soft Decision Tree

arXiv.org Machine Learning

Deep neural networks have proved to be a very effective way to perform classification tasks. They excel when the input data is high dimensional, the relationship between the input and the output is complicated, and the number of labeled training examples is large [Szegedy et al., 2015, Wu et al., 2016, Jozefowicz et al., 2016, Graves et al., 2013]. But it is hard to explain why a learned network makes a particular classification decision on a particular test case. This is due to their reliance on distributed hierarchical representations. If we could take the knowledge acquired by the neural net and express the same knowledge in a model that relies on hierarchical decisions instead, explaining a particular decision would be much easier. We describe a way of using a trained neural net to create a type of soft decision tree that generalizes better than one learned directly from the training data.
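To make the idea of hierarchical soft decisions concrete, here is a minimal sketch of a depth-one soft decision tree: one inner node with a sigmoid gate and two leaves holding fixed class distributions. The function names and parameter layout are assumptions, and blending the leaves by path probability is a simplification chosen for this sketch rather than a detail stated in the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(x, weights, bias):
    """Probability of taking the right branch at an inner node."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def soft_tree_predict(x, root, left_leaf, right_leaf):
    """Blend two leaf distributions by the root's branch probability.

    root is a (weights, bias) pair; each leaf is a class distribution.
    """
    p_right = gate(x, *root)
    return [(1.0 - p_right) * l + p_right * r
            for l, r in zip(left_leaf, right_leaf)]


# Hypothetical parameters, for illustration only.
root = ([1.0, -1.0], 0.0)
left_leaf = [1.0, 0.0]
right_leaf = [0.0, 1.0]
prediction = soft_tree_predict([0.0, 0.0], root, left_leaf, right_leaf)
```

Because each decision is a single learned filter followed by a branch probability, the path an input takes can be inspected node by node, which is the kind of explanation the abstract contrasts with distributed representations.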