to

Transfer learning & The art of using Pre-trained Models in Deep Learning

I hope that you would now be able to apply pre-trained models to your problem statements. Be sure that the pre-trained model you have selected has been trained on a similar data set as the one that you wish to use it on. There are various architectures people have tried on different types of data sets and I strongly encourage you to go through these architectures and apply them on your own problem statements. Please feel free to discuss your doubts and concerns in the comments section.

tanglang96/MDENAS

Here we propose a method to extremely accelerate NAS, without reinforcement learning or gradient, just by sampling architectures from a distribution and comparing these architectures, estimating their relative performance rather than absolute performance, iteratively updating parameters of the distribution while training. Search codes will be released by Sherwood later!

Approach pre-trained deep learning models with caution

It seems like using these pre-trained models have become a new standard for industry best practices. After all, why wouldn't you take advantage of a model that's been trained on more data and compute than you could ever muster by yourself? Advances within the NLP space have also encouraged the use of pre-trained language models like GPT and GPT-2, AllenNLP's ELMo, Google's BERT, and Sebastian Ruder and Jeremy Howard's ULMFiT (for an excellent over of these models, see this TOPBOTs post). One common technique for leveraging pretrained models is feature extraction, where you're retrieving intermediate representations produced by the pretrained model and using those representations as inputs for a new model. These final fully-connected layers are generally assumed to capture information that is relevant for solving a new task.

LogME: Practical Assessment of Pre-trained Models for Transfer Learning

This paper studies task adaptive pre-trained model selection, an \emph{underexplored} problem of assessing pre-trained models so that models suitable for the task can be selected from the model zoo without fine-tuning. A pilot work~\cite{nguyen_leep:_2020} addressed the problem in transferring supervised pre-trained models to classification tasks, but it cannot handle emerging unsupervised pre-trained models or regression tasks. In pursuit of a practical assessment method, we propose to estimate the maximum evidence (marginalized likelihood) of labels given features extracted by pre-trained models. The maximum evidence is \emph{less prone to over-fitting} than the likelihood, and its \emph{expensive computation can be dramatically reduced} by our carefully designed algorithm. The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning: a pre-trained model with high LogME is likely to have good transfer performance. LogME is fast, accurate, and general, characterizing it as \emph{the first practical assessment method for transfer learning}. Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time. It outperforms prior methods by a large margin in their setting and is applicable to new settings that prior methods cannot deal with. It is general enough to diverse pre-trained models (supervised pre-trained and unsupervised pre-trained), downstream tasks (classification and regression), and modalities (vision and language). Code is at \url{https://github.com/thuml/LogME}.

Investigating Transferability in Pretrained Language Models

While probing is a common technique for identifying knowledge in the representations of pretrained models, it is unclear whether this technique can explain the downstream success of models like BERT which are trained end-to-end during finetuning. To address this question, we compare probing with a different measure of transferability: the decrease in finetuning performance of a partially-reinitialized model. This technique reveals that in BERT, layers with high probing accuracy on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks. In addition, dataset size impacts layer transferability: the less finetuning data one has, the more important the middle and later layers of BERT become. Furthermore, BERT does not simply find a better initializer for individual layers; instead, interactions between layers matter and reordering BERT's layers prior to finetuning significantly harms evaluation metrics. These results provide a way of understanding the transferability of parameters in pretrained language models, revealing the fluidity and complexity of transfer learning in these models.