Mnemosyne: Learning to Train Transformers with Transformers

Neural Information Processing Systems

In this work, we propose a new class of learnable optimizers, called Mnemosyne. It is based on novel spatio-temporal low-rank implicit-attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering that mitigates the catastrophic forgetting of LSTMs), (b) can successfully train Transformers using simple meta-training strategies that require minimal computational resources, and (c) matches the accuracy of SOTA hand-designed optimizers with carefully tuned hyperparameters (often producing top-performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We conduct an extensive empirical evaluation of Mnemosyne on: (a) fine-tuning a wide range of Vision Transformers (ViTs), from medium-size architectures to massive ViT-Hs (36 layers, 16 heads), (b) pre-training BERT models, and (c) soft prompt-tuning large 11B+ T5XXL models. We complement our results with a comprehensive theoretical analysis of the compact associative memory used by Mnemosyne, which we believe has not been done before.
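The abstract does not spell out the architecture, but the general recipe it names, low-rank (linear) implicit attention paired with a compact associative memory, can be sketched. Below is a minimal, illustrative Python sketch of a learned optimizer that attends over a running summary of per-parameter gradient features. Every concrete choice here (the feature set, the map phi, the projections W_q/W_k/W_v, all dimensions) is an assumption made for illustration, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

D_FEAT = 4   # gradient features per step (g, sign(g), |g|, momentum) -- illustrative
D_RANK = 8   # low-rank feature dimension of the implicit attention -- illustrative

# Stand-ins for meta-learned projections; a real learned optimizer would
# obtain these from meta-training, not random initialization.
W_q = rng.normal(size=(D_FEAT, D_RANK)) / np.sqrt(D_FEAT)
W_k = rng.normal(size=(D_FEAT, D_RANK)) / np.sqrt(D_FEAT)
W_v = rng.normal(size=(D_FEAT, 1)) / np.sqrt(D_FEAT)

def phi(x):
    # Positive feature map that linearizes attention:
    # softmax(q k^T) v is approximated by phi(q) (sum_t phi(k_t) v_t^T).
    return np.exp(x - 0.5 * np.sum(x**2, axis=-1, keepdims=True))

class LinearAttentionOptimizer:
    """Keeps an O(D_RANK) running summary per parameter instead of the full
    gradient history -- the 'compact associative memory' idea."""
    def __init__(self, n_params, lr=1e-3):
        self.lr = lr
        self.KV = np.zeros((n_params, D_RANK, 1))  # running sum of phi(k) v^T
        self.K1 = np.zeros((n_params, D_RANK))     # running sum of phi(k)
        self.m = np.zeros(n_params)                # momentum feature

    def step(self, params, grads):
        self.m = 0.9 * self.m + 0.1 * grads
        feats = np.stack([grads, np.sign(grads), np.abs(grads), self.m], axis=-1)
        q, k, v = feats @ W_q, feats @ W_k, feats @ W_v
        pq, pk = phi(q), phi(k)
        # Write the new (key, value) pair into the compact associative memory.
        self.KV += pk[:, :, None] * v[:, None, :]
        self.K1 += pk
        # Attention readout = normalized memory lookup with the current query.
        num = np.einsum('nr,nro->no', pq, self.KV)[:, 0]
        den = np.einsum('nr,nr->n', pq, self.K1) + 1e-8
        return params - self.lr * (num / den)

# Toy usage on f(w) = 0.5 * ||w||^2 (grad is w). With random, un-meta-trained
# projections the readout is not a descent direction; this only exercises
# the mechanics and the O(rank) state.
opt = LinearAttentionOptimizer(n_params=10, lr=1e-2)
w = rng.normal(size=10)
for _ in range(100):
    w = opt.step(w, grads=w)
```

The design point the sketch illustrates: because attention is linearized through phi, the optimizer stores only a rank-sized running summary per parameter rather than the full gradient history, which is what makes a space complexity comparable to hand-designed first-order optimizers plausible.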


Scaling Training of HuggingFace Transformers With Determined

#artificialintelligence

Training complex state-of-the-art natural language processing (NLP) models is now far easier thanks to HuggingFace, whose open-source library has become the go-to for data scientists and machine learning engineers who want to implement and configure state-of-the-art Transformer models with straightforward library calls. As a result, the library has become a staple for training NLP models at companies such as Baidu and Alibaba, and has contributed to state-of-the-art results on several NLP tasks. Our friends at Determined AI are hosting an exciting lunch-and-learn covering how to train HuggingFace Transformers at scale using Determined! Learn to train Transformers with distributed training, hyperparameter searches, and cheap spot instances -- all without modifying your code. Please consider joining on Wednesday, June 30th at 10 AM PT for a hands-on tutorial from Liam Li, a Senior Machine Learning Engineer at Determined AI, and Angela Jiang, a Product Manager at Determined AI (lunch included!).
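For context on what the session builds on, here is a minimal single-process fine-tuning sketch using the HuggingFace Trainer API; the checkpoint and dataset names are illustrative choices, not taken from the announcement. Determined's value-add is layering distributed training, hyperparameter search, and spot-instance management on top of a loop like this without code changes.

```python
# Minimal HuggingFace fine-tuning sketch (illustrative model/dataset choices).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a labeled text-classification dataset.
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)
dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=2e-5,
)

# Small subsample so the sketch runs quickly on one GPU.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)),
)
trainer.train()
```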