
Collaborating Authors

 Venigalla, Abhinav


BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

arXiv.org Artificial Intelligence

Large language models such as OpenAI's GPT-4 have become the dominant technology in modern natural language processing (Liu et al., 2023; Zhao et al., 2023). Trained on large corpora to predict the next token and refined with human feedback (Brown et al., 2020; Ouyang et al., 2022; Ziegler et al., 2020), these models develop impressive capabilities in areas such as summarization and question-answering (Zhang et al., 2023; Goyal et al., 2023; Karpukhin et al., 2020). While the focus has been on these models' performance when responding to general English prompts, it is clear there is potential for specialist models to impact biomedical research and healthcare (Arora and Arora, 2023; Shah et al., 2023; Thirunavukarasu et al., 2023). Such applications include information retrieval and summarization from the ever-expanding biomedical literature (Wang et al., 2021; Yang, 2020) and from clinical text such as physician notes in electronic health records and radiology reports (Murray et al., 2021; Feblowitz et al., 2011; Zhang et al., 2018). Improving domain-specific language models will help accelerate biomedical discovery, drive down healthcare costs, and improve patient care. Large, general models like GPT-4 and Med-PaLM 2 have set new standards for performance on question-answering and information extraction (Kung et al., 2022; Singhal et al., 2023a,b), but there are several drawbacks to these models. They are costly to train and utilize. Compute for training and inference of large language models has increased 10- to 100-fold since 2015 (Sevilla et al., 2022), translating to extremely high financial and
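As a minimal illustration of the next-token-prediction objective mentioned above, the sketch below computes a causal language-modeling loss in PyTorch. The function name and tensor shapes are illustrative assumptions, not BioMedLM's actual training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the token at t + 1.

    logits:    (batch, seq_len, vocab_size) scores from a causal language model
    token_ids: (batch, seq_len) integer token ids of the training text
    """
    # Shift so position t predicts token t + 1: drop the last logit and the first token.
    shifted_logits = logits[:, :-1, :].contiguous()
    targets = token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        targets.view(-1),
    )
```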


MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

arXiv.org Artificial Intelligence

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to competitive BERT base and large baselines. This empirical speed-up in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models.
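As a rough sketch of the 30% masking ratio in the MLM objective described above, the PyTorch snippet below prepares masked inputs and labels. The 80/10/10 split between [MASK], random-token, and unchanged positions follows the original BERT recipe and is an assumption here; MosaicBERT's actual data pipeline may differ.

```python
import torch

def mlm_mask(token_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
             mask_ratio: float = 0.30):
    """Return (masked inputs, labels) for masked language modeling."""
    labels = token_ids.clone()
    # Select 30% of positions as prediction targets.
    selected = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    labels[~selected] = -100  # ignored by the cross-entropy loss

    inputs = token_ids.clone()
    # 80% of selected positions are replaced with [MASK].
    mask_positions = selected & (torch.rand_like(token_ids, dtype=torch.float) < 0.8)
    inputs[mask_positions] = mask_token_id
    # Half of the remaining selected positions get a random token; the rest stay unchanged.
    random_positions = (selected & ~mask_positions
                        & (torch.rand_like(token_ids, dtype=torch.float) < 0.5))
    inputs[random_positions] = torch.randint_like(token_ids, vocab_size)[random_positions]
    return inputs, labels
```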


Representation range needs for 16-bit neural network training

arXiv.org Artificial Intelligence

Deep learning has grown rapidly thanks to its state-of-the-art performance across a wide range of real-world applications. While neural networks have traditionally been trained using IEEE-754 binary32 arithmetic, the rapid growth of computational demands in deep learning has boosted interest in faster, low-precision training. Mixed-precision training that combines IEEE-754 binary16 with IEEE-754 binary32 has been tried, and other 16-bit formats, for example Google's bfloat16, have become popular. In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes; denormal numbers extend the representation range. This raises the questions of how much exponent range is needed, whether there is a format between binary16 (5 exponent bits) and bfloat16 (8 exponent bits) that works better than either of them, and whether or not denormals are necessary. In this paper we study the need for denormal numbers in mixed-precision training, and we propose a 1/6/9 format, i.e., a 6-bit exponent and a 9-bit explicit mantissa, that offers a better range-precision tradeoff. We show that 1/6/9 mixed-precision training can speed up training on hardware that incurs a performance slowdown on denormal operations, or eliminate the need for denormal numbers altogether. And, for a number of fully connected and convolutional neural networks in computer vision and natural language processing, 1/6/9 achieves numerical parity with standard mixed-precision training.
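To make the range-precision tradeoff concrete, the back-of-the-envelope script below compares the largest and smallest representable magnitudes of binary16, the proposed 1/6/9 format, and bfloat16. It assumes IEEE-754-style conventions (bias of 2**(e-1) - 1, the all-ones exponent reserved for infinities and NaNs); the paper's exact encoding details may differ.

```python
def fp_range(exp_bits: int, mantissa_bits: int):
    """Largest normal, smallest normal, and smallest denormal magnitudes of a binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias            # largest finite exponent
    max_normal = (2 - 2 ** -mantissa_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    min_denormal = 2.0 ** (1 - bias - mantissa_bits)
    return max_normal, min_normal, min_denormal

for name, e, m in [("binary16 (1/5/10)", 5, 10),
                   ("1/6/9", 6, 9),
                   ("bfloat16 (1/8/7)", 8, 7)]:
    mx, mn, md = fp_range(e, m)
    print(f"{name:18s} max ~{mx:.3g}  min normal ~{mn:.3g}  min denormal ~{md:.3g}")
```

Under these assumptions, 1/6/9's largest finite value is roughly 4.3e9, sitting between binary16's 65504 and bfloat16's ~3.4e38, while offering two more mantissa bits than bfloat16.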


Adaptive Braking for Mitigating Gradient Delay

arXiv.org Machine Learning

Neural network training is commonly accelerated by using multiple synchronized workers to compute gradient updates in parallel. Asynchronous methods remove synchronization overheads and improve hardware utilization at the cost of introducing gradient delay, which impedes optimization and can lead to lower final model performance. We introduce Adaptive Braking (AB), a modification for momentum-based optimizers that mitigates the effects of gradient delay. AB dynamically scales the gradient based on the alignment of the gradient and the velocity. This can dampen oscillations along high-curvature directions of the loss surface, stabilizing and accelerating asynchronous training. We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k with delays D ≥ 32 update steps with minimal drop in final test accuracy.
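The sketch below illustrates the idea behind Adaptive Braking: rescale the (possibly delayed) gradient by a factor that depends on how well it aligns with the optimizer's velocity before taking a momentum step. The exact scaling rule is not given in the abstract; the cosine-similarity-based factor here is an illustrative assumption, not the paper's formula.

```python
import torch

def ab_sgd_momentum_step(param: torch.Tensor, grad: torch.Tensor,
                         velocity: torch.Tensor, lr: float = 0.1,
                         momentum: float = 0.9) -> torch.Tensor:
    """One SGD-with-momentum step where the incoming gradient is rescaled by its
    alignment with the current velocity (hypothetical scaling rule)."""
    eps = 1e-12
    # Alignment in [-1, 1] between the gradient and the velocity.
    cos = torch.dot(grad.flatten(), velocity.flatten()) / (
        grad.norm() * velocity.norm() + eps)
    # Damp the update when the gradient opposes the velocity (oscillation along
    # high-curvature directions); keep it near full strength when aligned.
    scale = 0.5 * (1.0 + cos)
    velocity.mul_(momentum).add_(scale * grad)
    param.add_(velocity, alpha=-lr)
    return param
```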