The last few years have seen the rise of transformer deep learning architectures to build natural language processing (NLP) model families. The adaptations of the transformer architecture in models such as BERT, RoBERTa, T5, GPT-2, and DistilBERT outperform previous NLP models on a wide range of tasks, such as text classification, question answering, summarization, and text generation. These models are exponentially growing larger in size from several million parameters to several hundred billion parameters. As the number of model parameters increases, so does the computational infrastructure that is necessary to train these models. This requires a significant amount of time, skill, and compute resources to train and optimize the models.
Hugging Face is the technology startup, with an active open-source community, that drove the worldwide adoption of transformer-based models thanks to its eponymous Transformers library. Earlier this year, Hugging Face and AWS collaborated to enable you to train and deploy over 10,000 pre-trained models on Amazon SageMaker. For more information on training Hugging Face models at scale on SageMaker, refer to AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models and the sample notebooks. In this post, we discuss different methods to create a SageMaker endpoint for a Hugging Face model. If you're unfamiliar with transformer-based models and their place in the natural language processing (NLP) landscape, here is an overview.
Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped in the development of state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for text analysis have also evolved. This necessitates the requirement for a fully managed service that can be integrated into applications using API calls without the need for extensive machine learning (ML) expertise. AWS offers pre-trained AWS AI services like Amazon Comprehend, which can effectively handle NLP use cases involving classification, text summarization, entity recognition, and more to gather insights from text.
But again, this is just a first attempt. When the project is in the experimentation phase, these two parameters can and should be changed to see if the model performance changes. If you're already familiar with text generation, you might know there are many more parameters to influence the text a model generates, such as beam search, sampling, and temperature. These parameters give you more control over the text that is being generated, for example make the text more fluent and less repetitive.
Amazon SageMaker Serverless Inference (Preview) was recently announced at re:Invent 2021 as a new model hosting feature that lets customers serve model predictions without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. Serverless Inference is a new deployment capability that complements SageMaker's existing options for deployment that include: SageMaker Real-Time Inference for workloads with low latency requirements in the order of milliseconds, SageMaker Batch Transform to run predictions on batches of data, and SageMaker Asynchronous Inference for inferences with large payload sizes or requiring long processing times. Serverless Inference means that you don't need to configure and manage the underlying infrastructure hosting your models. When you host your model on a Serverless Inference endpoint, simply select the memory and max concurrent invocations. Then, SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume.