MatFormer: Nested Transformer for Elastic Inference
Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. We present MatFormer, a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model.
Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain
Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios require practitioners to train foundation models such as PaLM 2, Llama, and ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting fine-grained control over relevant tradeoffs such as latency, cost, and accuracy. This work introduces MatFormer, a nested Transformer architecture designed to offer elasticity across a variety of deployment constraints. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks. This training procedure allows Mix'n'Match of model granularities across layers: a trained universal MatFormer model enables extraction of hundreds of accurate smaller models that were never explicitly optimized. We empirically demonstrate MatFormer's effectiveness across different model classes (decoders and encoders), modalities (language and vision), and scales (up to 2.6B parameters). We find that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning 1.5B to 2.6B parameters, each exhibiting validation loss and one-shot downstream evaluations comparable to their independently trained counterparts. Furthermore, we observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure needed for adaptive large-scale retrieval. Finally, we showcase that speculative decoding with the accurate and consistent submodels extracted from MatFormer can further reduce inference latency.
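The nesting described in the abstract can be sketched in a few lines: every smaller FFN granularity is a prefix slice of the universal model's weight matrices, so a submodel is extracted by slicing rather than retraining. This is a minimal NumPy sketch of that idea, not the paper's code; the shapes, seed, and function names are illustrative assumptions.

```python
import numpy as np

def nested_ffn(x, w_in, w_out, m):
    """FFN forward pass restricted to the first m hidden units.

    A nested sub-granularity reuses the leading m columns of w_in and
    the leading m rows of w_out, so smaller models are literal slices
    of the universal model's weights (a sketch of MatFormer's idea).
    """
    h = np.maximum(x @ w_in[:, :m], 0.0)  # ReLU over the nested sub-block
    return h @ w_out[:m, :]

# Toy universal model (hypothetical sizes, random weights for illustration).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w_in = rng.standard_normal((d_model, d_ff))
w_out = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal((1, d_model))

# Mix'n'Match: at inference time each layer can independently pick a
# granularity (here m in {8, 16, 32}) without any additional training.
outputs = {m: nested_ffn(x, w_in, w_out, m) for m in (8, 16, 32)}
```

In the real architecture the granularities are jointly optimized during training so that every prefix slice is itself an accurate model; the slicing shown here is only the extraction step.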
Optimizing TensorFlow model serving with Kubernetes and Amazon Elastic Inference (Amazon Web Services)
The only aspect of the code that isn't straightforward is the need to enable EC2 instance termination protection while workers are processing videos, as shown in the following code example. After a job finishes processing, a similar API call disables termination protection. The example application uses termination protection because the jobs are long-running, and you don't want an EC2 instance terminated during a scale-in event while it is still processing a video. You can easily modify the inference code and optimize it for your use case, so this post doesn't examine it further. To review the Dockerfile for the inference code, see the amazon-elastic-inference-eks GitHub repo (/Dockerfile). The inference code itself is in the test.py file.
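A hedged sketch of the protect/unprotect pattern described above, not the post's actual code: since the concern is a scale-in event, this uses the real Auto Scaling SetInstanceProtection API via boto3. The function names, instance ID, and Auto Scaling group name are placeholder assumptions; the post's application may toggle protection through a different call.

```python
def scale_in_protection_request(instance_id, asg_name, protect):
    # Parameters for the Auto Scaling SetInstanceProtection API.
    # ProtectedFromScaleIn=True keeps this instance from being chosen
    # for termination during a scale-in event.
    return {
        "InstanceIds": [instance_id],
        "AutoScalingGroupName": asg_name,
        "ProtectedFromScaleIn": protect,
    }

def set_scale_in_protection(instance_id, asg_name, protect):
    # Deferred import so the request builder above works without boto3.
    import boto3
    autoscaling = boto3.client("autoscaling")
    autoscaling.set_instance_protection(
        **scale_in_protection_request(instance_id, asg_name, protect)
    )

# Worker loop pattern: protect before the long-running job, release after.
# set_scale_in_protection("i-0123456789abcdef0", "video-workers", True)
# ... process the video ...
# set_scale_in_protection("i-0123456789abcdef0", "video-workers", False)
```

The same shape works for API-level termination protection (EC2 ModifyInstanceAttribute with DisableApiTermination) if that is what the application uses instead.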
Run ONNX models with Amazon Elastic Inference (Amazon Web Services)
At re:Invent 2018, AWS announced Amazon Elastic Inference (EI), a new service that lets you attach just the right amount of GPU-powered inference acceleration to any Amazon EC2 instance. This is also available for Amazon SageMaker notebook instances and endpoints, bringing acceleration to built-in algorithms and to deep learning environments. In this blog post, I show how to use the models in the ONNX Model Zoo on GitHub to perform inference by using MXNet with Elastic Inference Accelerator (EIA) as a backend. Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75 percent. Amazon Elastic Inference provides support for Apache MXNet, TensorFlow, and ONNX models.
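The MXNet-on-EIA workflow described above can be sketched roughly as follows. This is an illustrative outline, not the post's code: it assumes an image model from the ONNX Model Zoo, the EI-enabled Apache MXNet build, an attached accelerator exposing the `mx.eia()` context, and an ONNX graph whose input is named "data" (all of which vary by model and setup).

```python
import numpy as np

def preprocess(img):
    # Generic image preprocessing for an ONNX Model Zoo vision model:
    # HWC uint8 -> NCHW float32 scaled to [0, 1]. Many models also
    # require mean/std normalization, which is model-specific.
    x = img.astype(np.float32) / 255.0
    return np.transpose(x, (2, 0, 1))[np.newaxis, ...]

def run_onnx_on_eia(model_path, img):
    # Requires the EI-enabled MXNet build on an instance (or SageMaker
    # notebook/endpoint) with an Elastic Inference accelerator attached.
    import mxnet as mx
    # Import the ONNX graph into MXNet symbol + parameters.
    sym, arg_params, aux_params = mx.contrib.onnx.import_model(model_path)
    # Bind the module to the EIA context instead of mx.cpu()/mx.gpu().
    mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
    data = mx.nd.array(preprocess(img))
    mod.bind(for_training=False, data_shapes=[("data", data.shape)])
    mod.set_params(arg_params, aux_params, allow_missing=True)
    mod.forward(mx.io.DataBatch([data]))
    return mod.get_outputs()[0].asnumpy()
```

Swapping `mx.eia()` for `mx.cpu()` runs the same code without an accelerator, which is a convenient way to validate the model locally before attaching EI.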
AI Weekly: 6 important machine learning developments from AWS re:Invent
This week in Las Vegas, Amazon rolled out dozens of new features, upgrades, and products at AWS re:Invent. Here's a quick roundup of news out of the annual conference that may matter to members of the AI community. A disproportionate amount of money is spent on inference versus training when it comes to AI models, AWS CEO Andy Jassy said, and GPUs can be terribly inefficient at it. To address these issues, Amazon custom-designed a chip named Inferentia, due out next year, and created Elastic Inference, a service that identifies the parts of a neural network that can benefit from acceleration. To speed up training of AI models, Amazon introduced AWS-Optimized TensorFlow, which can train a model on the ResNet-50 benchmark in 14 minutes.
Amazon's self-driving AI robo-car – THE TRUTH (it's a few inches in size) • The Register
It already has quite a few smart code confections: Rekognition, Lex, Polly, Transcribe, Comprehend, Translate, SageMaker, and Greengrass, among others. At its re:Invent gathering in Las Vegas today, AWS threw a handful of new flavors into the mix, among them: Elastic Inference, SageMaker Ground Truth, SageMaker RL, Amazon SageMaker Neo, Personalize, Forecast, Textract, and Comprehend Medical. It also teased a machine-learning inference chip called Inferentia, and a small radio-controlled car called DeepRacer for executing autonomous driving models in the real world and terrifying pets. It's a 1/18th-scale race car ostensibly intended to help people understand and implement reinforcement learning. It may also help with customer acquisition, retention, and spending.