Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
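The compound scaling idea can be sketched in a few lines. The coefficients below (α=1.2, β=1.1, γ=1.15, chosen under the constraint α·β²·γ² ≈ 2) are the values reported in the EfficientNet paper; the baseline dimensions and the rounding are hypothetical simplifications, not the paper's exact implementation.

```python
# Compound scaling: scale depth, width, and resolution together by a
# single coefficient phi, instead of tuning each dimension arbitrarily.
# alpha/beta/gamma are the paper's reported constants (alpha * beta^2 * gamma^2 ~= 2,
# so each step of phi roughly doubles FLOPs).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(phi, base_depth, base_width, base_resolution):
    """Uniformly scale all three dimensions by the compound coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)        # number of layers
    width = round(base_width * BETA ** phi)         # channels per layer
    resolution = round(base_resolution * GAMMA ** phi)  # input image size
    return depth, width, resolution

# Example with a hypothetical small baseline and phi = 1:
print(compound_scale(1, base_depth=18, base_width=32, base_resolution=224))
```

With one language-wide set of coefficients, a whole family of models (B0 through B7) falls out of a single grid search on the smallest network, rather than a separate search per model size.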
Recently, the Google Brain team published their EfficientDet model for object detection, with the goal of crystallizing architecture decisions into a scalable framework that can be easily applied to other object detection use cases. The paper concludes that EfficientDet outperforms similar-sized models on benchmark datasets. At Roboflow, we found that the base EfficientDet model generalizes to custom datasets hosted on our platform. In this blog post, we explore the rationale behind the decisions made in forming the final EfficientDet model, how EfficientDet works, and how EfficientDet compares to popular object detection models like YOLOv3, Faster R-CNN, and MobileNet. Before exploring the model, here are some key areas that have prevented object detection systems from being deployed to real-life use cases.
Ensembling is a simple and popular technique for boosting evaluation performance by training multiple models (e.g., with different initializations) and aggregating their predictions. This approach is commonly reserved for the largest models, as it is widely held that increasing model size yields a larger reduction in error than ensembling smaller models. However, our experiments on CIFAR-10 and ImageNet show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute, even when those individual models' weights and hyperparameters are highly optimized. Furthermore, this gap widens as models grow larger. This suggests that the output diversity gained from ensembling can often be more efficient than training larger models, especially once models approach the capacity their dataset can support. Instead of following the common practice of tuning a single large model, one can use ensembles as a more flexible trade-off between a model's inference speed and accuracy. This also potentially eases hardware design, e.g., by making it easier to parallelize the model across multiple workers for real-time or distributed inference.
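The aggregation step described above is typically just an average of the member models' predicted class probabilities. A minimal, framework-agnostic sketch (the toy "models" below are hypothetical stand-ins for independently trained networks):

```python
# Ensemble by averaging class-probability outputs, then taking the argmax.
# Each member of `models` is any callable mapping an input to a list of
# class probabilities; here we use fixed toy outputs over 3 classes.

def ensemble_predict(models, x):
    """Average probability outputs across models and return (probs, label)."""
    outputs = [m(x) for m in models]
    n_models = len(outputs)
    n_classes = len(outputs[0])
    probs = [sum(o[c] for o in outputs) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=probs.__getitem__)
    return probs, label

# Hypothetical trained members disagreeing slightly on the same input:
m1 = lambda x: [0.6, 0.3, 0.1]
m2 = lambda x: [0.4, 0.5, 0.1]
m3 = lambda x: [0.5, 0.4, 0.1]

probs, label = ensemble_predict([m1, m2, m3], x=None)
print(probs, label)  # averaged probabilities, predicted class 0
```

Because each member runs independently, inference parallelizes trivially across workers, which is the hardware-design benefit noted above.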
Google researchers have open-sourced EfficientNets, a method for scaling up CNN models that they claim is up to 10 times more efficient than current "state-of-the-art" techniques. The method is detailed in a paper being presented at next month's International Conference on Machine Learning, and promises to remove at least some of the "tedious manual tuning" conventional methods require. According to Mingxing Tan, Staff Software Engineer, and Quoc V. Le, Principal Scientist, at Google AI, the researchers set out to find a way to scale up a CNN more accurately and efficiently than conventional practice, which is to "arbitrarily increase the CNN depth or width, or to use larger input image resolution for training and evaluation." "While these methods do improve accuracy, they usually require tedious manual tuning, and still often yield suboptimal performance," the team points out. Their alternative was to use "a simple yet highly effective compound coefficient to scale up CNNs in a more structured manner. Unlike conventional approaches that arbitrarily scale network dimensions, such as width, depth and resolution, our method uniformly scales each dimension with a fixed set of scaling coefficients."
Members of the Google Brain team and Google AI this week open-sourced EfficientDet, an AI tool that achieves state-of-the-art object detection while using less compute. Creators of the system say it also achieves faster performance on CPUs and GPUs than other popular object detection models like YOLO or AmoebaNet. When tasked with semantic segmentation, a task related to object detection, EfficientDet also achieves exceptional performance; semantic segmentation experiments were conducted on the PASCAL Visual Object Classes (VOC) dataset. EfficientDet builds on EfficientNet, a family of advanced image classification models made available last year for Coral boards.