We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Four years ago, a paper by Yann LeCun and his collaborators was rejected by the leading computer vision conference on the grounds that it used neural networks and therefore provided no insight into how to design a vision system. At the time, most computer vision researchers believed that a vision system needed to be carefully hand-designed using a detailed understanding of the nature of the task. They assumed that the task of classifying objects in natural images would never be solved by simply presenting examples of images and the names of the objects they contained to a neural network that acquired all of its knowledge from this training data. What many in the vision research community failed to appreciate was that methods that require careful hand-engineering by a programmer who understands the domain do not scale as well as methods that replace the programmer with a powerful general-purpose learning procedure.
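As a rough sanity check on the "60 million parameters" figure quoted in the abstract, the sketch below counts weights and biases layer by layer. The layer shapes are not stated in the text above; they are an assumption based on the commonly cited configuration of this network (including the two-GPU split, under which some convolutional layers see only half of the previous layer's channels). The table names `CONV_LAYERS` and `FC_LAYERS` and the function `count_parameters` are hypothetical, chosen for illustration only.

```python
# Assumed layer shapes (not given in the abstract): the commonly cited
# configuration with a two-GPU split, so conv2, conv4 and conv5 filters
# each see only half of the previous layer's channels.
# Format: (name, kernel_height, kernel_width, input_channels_per_filter, filters)
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),
    ("conv5", 3, 3, 192, 256),
]

# Format: (name, inputs, outputs). fc6 is assumed to read a 6x6x256 feature map.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),  # feeds the final 1000-way softmax
]


def count_parameters():
    """Return a dict mapping layer name to its number of weights plus biases."""
    counts = {}
    for name, kh, kw, c_in, c_out in CONV_LAYERS:
        counts[name] = kh * kw * c_in * c_out + c_out  # one bias per filter
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out  # one bias per output unit
    return counts


if __name__ == "__main__":
    counts = count_parameters()
    total = sum(counts.values())
    fc_total = sum(counts[name] for name, _, _ in FC_LAYERS)
    for name, n in counts.items():
        print(f"{name}: {n:,}")
    print(f"total: {total:,}")
    print(f"share in fully connected layers: {fc_total / total:.1%}")
```

Under these assumed shapes the total lands close to the quoted 60 million, and the large majority of parameters sit in the three fully connected layers, which is consistent with the abstract applying dropout there.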
Like the large-vocabulary speech recognition paper we looked at yesterday, today's paper has also been described as a landmark paper in the history of deep learning. The ImageNet dataset contains over 15 million labeled high-resolution images of objects in roughly 22,000 categories. The annual ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) competition uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. There are 1.2M training images, 50,000 validation images, and 150,000 testing images. Error rates are reported as top-1 and top-5: the top-5 error rate is the fraction of test images for which the correct label is not among the five labels the model considers most likely.
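To make the top-5 reporting convention concrete, here is a minimal sketch of how a top-k error rate could be computed from per-class scores. The function name `top_k_error` and the toy scores are hypothetical, and the example uses 6 classes rather than 1000 purely to keep it readable; it is not the competition's official evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores.

    scores: list of per-example lists, one score per class
    labels: list of true class indices, one per example
    """
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data (illustrative only): 4 examples, 6 classes.
toy_scores = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],  # true class 0: ranked first
    [0.10, 0.40, 0.30, 0.10, 0.05, 0.05],  # true class 2: ranked second
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 5: ranked last
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],  # true class 5: ranked first
]
toy_labels = [0, 2, 5, 5]

if __name__ == "__main__":
    print("top-1 error:", top_k_error(toy_scores, toy_labels, k=1))
    print("top-5 error:", top_k_error(toy_scores, toy_labels, k=5))
```

With k=1 the function reduces to the ordinary classification error, so top-5 error can never exceed top-1 error on the same predictions.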
Convolutional neural networks (CNNs) are the state of the art for image classification tasks. In this paper, we briefly discuss the different components of a CNN and explain different CNN architectures for image classification, tracing the advancements from LeNet-5 to the latest SENet model. We discuss the model description and training details of each model, and we draw a comparison among those models.
Convolutional neural networks are fantastic for visual recognition tasks. Good ConvNets are beasts with millions of parameters and many hidden layers. In fact, a bad rule of thumb is: 'the higher the number of hidden layers, the better the network'. AlexNet, VGG, Inception, and ResNet are some of the popular networks. Why do these networks work so well? Why do they have the structures they have?