Recognizing facial action units (AUs) from spontaneous facial expressions is still a challenging problem. Most recently, CNNs have shown promise on facial AU recognition. However, the learned CNNs are often overfitted and do not generalize well to unseen subjects due to limited AU-coded training images. We proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN.
Standard pattern recognition systems usually involve a segmentation step prior to the recognition step. For example, it is very common in character recognition to segment characters in a pre-processing step then normalize the individual characters and pass them to a recognition engine such as a neural network, as in the work of LeCun et al. 1988, Martin and Pittman (1988). This separation between segmentation and recognition becomes unreliable if the characters are touching each other, touching bounding boxes, broken, or noisy. Other applications such as scene analysis or continuous speech recognition pose similar and more severe segmentation problems. The difficulties encountered in these applications present an apparent dilemma: one cannot recognize the patterns 496 *email@example.comReprint
It turns out that the Boltzmann machines of that era did not scale up well, but other architectures designed by Hinton, LeCun, Bengio, Olshausen, Osindero, Sutskever, Courville, Ng, and others did. Was it the one-layer-at-a-time training technique? GPU clusters that allow faster training? I can't say for sure, and I hope that continued analysis will give us a better picture. But I can say that in speech recognition, computer vision object recognition, the game of Go, and other fields, the difference has been dramatic: error rates go down when you use deep learning, and both these fields have undergone a complete transformation in the last few years: essentially all the teams have chosen deep learning, because it just works.
Encouraged by the success of deep convolutional neural networks on a variety of visual tasks, much theoretical and experimental work has been aimed at understanding and interpreting how vision networks operate. At the same time, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves. Despite their empirical successes, comparatively little is understood about how these audio models accomplish these tasks.In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher level concepts such as parts-of-speech and context dependence also emerge in the later layers of the network.