Europe
An Improved Scheme for Detection and Labelling in Johansson Displays
Fanti, Claudio, Polito, Marzia, Perona, Pietro
Consider a number of moving points, where each point is attached to a joint of the human body and projected onto an image plane. Johannson showed that humans can effortlessly detect and recognize the presence of other humans from such displays. This is true even when some of the body points are missing (e.g. because of occlusion) and unrelated clutter points are added to the display. We are interested in replicating this ability in a machine. To this end, we present a labelling and detection scheme in a probabilistic framework. Our method is based on representing the joint probability density of positions and velocities of body points with a graphical model, and using Loopy Belief Propagation to calculate a likely interpretation of the scene. Furthermore, we introduce a global variable representing the body's centroid. Experiments on one motion-captured sequence suggest that our scheme improves on the accuracy of a previous approach based on triangulated graphical models, especially when very few parts are visible. The improvement is due both to the more general graph structure we use and, more significantly, to the introduction of the centroid variable.
Geometric Analysis of Constrained Curves
Srivastava, Anuj, Mio, Washington, Liu, Xiuwen, Klassen, Eric
We present a geometric approach to statistical shape analysis of closed curves in images. The basic idea is to specify a space of closed curves satisfying given constraints, and exploit the differential geometry of this space to solve optimization and inference problems. We demonstrate this approach by: (i) defining and computing statistics of observed shapes, (ii) defining and learning a parametric probability model on shape space, and (iii) designing a binary hypothesis test on this space.
A Sampled Texture Prior for Image Super-Resolution
Pickup, Lyndsey C., Roberts, Stephen J., Zisserman, Andrew
Super-resolution aims to produce a high-resolution image from a set of one or more low-resolution images by recovering or inventing plausible high-frequency image content. Typical approaches try to reconstruct a high-resolution image using the sub-pixel displacements of several lowresolution images, usually regularized by a generic smoothness prior over the high-resolution image space. Other methods use training data to learn low-to-high-resolution matches, and have been highly successful even in the single-input-image case. Here we present a domain-specific image prior in the form of a p.d.f.
Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification
Littlewort, G.C., Bartlett, M.S., Fasel, I.R., Chenu, J., Kanda, T., Ishiguro, H., Movellan, J.R.
Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM's. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.
Learning a Rare Event Detection Cascade by Direct Feature Selection
Wu, Jianxin, Rehg, James M., Mullin, Matthew D.
Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as online learning.
Eye Micro-movements Improve Stimulus Detection Beyond the Nyquist Limit in the Peripheral Retina
Hennig, Matthias H., Wörgötter, Florentin
Even under perfect fixation the human eye is under steady motion (tremor, microsaccades, slow drift). The "dynamic" theory of vision [1, 2] states that eye-movements can improve hyperacuity. According to this theory, eye movements are thought to create variable spatial excitation patterns on the photoreceptor grid, which will allow for better spatiotemporal summation at later stages.
Nonlinear Processing in LGN Neurons
Bonin, Vincent, Mante, Valerio, Carandini, Matteo
According to a widely held view, neurons in lateral geniculate nucleus (LGN) operate on visual stimuli in a linear fashion. There is ample evidence, however, that LGN responses are not entirely linear. To account for nonlinearities we propose a model that synthesizes more than 30 years of research in the field. Model neurons have a linear receptive field, and a nonlinear, divisive suppressive field. The suppressive field computes local root-meansquare contrast. To test this model we recorded responses from LGN of anesthetized paralyzed cats. We estimate model parameters from a basic set of measurements and show that the model can accurately predict responses to novel stimuli. The model might serve as the new standard model of LGN responses. It specifies how visual processing in LGN involves both linear filtering and divisive gain control.
Local Phase Coherence and the Perception of Blur
Wang, Zhou, Simoncelli, Eero P.
Blur is one of the most common forms of image distortion. It can arise from a variety of sources, such as atmospheric scatter, lens defocus, optical aberrations of the lens, and spatial and temporal sensor integration. Human observers are bothered by blur, and our visual systems are quite good at reporting whether an image appears blurred (or sharpened) [1, 2]. However, the mechanism by which this is accomplished is not well understood. Clearly, detection of blur requires some model of what constitutes an unblurred image. In recent years, there has been a surge of interest in the modelling of natural images, both for purposes of improving the performance of image processing and computer vision systems, and also for furthering our understanding of biological visual systems.
A Classification-based Cocktail-party Processor
Roman, Nicoleta, Wang, Deliang, Brown, Guy J.
At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID.
A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
Moreno, Pedro J., Ho, Purdy P., Vasconcelos, Nuno
Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM's. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM's and the generative classifiers in speaker identification and image classification.