Country
Shape Context: A New Descriptor for Shape Matching and Object Recognition
Belongie, Serge, Malik, Jitendra, Puzicha, Jan
We develop an approach to object recognition based on matching shapes and using a resulting measure of similarity in a nearest neighbor classifier. The key algorithmic problem here is that of finding pointwise correspondences between an image shape and a stored prototype shape. We introduce a new shape descriptor, the shape context, which makes this possible, using a simple and robust algorithm. We demonstrate that shape contexts greatly simplify recovery of correspondences between points of two given shapes. Once shapes are aligned, shape contexts are used to define a robust score for measuring shape similarity. We have used this score in a nearest-neighbor classifier for recognition of hand written digits as well as 3D objects, using exactly the same distance function.
Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
Tchorz, Jรผrgen, Kleinschmidt, Michael, Kollmeier, Birger
For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1 Introduction One of the major problems in automatic speech recognition (ASR) systems is their lack of robustness in noise, which severely degrades their usefulness in many practical applications. Several proposals have been made to increase the robustness of ASR systems, e.g. by model compensation or more noise-robust feature extraction [1, 2]. Another method to increase robustness of ASR systems is to suppress the background noise before feature extraction. Classical approaches for single-channel noise suppression are Spectral Subtraction [3] and related schemes, e.g.
FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks
Slaney, Malcolm, Covell, Michele
FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices.
Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech
Saul, Lawrence K., Allen, Jont B.
An eigenvalue method is developed for analyzing periodic structure in speech. Signals are analyzed by a matrix diagonalization reminiscent of methods for principal component analysis (PCA) and independent component analysis (ICA). Our method-called periodic component analysis (1l"CA)-uses constructive interference to enhance periodic components of the frequency spectrum and destructive interference to cancel noise. The front end emulates important aspects of auditory processing, such as cochlear filtering, nonlinear compression, and insensitivity to phase, with the aim of approaching the robustness of human listeners. The method avoids the inefficiencies of autocorrelation at the pitch period: it does not require long delay lines, and it correlates signals at a clock rate on the order of the actual pitch, as opposed to the original sampling rate.
Minimum Bayes Error Feature Selection for Continuous Speech Recognition
Saon, George, Padmanabhan, Mukund
Two avenues will be explored: the first is to maximize the ()-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of (). While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task. 1 Introduction Modern speech recognition systems use cepstral features characterizing the short-term spectrum of the speech signal for classifying frames into phonetic classes. These features are augmented with dynamic information from the adjacent frames to capture transient spectral events in the signal.
One Microphone Source Separation
Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as lCA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering.
Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
Parra, Lucas C., Spence, Clay, Sajda, Paul
We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1 Introduction Recently, considerable attention has been paid to understanding and modeling the non-Gaussian or "higher-order" properties of natural signals, particularly images. Several non-Gaussian properties have been identified and studied.
Factored Semi-Tied Covariance Matrices
A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters.
Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
III, John W. Fisher, Darrell, Trevor, Freeman, William T., Viola, Paul A.
People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a lowlevel, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a nonparametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation.
Speech Denoising and Dereverberation Using Probabilistic Models
Attias, Hagai, Platt, John C., Acero, Alex, Deng, Li
This paper presents a unified probabilistic framework for denoising and dereverberation of speech signals. The framework transforms the denoising and dereverberation problems into Bayes-optimal signal estimation. The key idea is to use a strong speech model that is pre-trained on a large data set of clean speech. Computational efficiency is achieved by using variational EM, working in the frequency domain, and employing conjugate priors. The framework covers both single and multiple microphones. We apply this approach to noisy reverberant speech signals and get results substantially better than standard methods.