Goto

Collaborating Authors

 Country


A Classification-based Cocktail-party Processor

Neural Information Processing Systems

At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID.


One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals

Neural Information Processing Systems

Speech dereverberation is desirable with a view to achieving, for example, robust speech recognition in the real world. However, it is still a challenging problem, especially when using a single microphone. Although blind equalization techniques have been exploited, they cannot deal with speech signals appropriately because their assumptions are not satisfied by speech signals. We propose a new dereverberation principle based on an inherent property of speech signals, namely quasi-periodicity. The present methods learn the dereverberation filter from a lot of speech data with no prior knowledge of the data, and can achieve high quality speech dereverberation especially when the reverberation time is long.


Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

Neural Information Processing Systems

Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28-33% while the standard eigenvoice approach can only match the performance of the speaker-independent model.


Probabilistic Inference of Speech Signals from Phaseless Spectrograms

Neural Information Processing Systems

Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram.


A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Neural Information Processing Systems

Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM's. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM's and the generative classifiers in speaker identification and image classification.


Phonetic Speaker Recognition with Support Vector Machines

Neural Information Processing Systems

A recent area of significant progress in speaker recognition is the use of high level features--idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.


Prediction on Spike Data Using Kernel Algorithms

Neural Information Processing Systems

We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms.


Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

Neural Information Processing Systems

Decoding is a strategy that allows us to assess the amount of information neurons can provide about certain aspects of the visual scene. In this study, we develop a method based on Bayesian sequential updating and the particle filtering algorithm to decode the activity of V1 neurons in awake monkeys. A distinction in our method is the use of Volterra kernels to filter the particles, which live in a high dimensional space. This parametric Bayesian decoding scheme is compared to the optimal linear decoder and is shown to work consistently better than the linear optimal decoder. Interestingly, our results suggest that for decoding in real time, spike trains of as few as 10 independent but similar neurons would be sufficient for decoding a critical scene variable in a particular class of visual stimuli. The reconstructed variable can predict the neural activity about as well as the actual signal with respect to the Volterra kernels.


A Probabilistic Model of Auditory Space Representation in the Barn Owl

Neural Information Processing Systems

The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.


Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

Neural Information Processing Systems

When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data.