 phoneme classification


LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

Neural Information Processing Systems

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings -- 5× larger than the next comparable dataset and 50× larger than most. This unprecedented 'depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.


MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

arXiv.org Artificial Intelligence

For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks.
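Two of the techniques named in this abstract, inverse-square-root class weighting and instance-level normalization, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: the function names are hypothetical, the weights are rescaled to average 1, and normalization is taken to mean z-scoring each channel of a single example with that example's own mean and standard deviation.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def inverse_sqrt_class_weights(labels):
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1.

    Assumption: the rescaling is our choice; the abstract only names the
    inverse-square-root scheme.
    """
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}


def instance_normalize(example, eps=1e-8):
    """Z-score each channel of one example using that example's own statistics.

    `example` is a list of channels, each a list of samples over time.
    Assumption: per-channel statistics; the paper may normalize differently.
    """
    normalized = []
    for channel in example:
        mu = fmean(channel)
        sigma = pstdev(channel)
        normalized.append([(x - mu) / (sigma + eps) for x in channel])
    return normalized


# Toy data: a frequent and a rare phoneme label, and a 2-channel example.
labels = ["aa"] * 100 + ["zh"] * 4
weights = inverse_sqrt_class_weights(labels)

example = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
normalized = instance_normalize(example)
```

With this scheme a class that is 25 times rarer receives a weight only 5 times larger, which is gentler than plain inverse-frequency weighting. Because the normalization uses no statistics from the training set, it is unaffected by a shift in signal scale between the training and holdout recordings.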


LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

arXiv.org Artificial Intelligence

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings -- 5× larger than the next comparable dataset and 50× larger than most. This unprecedented 'depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.


Understanding Probe Behaviors through Variational Bounds of Mutual Information

arXiv.org Artificial Intelligence

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information.


An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification

Neural Information Processing Systems

An artificial neural network is developed to recognize spatio-temporal bipolar patterns associatively. The function of a formal neuron is generalized by replacing multiplication with convolution, weights with transfer functions, and thresholding with nonlinear transform following adaptation. The Hebbian learning rule and the delta learning rule are generalized accordingly, resulting in the learning of weights and delays. The neural network which was first developed for spatial patterns was thus generalized for spatio-temporal patterns. It was tested using a set of bipolar input patterns derived from speech signals, showing robust classification of 30 model phonemes.


Ensemble Methods for Phoneme Classification

Neural Information Processing Systems

This paper investigates a number of ensemble methods for improving the performance of phoneme classification for use in a speech recognition system. Two ensemble methods are described: boosting and mixtures of experts, both in isolation and in combination. Results are presented on two speech recognition databases: an isolated word database and a large vocabulary continuous speech database. These results show that principled ensemble methods such as boosting and mixtures provide superior performance to more naive ensemble methods such as averaging.
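For orientation, the naive averaging baseline mentioned in this abstract can be sketched as follows. This is an illustrative assumption, not the paper's setup: each ensemble member is taken to output a posterior distribution over phoneme classes for one frame, the function names are hypothetical, and the toy numbers are made up.

```python
def average_posteriors(member_posteriors):
    """Average class posteriors across ensemble members with equal weight."""
    n_members = len(member_posteriors)
    n_classes = len(member_posteriors[0])
    return [
        sum(member[k] for member in member_posteriors) / n_members
        for k in range(n_classes)
    ]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Three hypothetical classifiers scoring one frame over three phoneme classes.
members = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
averaged = average_posteriors(members)
predicted = argmax(averaged)
```

Every member gets the same weight regardless of how reliable it is on a given input. Boosting and mixtures of experts instead learn how much each member should contribute.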


Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Neural Information Processing Systems

This paper describes a new acoustic model based on variational Gaussian process dynamical system (VGPDS) for phoneme classification. The proposed model overcomes the limitations of the classical HMM in modeling the real speech data, by adopting a nonlinear and nonparametric model. In our model, the GP prior on the dynamics function enables representing the complex dynamic structure of speech, while the GP prior on the emission function successfully models the global dependency over the observations. Additionally, we introduce variance constraint to the original VGPDS for mitigating sparse approximation error of the kernel matrix. The effectiveness of the proposed model is demonstrated with extensive experimental results including parameter estimation, classification performance on the synthetic and benchmark datasets.


Review -- Bidirectional LSTM

#artificialintelligence

[2005 IJCNN] [Bidirectional LSTM (BLSTM)] Framewise Phoneme Classification with Bidirectional LSTM Networks [2005 ICANN] [Bidirectional LSTM (BLSTM)] Bidirectional LSTM Networks for Improved Phoneme…


Semi-supervised Learning with Sparse Autoencoders in Phone Classification

arXiv.org Machine Learning

We propose the application of a semi-supervised learning method to improve the performance of acoustic modelling for automatic speech recognition based on deep neural networks. As opposed to unsupervised initialisation followed by supervised fine-tuning, our method takes advantage of both unlabelled and labelled data simultaneously through mini-batch stochastic gradient descent. We tested the method with varying proportions of labelled vs unlabelled observations in frame-based phoneme classification on the TIMIT database. Our experiments show that the method outperforms standard supervised training for an equal amount of labelled data and provides competitive error rates compared to state-of-the-art graph-based semi-supervised learning techniques.