cochlea
CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition
Ziogas, Ioannis, Alfalahi, Hessa, Khandoker, Ahsan H., Hadjileontiadis, Leontios J.
Self-supervised learning (SSL) for automated speech recognition in terms of its emotional content, can be heavily degraded by the presence noise, affecting the efficiency of modeling the intricate temporal and spectral informative structures of speech. Recently, SSL on large speech datasets, as well as new audio-specific SSL proxy tasks, such as, temporal and frequency masking, have emerged, yielding superior performance compared to classic approaches drawn from the image augmentation domain. Our proposed contribution builds upon this successful paradigm by introducing CochCeps-Augment, a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations. Specifically, we utilize the newly introduced bio-inspired cochlear cepstrogram (CCGRAM) to derive noise robust representations of input speech, that are then further refined through a self-supervised learning scheme. The latter employs SimCLR to generate contrastive views of a CCGRAM through masking of its angle and quefrency dimensions. Our experimental approach and validations on the emotion recognition K-EmoCon benchmark dataset, for the first time via a speaker-independent approach, features unsupervised pre-training, linear probing and fine-tuning. Our results potentiate CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis, showing the added value of incorporating bio-inspired masking as an informative augmentation task for self-supervision. Our code for implementing CochCeps-Augment will be made available at: https://github.com/GiannisZgs/CochCepsAugment.
Neuromorphic Auditory Perception by Neural Spiketrum
Tang, Huajin, Gu, Pengjie, Wijekoon, Jayawan, Alsakkal, MHD Anas, Wang, Ziming, Shen, Jiangrong, Yan, Rui
Neuromorphic computing holds the promise to achieve the energy efficiency and robust learning performance of biological neural systems. To realize the promised brain-like intelligence, it needs to solve the challenges of the neuromorphic hardware architecture design of biological neural substrate and the hardware amicable algorithms with spike-based encoding and learning. Here we introduce a neural spike coding model termed spiketrum, to characterize and transform the time-varying analog signals, typically auditory signals, into computationally efficient spatiotemporal spike patterns. It minimizes the information loss occurring at the analog-to-spike transformation and possesses informational robustness to neural fluctuations and spike losses. The model provides a sparse and efficient coding scheme with precisely controllable spike rate that facilitates training of spiking neural networks in various auditory perception tasks. We further investigate the algorithm-hardware co-designs through a neuromorphic cochlear prototype which demonstrates that our approach can provide a systematic solution for spike-based artificial intelligence by fully exploiting its advantages with spike-based computation.
MS-MT: Multi-Scale Mean Teacher with Contrastive Unpaired Translation for Cross-Modality Vestibular Schwannoma and Cochlea Segmentation
Zhao, Ziyuan, Xu, Kaixin, Yeo, Huai Zhe, Yang, Xulei, Guan, Cuntai
Domain shift has been a long-standing issue for medical image segmentation. Recently, unsupervised domain adaptation (UDA) methods have achieved promising cross-modality segmentation performance by distilling knowledge from a label-rich source domain to a target domain without labels. In this work, we propose a multi-scale self-ensembling based UDA framework for automatic segmentation of two key brain structures i.e., Vestibular Schwannoma (VS) and Cochlea on high-resolution T2 images. First, a segmentation-enhanced contrastive unpaired image translation module is designed for image-level domain adaptation from source T1 to target T2. Next, multi-scale deep supervision and consistency regularization are introduced to a mean teacher network for self-ensemble learning to further close the domain gap. Furthermore, self-training and intensity augmentation techniques are utilized to mitigate label scarcity and boost cross-modality segmentation performance. Our method demonstrates promising segmentation performance with a mean Dice score of 83.8% and 81.4% and an average asymmetric surface distance (ASSD) of 0.55 mm and 0.26 mm for the VS and Cochlea, respectively in the validation phase of the crossMoDA 2022 challenge.
Low-Level Physiological Implications of End-to-End Learning of Speech Recognition
de Gibson, Louise Coppieters, Garner, Philip N.
Current speech recognition architectures perform very well from the point of view of machine learning, hence user interaction. This suggests that they are emulating the human biological system well. We investigate whether the inference can be inverted to provide insights into that biological system; in particular the hearing mechanism. Using SincNet, we confirm that end-to-end systems do learn well known filterbank structures. However, we also show that wider band-width filters are important in the learned structure. Whilst some benefits can be gained by initialising both narrow and wide-band filters, physiological constraints suggest that such filters arise in mid-brain rather than the cochlea. We show that standard machine learning architectures must be modified to allow this process to be emulated neurally.
Perfecting pitch perception
New research from MIT neuroscientists suggests that natural soundscapes have shaped our sense of hearing, optimizing it for the kinds of sounds we most often encounter. In a study reported Dec. 14 in the journal Nature Communications, researchers led by McGovern Institute for Brain Research associate investigator Josh McDermott used computational modeling to explore factors that influence how humans hear pitch. Their model's pitch perception closely resembled that of humans -- but only when it was trained using music, voices, or other naturalistic sounds. Humans' ability to recognize pitch -- essentially, the rate at which a sound repeats -- gives melody to music and nuance to spoken language. Although this is arguably the best-studied aspect of human hearing, researchers are still debating which factors determine the properties of pitch perception, and why it is more acute for some types of sounds than others. McDermott, who is also an associate professor in MIT's Department of Brain and Cognitive Sciences, and an Investigator with the Center for Brains, Minds, and Machines (CBMM) at MIT, is particularly interested in understanding how our nervous system perceives pitch because cochlear implants, which send electrical signals about sound to the brain in people with profound deafness, don't replicate this aspect of human hearing very well.
Bayesian Logistic Shape Model Inference: application to cochlea image segmentation
Zihao, Wang, Thomas, Demarcy, Clair, Vandersteen, Dan, Gnansia, Charles, Raffaelli, Nicolas, Guevara, Hervé, Delingette
Incorporating shape information is essential for the delineation of many organs and anatomical structures in medical images. While previous work has mainly focused on parametric spatial transformations applied on reference template shapes, in this paper, we address the Bayesian inference of parametric shape models for segmenting medical images with the objective to provide interpretable results. The proposed framework defines a likelihood appearance probability and a prior label probability based on a generic shape function through a logistic function. A reference length parameter defined in the sigmoid controls the trade-off between shape and appearance information. The inference of shape parameters is performed within an Expectation-Maximisation approach where a Gauss-Newton optimization stage allows to provide an approximation of the posterior probability of shape parameters. This framework is applied to the segmentation of cochlea structures from clinical CT images constrained by a 10 parameter shape model. It is evaluated on three different datasets, one of which includes more than 200 patient images. The results show performances comparable to supervised methods and better than previously proposed unsupervised ones. It also enables an analysis of parameter distributions and the quantification of segmentation uncertainty including the effect of the shape model.
A Deep Learning based Fast Signed Distance Map Generation
Wang, Zihao, Vandersteen, Clair, Demarcy, Thomas, Gnansia, Dan, Raffaelli, Charles, Guevara, Nicolas, Delingette, Hervé
Signed distance map (SDM) is a common representation of surfaces in medical image analysis and machine learning. The computational complexity of SDM for 3D parametric shapes is often a bottleneck in many applications, thus limiting their interest. In this paper, we propose a learning based SDM generation neural network which is demonstrated on a tridimensional cochlea shape model parameterized by 4 shape parameters. The proposed SDM Neural Network generates a cochlea signed distance map depending on four input parameters and we show that the deep learning approach leads to a 60 fold improvement in the time of computation compared to more classical SDM generation methods. Therefore, the proposed approach achieves a good trade-off between accuracy and efficiency.
Smart bracelet manipulates bone-conducing tech
A new smart bracelet allows you to use your index finger as a phone using technology that conducts sound via vibrations through your wrist bone. The Get bracelet, which costs £200 ($250)connects to your smartphone and translates the sound from your device into vibrations, conducted into the fingers. Users just have to stick a finger in their ear to speak on the phone, and make outgoing calls by using the bracelet's voice recognition technology. Because the device uses vibration only, instead of sound, conversations can't be overheard by people nearby. Get has no buttons, and no screen, but uses your voice and gestures to control its features, according to its Italian inventors.
Active Bidirectional Coupling in a Cochlear Chip
We present a novel cochlear model implemented in analog very large scale integration (VLSI) technology that emulates nonlinear active cochlear behavior. This silicon cochlea includes outer hair cell (OHC) electromotility through active bidirectional coupling (ABC), a mechanism we proposed in which OHC motile forces, through the microanatomical organization of the organ of Corti, realize the cochlear amplifier. Our chip measurements demonstrate that frequency responses become larger and more sharply tuned when ABC is turned on; the degree of the enhancement decreases with input intensity as ABC includes saturation of OHC forces.
Active Bidirectional Coupling in a Cochlear Chip
We present a novel cochlear model implemented in analog very large scale integration (VLSI) technology that emulates nonlinear active cochlear behavior. This silicon cochlea includes outer hair cell (OHC) electromotility through active bidirectional coupling (ABC), a mechanism weproposed in which OHC motile forces, through the microanatomical organizationof the organ of Corti, realize the cochlear amplifier. Our chip measurements demonstrate that frequency responses become larger and more sharply tuned when ABC is turned on; the degree ofthe enhancement decreases with input intensity as ABC includes saturation of OHC forces.