Liu, Yuzong
On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation
Yang, Gene-Ping, Gu, Yue, Tang, Qingming, Du, Dongsu, Liu, Yuzong
Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
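As a rough, non-authoritative illustration of the dual-view cross-correlation objective mentioned above, the sketch below computes a Barlow Twins-style cross-correlation loss between teacher and student embeddings; the tensor shapes, function name, and off-diagonal weighting are assumptions made for the example, and the codebook-based objective from the paper is omitted.

```python
# Hedged sketch: cross-correlation distillation loss between teacher and
# student embeddings (hypothetical shapes and weighting).
import torch

def cross_correlation_distill_loss(student_emb, teacher_emb, off_diag_weight=5e-3):
    """student_emb, teacher_emb: (batch, dim) pooled embeddings."""
    # Standardize each embedding dimension across the batch.
    s = (student_emb - student_emb.mean(0)) / (student_emb.std(0) + 1e-6)
    t = (teacher_emb - teacher_emb.mean(0)) / (teacher_emb.std(0) + 1e-6)
    n, _ = s.shape
    c = (s.T @ t) / n                                   # (dim, dim) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # pull matched dimensions toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag

# Usage with random stand-ins for teacher/student outputs:
student = torch.randn(32, 256)
teacher = torch.randn(32, 256)
loss = cross_correlation_distill_loss(student, teacher)
```

In a teacher-student setup like the one described, such a loss would be combined with the codebook-matching objective and downstream keyword-spotting fine-tuning.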
Small-footprint slimmable networks for keyword spotting
Akhtar, Zuhaib, Khursheed, Mohammad Omar, Du, Dongsu, Liu, Yuzong
In this work, we present Slimmable Neural Networks applied to the problem of small-footprint keyword spotting. We show that slimmable neural networks allow us to create super-nets from Convolutional Neural Networks and Transformers, from which sub-networks of different sizes can be extracted. We demonstrate the usefulness of these models on in-house voice assistant data and Google Speech Commands, and focus our efforts on models for the on-device use case, limiting ourselves to less than 250k parameters.

Dynamic neural networks are another paradigm in which the network dynamically adapts its computation graph and parameters to different inputs, permitting a tradeoff between accuracy and inference efficiency [3]. Another notable work, the Once-for-All (OFA) network, was proposed in [4]; it allows one to train a super-network once and derive multiple sub-networks for different resource constraints. OFA also mitigates the large computational cost of conventional neural architecture search (NAS) by decoupling network training and search.
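To make the super-net/sub-network idea concrete, here is a minimal sketch (in PyTorch, with made-up layer and width choices) of a width-switchable layer from which sub-networks of different sizes can be sliced; it illustrates the general slimmable-network concept, not the models trained in the paper.

```python
# Hedged sketch of a width-switchable ("slimmable") linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    def __init__(self, in_features, max_out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out_features))
        self.width_mult = 1.0          # fraction of output units currently active

    def forward(self, x):
        out = int(self.weight.shape[0] * self.width_mult)
        # Sub-networks reuse the first `out` rows of the shared weight matrix.
        return F.linear(x, self.weight[:out], self.bias[:out])

layer = SlimmableLinear(64, 128)
x = torch.randn(8, 64)
for w in (0.25, 0.5, 1.0):             # extract sub-networks of different sizes
    layer.width_mult = w
    print(w, layer(x).shape)
```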
Self-supervised speech representation learning for keyword-spotting with light-weight transformers
Gao, Chenyang, Gu, Yue, Caliva, Francesco, Liu, Yuzong
Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL-related studies typically use large models, we employ light-weight networks to comply with the tight memory budgets of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters and propose a mechanism to enhance utterance-wise distinction, which proves crucial for improving performance on classification tasks. On the Google speech commands v2 dataset, the proposed method applied to the Auto-Regressive Predictive Coding S3RL led to a 1.2% accuracy improvement compared to training from scratch. On an in-house KS dataset with four different keywords, it provided 6% to 23.7% relative false accept improvement at a fixed false reject rate. We argue this demonstrates the applicability of S3RL approaches to light-weight models for KS and confirms S3RL is a powerful alternative to traditional supervised learning for resource-constrained applications.
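For readers unfamiliar with Auto-Regressive Predictive Coding, the following is a minimal sketch of that objective under assumed shapes and hyperparameters (a small GRU predicting feature frames a few steps ahead with an L1 loss); the light-weight transformer and the utterance-wise distinction mechanism from the paper are not reproduced here.

```python
# Hedged sketch of an APC-style objective: predict the frame `shift` steps ahead.
import torch
import torch.nn as nn

class TinyAPC(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        h, _ = self.rnn(feats)
        return self.proj(h)

def apc_loss(model, feats, shift=3):
    pred = model(feats[:, :-shift])            # predictions from past frames
    target = feats[:, shift:]                  # frames `shift` steps ahead
    return torch.mean(torch.abs(pred - target))  # L1 reconstruction loss

model = TinyAPC()
feats = torch.randn(4, 100, 40)                # stand-in for log-Mel features
loss = apc_loss(model, feats)
```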
Fixed-point quantization aware training for on-device keyword-spotting
Macha, Sashank, Oza, Om, Escott, Alex, Caliva, Francesco, Armitano, Robbie, Cheekatmalla, Santosh Kumar, Parthasarathi, Sree Hari Krishnan, Liu, Yuzong
Fixed-point (FXP) inference has proven suitable for embedded devices with limited computational resources, and yet model training is still performed in floating point (FLP). FXP training has not been fully explored, and the non-trivial conversion from FLP to FXP incurs an unavoidable performance drop. We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models. We combine our methodology with two quantization-aware-training (QAT) techniques - squashed weight distribution and absolute cosine regularization for model parameters - and propose techniques for extending QAT over transient variables, otherwise neglected by previous paradigms. Experimental results are reported on the Google Speech Commands v2 dataset.

Computational requirements can be reduced further using low-precision inference via quantization, which allows more operations per accessed memory byte [5, 7]. Such quantization is typically achieved by means of post-training quantization (PTQ) [8], which, however, causes severe information loss that affects model accuracy. To maintain overall accuracy for quantized DNNs, quantization can instead be incorporated into the training phase, leading to quantization-aware training (QAT). QAT introduces quantization noise during training by means of deterministic rounding [9, 10, 11], reparametrization [12, 13], or regularization [14, 15], among other techniques, allowing DNNs to adapt to inference quantization. Notable work has shown that with QAT, model parameters can be learned at binary and ternary precision [16, 17].
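As background on the QAT machinery referenced above, a generic fake-quantization sketch with a straight-through estimator is shown below; the bit width, scaling scheme, and function name are assumptions for illustration and do not implement the paper's squashed weight distribution or absolute cosine regularization.

```python
# Hedged sketch of quantization-aware training via fake quantization.
import torch

def fake_quantize(x, num_bits=8):
    """Simulate symmetric fixed-point rounding in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax + 1e-12
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward sees the identity function.
    return x + (x_q - x).detach()

w = torch.randn(16, 8, requires_grad=True)
w_q = fake_quantize(w, num_bits=8)     # use w_q in the layer's forward pass
w_q.sum().backward()                   # gradients still reach the full-precision w
```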
Decoding Ipsilateral Finger Movements from ECoG Signals in Humans
Liu, Yuzong, Sharma, Mohit, Gaona, Charles, Breshears, Jonathan, Roland, Jarod, Freudenburg, Zachary, Leuthardt, Eric, Weinberger, Kilian Q.
Several motor related Brain Computer Interfaces (BCIs) have been developed over the years that use activity decoded from the contralateral hemisphere to operate devices. Contralateral primary motor cortex is also the region most severely affected by hemispheric stroke. Recent studies have identified ipsilateral cortical activity in planning of motor movements and its potential implications for a stroke relevant BCI. The most fundamental functional loss after a hemispheric stroke is the loss of fine motor control of the hand. Thus, whether ipsilateral cortex encodes finger movements is critical to the potential feasibility of BCI approaches in the future. This study uses ipsilateral cortical signals from humans (using ECoG) to decode finger movements. We demonstrate, for the first time, successful finger movement detection using machine learning algorithms. Our results show high decoding accuracies in all cases which are always above chance. We also show that significant accuracies can be achieved with the use of only a fraction of all the features recorded and that these core features are consistent with previous physiological findings. The results of this study have substantial implications for advancing neuroprosthetic approaches to stroke populations not currently amenable to existing BCI techniques.
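Purely as a schematic of the decoding setup described (random stand-in data, not the study's ECoG recordings or its actual pipeline), movement detection can be framed as cross-validated classification over per-trial spectral features:

```python
# Hedged sketch: classify movement vs. rest from hypothetical band-power features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))       # stand-in for per-trial ECoG spectral features
y = rng.integers(0, 2, size=200)     # stand-in movement/rest labels

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")   # compare against chance level
```

Chance-level comparison and the feature-subset analysis reported in the study would be layered on top of such a baseline.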