Helwani, Karim


Sound Source Separation Using Latent Variational Block-Wise Disentanglement

arXiv.org Artificial Intelligence

While neural network approaches have made significant strides in resolving classical signal processing problems, hybrid approaches that draw insight from both signal processing and neural networks often produce more complete solutions. In this paper, we present a hybrid classical digital signal processing/deep neural network (DSP/DNN) approach to source separation (SS), highlighting the theoretical link between variational autoencoders and classical approaches to SS. We propose a system that transforms the single-channel under-determined SS task into an equivalent multichannel over-determined SS problem in a properly designed latent space. The separation task in the latent space is treated as finding a variational block-wise disentangled representation of the mixture. We show empirically that the design choices and the variational formulation of the task, motivated by classical signal processing theory, lead to robustness to unseen out-of-distribution data and a reduced risk of overfitting. To address the resulting permutation issue, we explicitly incorporate a novel differentiable permutation loss function and augment the model with a memory mechanism to keep track of the statistics of the individual sources.
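The permutation issue the abstract mentions arises because a separator may output the correct sources in the wrong order. As a minimal sketch of a permutation-aware objective, the function below scores estimates against references under the best source ordering (a PIT-style minimum over permutations); the paper's actual differentiable permutation loss is not reproduced here, and the function name is illustrative.

```python
import itertools

import numpy as np


def permutation_invariant_mse(estimates, references):
    """Minimum mean-squared error over all source orderings (PIT-style).

    A simple stand-in for a permutation-aware separation loss, NOT the
    paper's proposed differentiable permutation loss. `estimates` and
    `references` are equal-length lists of 1-D signal arrays.
    """
    n = len(estimates)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        # Average per-source MSE under this assignment of estimates to refs.
        err = np.mean([np.mean((estimates[i] - references[p]) ** 2)
                       for i, p in enumerate(perm)])
        best = min(best, err)
    return best
```

Because the hard minimum over permutations is non-differentiable at ties, practical systems (including, per the abstract, this paper) replace it with a smooth surrogate.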


Neural Harmonium: An Interpretable Deep Structure for Nonlinear Dynamic System Identification with Application to Audio Processing

arXiv.org Artificial Intelligence

Improving the interpretability of deep neural networks has recently gained increased attention, especially when the power of deep learning is leveraged to solve problems in physics. Interpretability helps us understand a model's ability to generalize and reveals its limitations. In this paper, we introduce a causal, interpretable deep structure for modeling dynamic systems. Our proposed model makes use of harmonic analysis by modeling the system in a time-frequency domain while maintaining high temporal and spectral resolution. Moreover, the model is built in an order-recursive manner, which allows for fast, robust, and exact second-order optimization without the need for an explicit Hessian calculation. To circumvent the resulting high dimensionality of the building blocks of our system, a neural network is designed to identify the frequency interdependencies. The proposed model is illustrated and validated on nonlinear system identification problems as required for audio signal processing tasks. Crowd-sourced experiments contrasting the performance of the proposed approach with other state-of-the-art solutions in an acoustic echo cancellation scenario confirm the effectiveness of our method for real-life applications.
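To make the time-frequency modeling idea concrete, here is a toy sketch of system identification in the frequency domain: for a linear system, each frequency bin is governed by a single complex gain, which can be estimated bin-wise by least squares. This is a hypothetical illustration only; the paper's order-recursive structure, nonlinear modeling, and the neural network for cross-frequency dependencies are not reproduced.

```python
import numpy as np

# Toy frequency-domain system identification: estimate a per-bin complex
# gain H[k] relating input spectra X to output spectra Y = X * H.
rng = np.random.default_rng(0)
n_frames, n_bins = 64, 8
X = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
H_true = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
Y = X * H_true  # linear system acts as a bin-wise multiplication

# Closed-form least-squares estimate per bin: H = sum(conj(X) Y) / sum(|X|^2)
H_est = (np.conj(X) * Y).sum(axis=0) / (np.abs(X) ** 2).sum(axis=0)
```

A nonlinear system couples frequency bins, which is exactly where the abstract's neural network for frequency interdependencies comes in.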


Learning Linear Groups in Neural Networks

arXiv.org Artificial Intelligence

Employing equivariance in neural networks leads to greater parameter efficiency and improved generalization performance through the encoding of domain knowledge in the architecture; however, the majority of existing approaches require an a priori specification of the desired symmetries. We present a neural network architecture, Linear Group Networks (LGNs), for learning linear groups acting on the weight space of neural networks. Linear groups are desirable due to their inherent interpretability, as they can be represented as finite matrices. LGNs learn groups without any supervision or knowledge of the hidden symmetries in the data, and the learned groups can be mapped to well-known operations in machine learning. We use LGNs to learn groups on multiple datasets while considering different downstream tasks; we demonstrate that the linear group structure depends on both the data distribution and the considered task.


PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

arXiv.org Machine Learning

Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger-scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings.

These tend to have very large weight matrices in the early layers, where the architecture could benefit from a more hierarchical development of features. On the other hand, in standard 2D U-Net models where kernels move in both the time and frequency directions [13], early-layer activations are blind to what frequency they operate in; even when padding is used, these early features' receptive fields have not yet reached the edges of the time-frequency image. Our proposed architecture has the advantages of both options: it is a 2D U-Net (with DenseNet blocks and self-attention) with small kernels, and can therefore develop features hierarchically, but can also take into account frequency information in early layers.
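One simple way to realize a frequency-positional embedding is to tag every time-frequency cell with its normalized bin index as an extra input channel, so that even small early-layer kernels can form frequency-dependent features. The sketch below is an assumption-laden illustration of that idea, not PoCoNet's exact embedding; the function name is hypothetical.

```python
import numpy as np


def add_freq_positional_channel(spec):
    """Concatenate a normalized frequency-index channel to an (F, T) spectrogram.

    A minimal sketch of a frequency-positional embedding: each cell carries
    its bin position in [0, 1] as a second channel, giving early layers
    direct access to frequency location. (PoCoNet's embedding may differ.)
    """
    n_freq, n_time = spec.shape
    pos = np.linspace(0.0, 1.0, n_freq)[:, None] * np.ones((1, n_time))
    return np.stack([spec, pos], axis=0)  # shape (2, F, T)
```

The resulting two-channel tensor can be fed to an ordinary 2D convolutional stack, which then distinguishes, say, low-frequency speech harmonics from high-frequency fricative energy from the first layer onward.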