
Collaborating Authors

 Krishnaswamy, Arvindh


PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

arXiv.org Machine Learning

Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings.

These tend to have very large weight matrices in the early layers, where the architecture could benefit from a more hierarchical development of features. On the other hand, in standard 2D U-Net models where kernels move in both the time and frequency directions [13], early layer activations are blind to what frequency they operate in - even in the case when padding is used, these early features' receptive fields have not yet reached the edges of the time-frequency image. Our proposed architecture has the advantages of both options: it is a 2D U-Net (with DenseNet blocks and self-attention) with small kernels, and can therefore develop features hierarchically, but can also take into account frequency information in early layers.
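
To make the frequency-positional embedding idea concrete, the sketch below concatenates deterministic channels that encode each bin's position along the frequency axis to a spectrogram input, so that even small early-layer kernels know which frequencies they are processing. This is only an illustration, not the paper's implementation: the sinusoidal encoding, the helper name add_frequency_positional_embedding, and the channel count are assumptions.

```python
import numpy as np

def add_frequency_positional_embedding(spec, num_embedding_channels=4):
    """Concatenate frequency-positional channels to a spectrogram.

    spec: array of shape (channels, freq_bins, time_frames).
    Returns an array with `num_embedding_channels` extra channels that
    encode each bin's position along the frequency axis.
    (Illustrative sketch; the encoding used in the paper may differ.)
    """
    _, freq_bins, time_frames = spec.shape
    # Normalized frequency position in [0, 1] for every bin.
    positions = np.linspace(0.0, 1.0, freq_bins)[None, :, None]      # (1, F, 1)
    embeddings = []
    for k in range(num_embedding_channels // 2):
        embeddings.append(np.sin(2 ** k * np.pi * positions))
        embeddings.append(np.cos(2 ** k * np.pi * positions))
    # Broadcast each embedding channel across all time frames.
    emb = np.concatenate(
        [np.repeat(e, time_frames, axis=2) for e in embeddings], axis=0
    )                                                                 # (E, F, T)
    return np.concatenate([spec, emb], axis=0)                        # (C+E, F, T)

# Example: a 2-channel (real/imaginary) spectrogram with 257 bins, 100 frames.
noisy_spec = np.random.randn(2, 257, 100).astype(np.float32)
augmented = add_frequency_positional_embedding(noisy_spec)
print(augmented.shape)  # (6, 257, 100)
```

Because the added channels depend only on the frequency index, the network stays fully convolutional in time while becoming frequency-aware from the very first layer.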


Channel-Attention Dense U-Net for Multichannel Speech Enhancement

arXiv.org Machine Learning

Traditionally, beamforming techniques have been employed, where a linear spatial filter is estimated, per frequency, to boost the signal from the desired target direction while attenuating interferences from other directions by utilizing second-order statistics, e.g., the spatial covariance of speech and noise [1]. In recent years, deep learning (DL) based supervised speech enhancement techniques have achieved significant success [2], specifically in the monaural/single-channel case. Motivated by this success, a recent line of work proposes to combine supervised single-channel techniques with unsupervised beamforming methods for the multichannel case [3, 4]. These approaches are broadly known as neural beamforming, where a neural network estimates the second-order statistics of speech and noise using estimated time-frequency (TF) masks, after which the beamformer is applied to linearly combine the multichannel mixture and produce clean speech. However, the performance of neural beamforming is limited by the nature of beamforming: a linear spatial filter per frequency bin. Another line of work [5, 6] proposes to use spatial features along with spectral information to estimate TF masks. Most of these approaches have an explicit step to extract spatial features such as the interchannel time/phase/level difference (ITD/IPD/ILD). (This work was done while B. Tolooshams and A. H. Song were interns at Amazon Web Services.)
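
The neural-beamforming pipeline the abstract contrasts itself with (network-estimated TF masks driving second-order statistics, followed by a linear per-frequency spatial filter) can be illustrated with a small mask-based MVDR sketch. The function name, the eigenvector-based steering estimate, and the diagonal regularization below are assumptions made for illustration, not the paper's proposed method.

```python
import numpy as np

def mask_based_mvdr(mixture_stft, speech_mask, noise_mask):
    """Illustrative mask-driven MVDR beamformer.

    mixture_stft: complex array of shape (mics, freq_bins, time_frames).
    speech_mask, noise_mask: real arrays of shape (freq_bins, time_frames),
        e.g. produced by a neural network.
    Returns a single-channel enhanced STFT of shape (freq_bins, time_frames).
    """
    M, F, T = mixture_stft.shape
    enhanced = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = mixture_stft[:, f, :]                                     # (M, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / max(speech_mask[f].sum(), 1e-8)
        phi_n = (noise_mask[f] * X) @ X.conj().T / max(noise_mask[f].sum(), 1e-8)
        phi_n += 1e-6 * np.eye(M)                                     # regularize inversion
        # Steering vector from the principal eigenvector of the speech covariance.
        _, eigvecs = np.linalg.eigh(phi_s)
        d = eigvecs[:, -1]
        # MVDR weights: w = (Phi_n^-1 d) / (d^H Phi_n^-1 d).
        num = np.linalg.solve(phi_n, d)
        w = num / (d.conj() @ num)
        enhanced[f] = w.conj() @ X                                    # linear per-frequency filter
    return enhanced

# Example usage with random data: 4 mics, 257 frequency bins, 100 frames.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 257, 100)) + 1j * rng.standard_normal((4, 257, 100))
mask = rng.uniform(size=(257, 100))
out = mask_based_mvdr(mix, speech_mask=mask, noise_mask=1.0 - mask)
print(out.shape)  # (257, 100)
```

Each frequency bin is processed independently with a single complex weight vector, which is exactly the per-frequency linear restriction the abstract identifies as the limiting factor of neural beamforming.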