AITopics | conv-tasnet

conv-tasnet

Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet

Venkatesh, Satvik, Benilov, Arthur, Coleman, Philip, Roskam, Frederic

arXiv.org Artificial Intelligence, Feb-27-2024

There have been significant advances in deep learning for music demixing in recent years. However, little attention has been given to how these neural networks can be adapted for real-time low-latency applications, which could be helpful for hearing aids, remixing audio streams and live shows. In this paper, we investigate the various challenges involved in adapting current demixing models in the literature for this use case. Subsequently, inspired by the Hybrid Demucs architecture, we propose the Hybrid Spectrogram Time-domain Audio Separation Network (HS-TasNet), which utilises the advantages of the spectral and waveform domains. For a latency of 23 ms, the HS-TasNet obtains an overall signal-to-distortion ratio (SDR) of 4.65 on the MusDB test set, which increases to 5.55 with additional training data. These results demonstrate the potential of efficient demixing for real-time low-latency music applications.
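As a rough illustration of the two quantities quoted in this abstract, the sketch below computes a basic signal-to-distortion ratio and converts a frame size to milliseconds. It is a simplified energy-ratio definition, not the benchmark's official evaluation procedure, and the pairing of 1024 samples with a 44.1 kHz sample rate is an assumption chosen only because it gives roughly 23 ms. The function names `sdr_db` and `latency_ms` are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Simplified definition for illustration; benchmark toolkits apply
    additional steps (for example windowing and aggregation over tracks).
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf   # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference, any error dominates
    return 10.0 * math.log10(signal_energy / error_energy)


def latency_ms(frame_samples, sample_rate_hz):
    """Duration of one processing frame in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz


# Assumed values: a 1024-sample frame at 44.1 kHz lasts about 23 ms.
print(round(latency_ms(1024, 44100), 2))
print(round(sdr_db([1.0, 0.0, -1.0, 0.0], [0.9, 0.0, -0.9, 0.0]), 2))
```

An estimate whose error is ten times smaller in amplitude than the reference scores 20 dB under this definition, which gives a sense of scale for the single-digit values reported above.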

separation, source separation, tasnet, (14 more...)

arXiv.org Artificial Intelligence

2402.17701

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine (0.75)
Media (0.66)
Leisure & Entertainment (0.48)

Technology:

Information Technology > Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)

Resource-constrained stereo singing voice cancellation

Borrelli, Clara, Rae, James, Basaran, Dogac, McVicar, Matt, Souden, Mehrez, Mauch, Matthias

arXiv.org Artificial Intelligence, Jan-22-2024

We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning model parameters and expanding the training set. Moreover, we highlight the benefits a stereo model brings by introducing a new metric which detects attenuation inconsistencies between channels. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial, confirming the effectiveness of our techniques in stringent listening tests.

dataset, separation, source separation, (15 more...)

arXiv.org Artificial Intelligence

2401.12068

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems

Gonzalez, Philippe, Alstrøm, Tommy Sonne, May, Tobias

arXiv.org Artificial Intelligence, Mar-31-2023

The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive for a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these strategies on resource utilization and, more importantly, network performance is not well documented. This paper systematically investigates the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
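The trade-off between zero-padding and randomization described here can be sketched in a few lines. The example below is a minimal, assumed version of sorted batching with a dynamic batch size: utterances are sorted by duration, packed until the padded batch would exceed a duration budget, and the resulting batches are shuffled. The names `sorted_dynamic_batches`, `padding_fraction` and `max_batch_seconds` are hypothetical, and the paper's actual implementation may differ.

```python
import random


def sorted_dynamic_batches(durations, max_batch_seconds, seed=0):
    """Group utterance indices into batches whose padded duration fits a budget.

    Sorting keeps similar lengths together (little zero-padding); shuffling
    the finished batches restores some randomness between training steps.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, current = [], []
    for idx in order:
        # Ascending order: the newest item is the longest one in the batch,
        # so every item would be padded to its duration.
        padded_total = durations[idx] * (len(current) + 1)
        if current and padded_total > max_batch_seconds:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    random.Random(seed).shuffle(batches)
    return batches


def padding_fraction(durations, batch):
    """Share of a padded batch that consists of zero-padding."""
    longest = max(durations[i] for i in batch)
    return 1.0 - sum(durations[i] for i in batch) / (longest * len(batch))


durations = [1.0, 9.0, 1.2, 8.5, 1.1, 9.5]  # seconds, made-up values
for batch in sorted_dynamic_batches(durations, max_batch_seconds=20.0):
    print(batch, round(padding_fraction(durations, batch), 3))
```

With a fixed batch size and random grouping, a 1 s utterance batched with a 9.5 s one would be mostly padding; here short utterances share a batch and long ones form smaller batches.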

batch size, sequence, training time, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10097075

2301.10587

Country:

Europe > Denmark (0.14)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.50)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Scaling strategies for on-device low-complexity source separation with Conv-Tasnet

Ali, Mohamed Nabih, Paissan, Francesco, Falavigna, Daniele, Brutti, Alessio

arXiv.org Artificial Intelligence, Mar-6-2023

Recently, several very effective neural approaches for single-channel speech separation have been presented in the literature. However, due to the size and complexity of these models, their use on low-resource devices, e.g. hearing aids and earphones, is still a challenge and established solutions are not yet available. Although approaches based on either pruning or compressing neural models have been proposed, the design of a model architecture suitable for a certain application domain often requires heuristic procedures not easily portable to different low-resource platforms. Given the modular nature of the well-known Conv-Tasnet speech separation architecture, in this paper we consider three parameters that directly control the overall size of the model, namely the number of residual blocks, the number of repetitions of the separation blocks and the number of channels in the depth-wise convolutions, and experimentally evaluate how they affect the speech separation performance. In particular, experiments carried out on Libri2Mix show that the number of dilated 1D-Conv blocks is the most critical parameter and that the use of extra-dilation in the residual blocks reduces the performance drop.
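One way to see why the number of dilated blocks matters is to count how much temporal context a stack of dilated convolutions covers. The sketch below assumes the common layout in which dilation doubles inside each repetition (1, 2, 4, ...) and resets for the next one; the function name and the example values are illustrative assumptions, and the extra-dilation variant mentioned in the abstract is not modelled.

```python
def tcn_receptive_field(blocks_per_repeat, repeats, kernel_size=3):
    """Receptive field, in encoder frames, of stacked dilated 1-D convolutions.

    Assumes dilation 2**b for block b within a repetition, restarting at 1
    in every repetition. Each block widens the field by (kernel_size - 1)
    times its dilation.
    """
    field = 1
    for _ in range(repeats):
        for block in range(blocks_per_repeat):
            field += (kernel_size - 1) * (2 ** block)
    return field


# Illustrative configurations (assumed values, not taken from the paper).
for blocks, reps in [(8, 3), (6, 3), (4, 3), (8, 1)]:
    print(blocks, reps, tcn_receptive_field(blocks, reps))
```

Under these assumptions, removing blocks shrinks the context exponentially, whereas removing repetitions shrinks it only linearly, which is one plausible reading of why the block count is reported as the most critical parameter.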

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2303.03005

Country:

Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)
South America > Uruguay > Maldonado > Maldonado (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Liu, Xiaoyu, Li, Xu, Serrà, Joan

arXiv.org Artificial Intelligence, Oct-26-2022

Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification ...

FBANK, as a simple signal processing method, has been ignored as an enrollment option in previous literature. SSL are a class of powerful models that learn problem-agnostic speech features from unlabelled data [12-14], and we hypothesize that such broader information (compared to SID) could benefit TSS enrollment. Note that, unlike [15], which uses SSL as the input mixture features for blind speaker separation, we limit SSL to offline processing the enrollment utterance, since TSS often requires real-time low-complexity processing for the mixtures [2-5]. Finally, we consider a cross-dataset evaluation to assess the generalization of the enrollment embeddings [16], which is another important but overlooked aspect in previous TSS research.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2210.12635

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Individualized Conditioning and Negative Distances for Speaker Separation

Sun, Tao, Abuhajar, Nidal, Gong, Shuyu, Wang, Zhewei, Smith, Charles D., Wang, Xianhui, Xu, Li, Liu, Jundong

arXiv.org Artificial Intelligence, Oct-12-2022

Speaker separation aims to extract multiple voices from a mixed signal. In this paper, we propose two speaker-aware designs to improve the existing speaker separation solutions. The first model is a speaker conditioning network that integrates speech samples to generate individualized speaker conditions, which then provide informed guidance for a separation module to produce well-separated outputs. The second design aims to reduce non-target voices in the separated speech. To this end, we propose negative distances to penalize the appearance of any non-target voice in the channel outputs, and positive distances to drive the separated voices closer to the clean targets. We explore two different setups, weighted-sum and triplet-like, to integrate these two distances to form a combined auxiliary loss for the separation networks. Experiments conducted on LibriMix demonstrate the effectiveness of our proposed models.
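The two ways of combining positive and negative distances can be illustrated with a small sketch. Everything below is an assumption made for illustration: mean squared error stands in for the distance, and the weight, the margin and the function names are hypothetical. The paper's actual distance measure and loss formulation are not given in this abstract.

```python
def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def weighted_sum_loss(pos_dist, neg_dist, neg_weight=0.5):
    """Reward closeness to the target, penalize closeness to other voices."""
    return pos_dist - neg_weight * neg_dist


def triplet_like_loss(pos_dist, neg_dist, margin=1.0):
    """Hinge form: zero once the non-target is farther than target + margin."""
    return max(0.0, pos_dist - neg_dist + margin)


separated = [1.0, 0.0]    # one output channel (made-up values)
clean_target = [1.0, 0.2]  # the voice this channel should contain
non_target = [0.0, 1.0]    # a voice that should not appear in this channel

pos = mse(separated, clean_target)  # positive distance, to be minimized
neg = mse(separated, non_target)    # negative distance, to be maximized
print(weighted_sum_loss(pos, neg), triplet_like_loss(pos, neg))
```

The weighted-sum form keeps pushing non-target voices away without limit, while the hinge in the triplet-like form stops contributing once the output is sufficiently far from them.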

artificial intelligence, machine learning, separation, (15 more...)

arXiv.org Artificial Intelligence

2210.06368

Country:

North America > United States > Kentucky > Fayette County > Lexington (0.14)
North America > United States > Ohio > Athens County > Athens (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)

Genre: Research Report (0.50)

Industry:

Media (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

On permutation invariant training for speech source separation

Liu, Xiaoyu, Pons, Jordi

arXiv.org Artificial Intelligence, Feb-9-2021

We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models.

... Deep CASA, a spectrogram-based model, to Conv-TasNet, which uses very short waveform frames (such as 2 ms). We find that tPIT based on such short waveform frames can be challenging. Therefore, we propose performing tPIT in a pre-trained latent space, which allows for a more meaningful feature space for tPIT than the short waveform frames. Further, when training the clustering model, Deep CASA employs a memory and computationally expensive pairwise similarity loss that does not scale for waveform inputs. We propose a loss that reduces the complexity from quadratic to linear, making ...
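The permutation ambiguity that PIT addresses can be shown with a minimal sketch: the loss is evaluated for every assignment of outputs to targets and the smallest value is kept. The version below works on whole signals with a mean squared error, both of which are simplifying assumptions; frame-level PIT (tPIT) as described above would apply the same minimum independently to each frame, and the function name `pit_loss` is hypothetical.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, targets):
    """Return (loss, perm) for the best output-to-target assignment.

    perm[i] is the index of the target assigned to estimate i. The search is
    exhaustive, so the cost grows factorially with the number of sources.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


targets = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
estimates = [[-0.9, -1.0, -1.1], [1.0, 1.0, 1.0]]  # speakers in swapped order
loss, perm = pit_loss(estimates, targets)
print(perm, round(loss, 4))
```

Because the network's output order is arbitrary, a fixed assignment would penalize the swapped but otherwise accurate estimates above; taking the minimum over permutations removes that penalty.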

conv-tasnet, permutation error, separation, (12 more...)

arXiv.org Artificial Intelligence

2102.04945

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)

Genre: Research Report (1.00)

Industry: Government > Regional Government (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Nonlinear Residual Echo Suppression Based on Multi-stream Conv-TasNet

Chen, Hongsheng, Xiang, Teng, Chen, Kai, Lu, Jing

arXiv.org Machine Learning, May-15-2020

Acoustic echo cannot be entirely removed by linear adaptive filters due to the nonlinear relationship between the echo and the far-end signal. Usually a post-processing module is required to further suppress the echo. In this paper, we propose a residual echo suppression method based on a modification of the fully convolutional time-domain audio separation network (Conv-TasNet). Both the residual signal of the linear acoustic echo cancellation system and the output of the adaptive filter are adopted to form multiple streams for the Conv-TasNet, resulting in more effective echo suppression while keeping a lower latency of the whole system. Simulation results validate the efficacy of the proposed method in both single-talk and double-talk situations.

artificial intelligence, machine learning, operation, (17 more...)

arXiv.org Machine Learning

2005.07631

Country:

North America > United States (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > India (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Music Source Separation in the Waveform Domain

Défossez, Alexandre, Usunier, Nicolas, Bottou, Léon, Bach, Francis

arXiv.org Machine Learning, Nov-27-2019

Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrary to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state-of-the-art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we first show that an adaptation of Conv-Tasnet (Luo & Mesgarani, 2019), a waveform-to-waveform model for source separation for speech, significantly beats the state-of-the-art on the MusDB dataset, the standard benchmark of multi-instrument source separation. Second, we observe that Conv-Tasnet follows a masking approach on the input signal, which has the potential drawback of removing parts of the relevant source without the capacity to reconstruct it. We propose Demucs, a new waveform-to-waveform model, which has an architecture closer to models for audio generation with more capacity on the decoder. Experiments on the MusDB dataset show that Demucs beats previously reported results in terms of signal to distortion ratio (SDR), but scores lower than Conv-Tasnet. Human evaluations show that Demucs has significantly higher quality (as assessed by mean opinion score) than Conv-Tasnet, but slightly more contamination from other sources, which explains the difference in SDR. Additional experiments with a larger dataset suggest that the gap in SDR between Demucs and Conv-Tasnet shrinks, showing that our approach is promising.

conv-tasnet, separation, source separation, (16 more...)

arXiv.org Machine Learning

1911.13254

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
