
WhAM: Towards A Translative Model of Sperm Whale Vocalization

Paradise, Orr, Muralikrishnan, Pranav, Chen, Liangyuan, García, Hugo Flores, Pardo, Bryan, Diamant, Roee, Gruber, David F., Gero, Shane, Goldwasser, Shafi

arXiv.org Artificial Intelligence

Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
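The iterative masked token prediction described above can be sketched as follows. This is a minimal numpy illustration of VampNet-style decoding, not WhAM's actual implementation: the model stand-in, vocabulary size, sequence length, and unmasking schedule are all hypothetical.

```python
import numpy as np

MASK = -1        # sentinel for masked positions
VOCAB = 1024     # acoustic token codebook size (hypothetical)
SEQ_LEN = 64     # tokens per generated coda clip (hypothetical)

rng = np.random.default_rng(0)

def predict_logits(tokens):
    """Stand-in for the trained transformer: per-position logits over the
    codebook. A real model would condition on unmasked tokens and the prompt."""
    return rng.standard_normal((len(tokens), VOCAB))

def iterative_decode(steps=8):
    tokens = np.full(SEQ_LEN, MASK)
    for step in range(steps):
        logits = predict_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        conf = probs.max(axis=1)      # model confidence per position
        pred = probs.argmax(axis=1)   # most likely token per position
        masked = tokens == MASK
        if not masked.any():
            break
        # unmask a growing fraction of the most confident masked positions
        n_keep = max(1, int(masked.sum() * (step + 1) / steps))
        idx = np.argsort(-conf * masked)[:n_keep]
        tokens[idx] = pred[idx]
    return tokens

out = iterative_decode()
```

Each pass commits only the predictions the model is most confident about, so later passes can condition on an increasingly complete token sequence before the final audio is decoded from the tokens.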


Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior

Yemini, Yochai, Ben-Ari, Rami, Gannot, Sharon, Fetaya, Ethan

arXiv.org Artificial Intelligence

In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the speech distribution, we enable effective decomposition through the inverse problem paradigm. We perform speech separation by sampling from the posterior distributions via a reverse diffusion process, which directly estimates and removes the modelled noise component to recover clean constituent signals. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling approach in challenging acoustic environments.
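The posterior sampling idea can be sketched with an annealed Langevin-style loop, in which learned score models for speech and structured noise are coupled through the mixture observation. Everything below is a toy stand-in, not the paper's method: the score functions, step sizes, and noise schedule are hypothetical placeholders for the trained audio-visual and noise score networks.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
y = rng.standard_normal(T)   # observed noisy mixture (stand-in)
sigma = 0.5                  # measurement noise level (hypothetical)

def speech_score(x, t):
    """Stand-in for the audio-visual score model, approx. grad log p(x)."""
    return -x                # score of a unit Gaussian prior

def noise_score(n, t):
    """Stand-in for the learned structured-noise score model."""
    return -n

# jointly sample speech x and noise n from their posteriors given y
x = rng.standard_normal(T)
n = rng.standard_normal(T)
step = 1e-2
for t in np.linspace(1.0, 0.01, 50):
    # likelihood term couples both components through y ~ x + n
    resid = (y - x - n) / sigma**2
    x = x + step * (speech_score(x, t) + resid) \
          + np.sqrt(2 * step) * t * rng.standard_normal(T)
    n = n + step * (noise_score(n, t) + resid) \
          + np.sqrt(2 * step) * t * rng.standard_normal(T)

x_hat = x   # estimated clean speech
```

The key structural point survives the simplification: because the noise distribution is modelled explicitly, the reverse process estimates and removes the noise component rather than treating it as an unmodelled residual.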


Microsoft is replacing human gamers (and even games) with AI

PCWorld

In the future, Microsoft suggests, you may be playing against AI. No, not on the battlefield, but in games that use AI to simulate the entire game itself. As a first step, Microsoft has developed an AI model, called WHAM, that "beta tests" games early in the development cycle using AI instead of human players. Gamers know that realistic AI can turn a good game into something great, like how the older F.E.A.R. games would realistically model how soldiers might react to a hostile, armed player. Microsoft's World and Human Action Model (WHAM) takes the opposite approach: it tries to figure out how human players will react in a given situation, right down to a specific frame or setup within the existing game world.


Resource-Efficient Separation Transformer

Della Libera, Luca, Subakan, Cem, Ravanelli, Mirco, Cornell, Samuele, Lepoutre, Frédéric, Grondin, François

arXiv.org Artificial Intelligence

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require many learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer achieves competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.
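The two cost reductions can be illustrated in a few lines of numpy. The chunk length, dimensions, and the mean-pooling summary below are illustrative assumptions; the paper's summary operator and attention layers are more elaborate.

```python
import numpy as np

T, D = 8000, 64   # latent sequence length and feature dim (illustrative)
CHUNK = 250       # non-overlapping chunk length (hypothetical)

x = np.random.default_rng(0).standard_normal((T, D))

# 1) split the latent sequence into NON-overlapping chunks; with 50% overlap
#    (as in the original SepFormer) there would be twice as many frames
chunks = x.reshape(T // CHUNK, CHUNK, D)   # (n_chunks, CHUNK, D)

# 2) one compact summary vector per chunk (mean over time, as a stand-in)
summaries = chunks.mean(axis=1)            # (n_chunks, D)

# inter-chunk self-attention now runs over n_chunks = 32 summaries
# instead of all 8000 latent frames, shrinking the attention matrix
# from 8000 x 8000 to 32 x 32
n_attended = summaries.shape[0]
```

Since self-attention cost grows quadratically with sequence length, attending over 32 summaries instead of 8000 frames is what buys the memory and inference-time savings the abstract reports.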


RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

Tzinis, Efthymios, Adi, Yossi, Ithapu, Vamsi Krishna, Xu, Buye, Smaragdis, Paris, Kumar, Anurag

arXiv.org Artificial Intelligence

We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without requiring any isolated in-domain speech or noise waveforms. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a teacher model pre-trained on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. In turn, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model and applied to any semi-supervised or unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inner workings of our self-training scheme, wherein the student model keeps improving while observing severely degraded pseudo-targets.
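The bootstrapped remixing step at the heart of RemixIT can be sketched as follows. The teacher below is a trivial stand-in for the pre-trained out-of-domain separator, and all shapes are illustrative; only the permute-and-remix logic reflects the described method.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T = 4, 16000   # batch of in-domain mixtures, 1 s at 16 kHz (illustrative)
mixtures = rng.standard_normal((B, T))

def teacher_separate(mix):
    """Stand-in for the pre-trained teacher: returns
    (estimated_speech, estimated_noise) for each mixture."""
    s_hat = 0.7 * mix
    return s_hat, mix - s_hat

s_hat, n_hat = teacher_separate(mixtures)

# bootstrap: permute the estimated noises across the batch and remix.
# The new mixtures have known pseudo-targets by construction, even though
# no clean in-domain speech or noise was ever observed.
perm = rng.permutation(B)
new_mixtures = s_hat + n_hat[perm]
pseudo_targets = s_hat   # the student learns to recover these

# a student update would then be: loss(student(new_mixtures), pseudo_targets)
```

Periodically copying the student's improved weights back into the teacher closes the loop, which is what makes the scheme "continual" rather than a one-shot distillation.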


Continual self-training with bootstrapped remixing for speech enhancement

Tzinis, Efthymios, Adi, Yossi, Ithapu, Vamsi K., Xu, Buye, Kumar, Anurag

arXiv.org Artificial Intelligence

We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuous self-training scheme that overcomes limitations of previous studies, including assumptions about the in-domain noise distribution and access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets, while we periodically update the teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods on multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement, while being general enough to be applied to any separation task and paired with any separation model.


WHAM!: Extending Speech Separation to Noisy Environments

Wichern, Gordon, Antognini, Joe, Flynn, Michael, Zhu, Licheng Richard, McQuinn, Emmett, Crow, Dwight, Manilow, Ethan, Le Roux, Jonathan

arXiv.org Machine Learning

Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we strive to move the field towards more realistic and challenging scenarios. To that end, we created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset, consisting of two speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area, and are made publicly available. We benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, we still observe substantial gains relative to the noisy signals for most approaches.
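The core dataset construction step (adding real ambient noise to a two-speaker mixture at a chosen signal-to-noise ratio) can be sketched like this. The signals below are random stand-ins for the wsj0-2mix utterances and recorded noise, and the SNR convention is an illustrative assumption rather than the exact WHAM! recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 32000                               # 2 s at 16 kHz (illustrative)
s1, s2 = rng.standard_normal((2, T))    # two speaker utterances (stand-ins)
noise = rng.standard_normal(T)          # ambient recording (cafe, bar, ...)

def mix_at_snr(speech_mix, noise, snr_db):
    """Scale the noise so that the ratio of speech-mixture power to
    noise power equals snr_db, then add it to the mixture."""
    p_speech = np.mean(speech_mix ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech_mix + scale * noise

speech_mix = s1 + s2                    # the clean two-speaker mixture
noisy = mix_at_snr(speech_mix, noise, snr_db=5.0)
```

Sweeping `snr_db` over a range produces mixtures of varying difficulty, which is what lets the benchmark measure how gracefully each separation architecture degrades as the background noise grows.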