acoustical society
Perch 2.0 transfers 'whale' to underwater tasks
Burns, Andrea, Harrell, Lauren, van Merriënboer, Bart, Dumoulin, Vincent, Hamer, Jenny, Denton, Tom
Perch 2.0 is a supervised bioacoustics foundation model pretrained on 14,597 species, including birds, mammals, amphibians, and insects, and has state-of-the-art performance on multiple benchmarks. Given that Perch 2.0 includes almost no marine mammal audio or classes in the training data, we evaluate Perch 2.0 performance on marine mammal and underwater audio tasks through few-shot transfer learning. We perform linear probing with the embeddings generated from this foundation model and compare performance to other pretrained bioacoustics models. In particular, we compare Perch 2.0 with previous multispecies whale, Perch 1.0, SurfPerch, AVES-bio, BirdAVES, and BirdNET V2.3 models, which have open-source tools for transfer learning and agile modeling. We show that the embeddings from the Perch 2.0 model have consistently high performance for few-shot transfer learning, generally outperforming alternative embedding models on the majority of tasks. We therefore recommend Perch 2.0 when developing new linear classifiers for marine mammal classification with few labeled examples.
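To make "linear probing" concrete, here is a minimal sketch, assuming the embeddings have already been computed by a frozen model. The toy vectors, the `toy_embedding` helper, and the two-class setup are hypothetical stand-ins and are not taken from the paper; only the idea of training a linear classifier on fixed embeddings with a handful of labels comes from the abstract.

```python
import math
import random


def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit binary logistic regression on fixed embeddings with gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, embedding):
    """Return 1 if the linear score is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, embedding)) + bias
    return 1 if score > 0 else 0


# Hypothetical stand-in for embeddings from a frozen model (not real model output).
rng = random.Random(0)


def toy_embedding(center):
    return [c + rng.gauss(0.0, 0.3) for c in center]


WHALE = [1.0, 1.0, 0.0, 0.0]
NOISE = [-1.0, -1.0, 0.0, 0.0]

# Few-shot setting: four labeled examples per class.
train_x = [toy_embedding(WHALE) for _ in range(4)] + [toy_embedding(NOISE) for _ in range(4)]
train_y = [1] * 4 + [0] * 4

weights, bias = train_linear_probe(train_x, train_y)
print(predict(weights, bias, WHALE), predict(weights, bias, NOISE))
```

Only the weights and bias are trained; the embedding model itself is never updated, which is what keeps the approach workable with few labeled examples.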
Quieter dental drills may be on the horizon
The high-pitched whine of dentistry tools creates a lot of anxiety, especially for kids. The fear of going to the dentist is called odontophobia. If the thought of going to the dentist makes your teeth chatter with fear, you're not alone. At least 15 to 20 percent of adults are believed to have odontophobia (aka dental anxiety), which prevents them from maintaining regular cleanings and dental check-ups.
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.06)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- Asia > Middle East > Republic of Türkiye (0.05)
Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection
Padovese, Bruno, Frazao, Fabio, Dowd, Michael, Joy, Ruth
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of deep generative models for data augmentation in marine mammal call detection, including Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75). A hybrid strategy combining generative-based synthesis with traditional methods achieved the best overall performance with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.
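The "traditional" augmentations named in the abstract can be sketched in a few lines. This is an illustration under assumptions: the abstract does not describe the exact procedures, so the circular shift and the zeroed segment below are one simple reading of time-shifting and masking, and the function names and toy waveform are hypothetical.

```python
def time_shift(waveform, shift):
    """Circularly shift samples by `shift` positions (positive moves audio later)."""
    n = len(waveform)
    if n == 0:
        return []
    shift %= n
    if shift == 0:
        return list(waveform)
    return list(waveform[-shift:]) + list(waveform[:-shift])


def mask_segment(waveform, start, length):
    """Return a copy with `length` samples zeroed, beginning at index `start`."""
    out = list(waveform)
    for i in range(start, min(start + length, len(out))):
        out[i] = 0.0
    return out


# Toy waveform standing in for a hydrophone clip (hypothetical values).
clip = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = time_shift(clip, 2)
masked = mask_segment(clip, 1, 2)
print(shifted)
print(masked)
```

Both functions return new lists and leave the input untouched, so the original clip and several augmented copies can sit side by side in a training set.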
- North America > Canada > British Columbia (0.05)
- North America > United States > Alaska (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
HergNet: a Fast Neural Surrogate Model for Sound Field Predictions via Superposition of Plane Waves
Calafà, Matteo, Xia, Yuanxin, Jeong, Cheol-Ho
ABSTRACT
We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
Index Terms: Helmholtz equation, wave fields, room acoustics, physics-informed neural networks
1. INTRODUCTION
Several physical phenomena are represented by the propagation of waves, especially in fields like acoustics, optics, quantum mechanics, electromagnetism and surface fluid mechanics [1, 2, 3, 4, 5]. Fast and accurate simulation of wave dynamics is therefore of great relevance to the scientific community, in particular in complex scenarios, where high frequencies, broad domains or long time intervals are considered.
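Why a superposition of plane waves satisfies the Helmholtz equation by construction can be checked numerically. The sketch below is an illustration, not the HergNet architecture: each term a·exp(i k d·x) with a unit direction d satisfies ∇²p + k²p = 0, so any weighted sum does too. The fixed amplitudes, directions, wavenumber, and evaluation point are hypothetical and stand in for whatever a network would learn.

```python
import cmath
import math


def plane_wave_field(x, y, k, angles, amplitudes):
    """Sum of 2D plane waves a * exp(i k (cos(t) x + sin(t) y)) with unit directions."""
    return sum(
        a * cmath.exp(1j * k * (math.cos(t) * x + math.sin(t) * y))
        for t, a in zip(angles, amplitudes)
    )


def helmholtz_residual(x, y, k, angles, amplitudes, h=1e-3):
    """Finite-difference estimate of laplacian(p) + k^2 p at the point (x, y)."""
    def p(u, v):
        return plane_wave_field(u, v, k, angles, amplitudes)

    laplacian = (p(x + h, y) + p(x - h, y) + p(x, y + h) + p(x, y - h) - 4 * p(x, y)) / h**2
    return laplacian + k**2 * p(x, y)


# Hypothetical values: wavenumber, three propagation angles, three amplitudes.
k = 10.0
angles = [0.0, math.pi / 3, 2.0]
amplitudes = [1.0, 0.5, 0.25]

residual = helmholtz_residual(0.3, -0.7, k, angles, amplitudes)
print(abs(residual))
```

The printed residual is not exactly zero because the Laplacian is approximated with a five-point stencil; it stays small next to the k² = 100 scale of the individual terms.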
Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport
Torres, Bernardo, Riou, Alain, Richard, Gaël, Peeters, Geoffroy
ABSTRACT
In this paper, we propose an Optimal Transport objective for learning one-dimensional translation-equivariant systems and demonstrate its applicability to single pitch estimation. Our method provides a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.
1. INTRODUCTION
Pitch estimation is a core task in audio analysis, long studied in the speech and Music Information Retrieval (MIR) communities [1]. It involves estimating the fundamental frequency of harmonic or quasi-harmonic signals, with traditional methods either relying on signal processing techniques to extract harmonicity cues [2-4] or matching the input spectrum to that of a synthetic waveform [5]. Recently, supervised deep learning approaches leveraging large annotated datasets (such as CREPE [6]) have achieved impressive accuracy, but come with notable challenges. In particular, labeling audio with the temporal precision needed for training (typically within a few milliseconds) is labor-intensive and prone to errors.
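One reason optimal transport pairs naturally with translation in one dimension can be shown with a small sketch. This is not the paper's objective, which the excerpt does not spell out. It only illustrates a standard property: on a unit-spaced grid, the 1-Wasserstein distance between a histogram and a copy shifted by k bins equals k. The function names and the toy pitch distribution are hypothetical.

```python
def wasserstein_1d(p, q):
    """W1 distance between two histograms on the same unit-spaced grid.

    Both inputs are assumed to sum to 1. In 1D, W1 equals the sum of absolute
    differences between the two cumulative distributions.
    """
    cumulative_gap = 0.0
    distance = 0.0
    for pi, qi in zip(p, q):
        cumulative_gap += pi - qi
        distance += abs(cumulative_gap)
    return distance


def shift_bins(p, k):
    """Translate a histogram k bins to the right, padding with zeros.

    Assumes the shifted mass stays on the grid (nothing falls off the end).
    """
    n = len(p)
    return [p[i - k] if 0 <= i - k < n else 0.0 for i in range(n)]


# Toy distribution over pitch bins (hypothetical values).
pitch_dist = [0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
transposed = shift_bins(pitch_dist, 3)
print(wasserstein_1d(pitch_dist, transposed))
```

The distance grows linearly with the size of the shift, so a loss built on it reflects how far a prediction moved under transposition, whereas a bin-wise comparison would treat any two non-overlapping distributions as equally different.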
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Asia > South Korea > Daejeon > Daejeon (0.04)
The Tonogenesis Continuum in Tibetan: A Computational Investigation
Tonogenesis, the historical process by which segmental contrasts evolve into lexical tone, has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. By analyzing sensitivity to pitch flattening across a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal U-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
- Europe > Austria > Vienna (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region, control at the sub-phonemic level has also been demonstrated, but the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart two vowel stimuli must lie in auditory space to yield reliably different imitations. This distance is termed 'Just Producible Difference' (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 × F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker's formant space, and hence offers a psychophysical explanation of observed trends in the number and patterns of possible vowel phonemes.
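To give the 14 to 51 mel range some scale, the sketch below converts formant pairs from Hz to mels and measures their Euclidean distance in F1 × F2 space. It rests on assumptions: the abstract does not say which mel formula or distance metric the study used, so the common 2595·log10(1 + f/700) conversion and a Euclidean distance are stand-ins, and the formant values are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels with the widely used 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def formant_distance_mel(vowel_a, vowel_b):
    """Euclidean distance between two (F1, F2) pairs given in Hz, measured in mels."""
    d_f1 = hz_to_mel(vowel_a[0]) - hz_to_mel(vowel_b[0])
    d_f2 = hz_to_mel(vowel_a[1]) - hz_to_mel(vowel_b[1])
    return math.hypot(d_f1, d_f2)


# Hypothetical front-vowel stimuli as (F1, F2) in Hz.
stimulus_a = (400.0, 2000.0)
stimulus_b = (430.0, 1950.0)

distance = formant_distance_mel(stimulus_a, stimulus_b)
print(round(distance, 1))

# Bounds of the JPD range reported in the abstract, in mels.
JPD_LOW, JPD_HIGH = 14.0, 51.0
print(distance < JPD_LOW, JPD_LOW <= distance <= JPD_HIGH, distance > JPD_HIGH)
```

Under these assumptions, a pair of stimuli closer than the lower bound would not be expected to yield reliably different imitations, while a pair beyond the upper bound would.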
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception
Venkateswaran, Nitin, Tang, Kevin, Wayland, Ratree
Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgements by native speakers of American English. Probing analyses show that accent strength is best predicted by a subset of the segment's pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India (0.04)
Nosey: Open-source hardware for acoustic nasalance
Dewhurst, Maya, Collins, Jack, Lo, Justin J. H., Alderton, Roy, Kirkham, Sam
We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
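For readers unfamiliar with the measure, a nasalance score is conventionally the nasal channel's acoustic amplitude as a percentage of the combined nasal and oral amplitude. The sketch below applies that conventional definition with RMS amplitude; it is not Nosey's processing pipeline, which the abstract does not describe, and it omits steps such as band-pass filtering that a real system may apply. The sample values and function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def nasalance_percent(nasal_samples, oral_samples):
    """Nasalance = 100 * nasal / (nasal + oral), using RMS amplitude per channel."""
    nasal = rms(nasal_samples)
    oral = rms(oral_samples)
    total = nasal + oral
    if total == 0.0:
        return 0.0  # both channels silent; the score is undefined, report 0
    return 100.0 * nasal / total


# Toy two-channel frame (hypothetical values): nasal and oral microphone signals.
nasal_frame = [0.1, -0.1, 0.1, -0.1]
oral_frame = [0.3, -0.3, 0.3, -0.3]

score = nasalance_percent(nasal_frame, oral_frame)
print(round(score, 1))
```

Because the score is a ratio, a constant gain difference between the two microphones shifts every score in the same direction, which is one way a device could read consistently higher than another while preserving contrasts between phonological environments.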
- Europe > Austria > Vienna (0.14)
- North America > United States > South Carolina (0.04)
- Europe > United Kingdom > England > Kent > Canterbury (0.04)