audio
- North America > United States > Illinois (0.04)
- Asia > India (0.04)
- North America > United States (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (7 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- North America > Canada > Quebec > Montreal (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
The overlooked driver of digital transformation
Clear, reliable audio is no longer optional, say Genevieve Juillard, CEO of IDC, and Chris Schyvinck, president and CEO at Shure. When business leaders talk about digital transformation, their focus often jumps straight to cloud platforms, AI tools, or collaboration software. Yet one of the most fundamental enablers of how organizations now work, and how employees experience that work, is often overlooked: audio. As Juillard notes, the shift to hybrid collaboration made every space, from corporate boardrooms to kitchen tables, meeting-ready almost overnight. In the scramble, audio quality often lagged, creating what research now shows is more than a nuisance: poor sound can alter how speakers are perceived, making them seem less credible or even less trustworthy. "Audio is the gatekeeper of meaning," stresses Juillard. "If people can't hear clearly, they can't understand you. And if they can't understand you, they can't trust you, and they can't act on what you said. And no amount of sharp video can fix that." For Shure, which has spent a century advancing sound technology, the implications extend far beyond convenience.
- North America > United States > Massachusetts (0.04)
- Europe > United Kingdom > England > East Sussex > Brighton (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Health & Medicine (0.68)
- Information Technology (0.48)
Self-Supervised Visual Acoustic Matching
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio---without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.
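The abstract mentions training via a conditional GAN framework. As a minimal sketch, the adversarial part could use hinge-style losses, a common choice for audio re-synthesis GANs; the paper's actual objective (and its residual-acoustics metric) may well differ, so treat the function names and loss form below as assumptions.

```python
# Hedged sketch: hinge-style adversarial losses for a conditional GAN
# re-synthesizer. Scores come from a discriminator that sees audio
# together with the target-scene conditioning; the exact losses used in
# the paper are not specified here.

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator loss: push real scores above +1, fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's score on generated audio."""
    return -sum(fake_scores) / len(fake_scores)
```

With well-separated scores (real above +1, fake below -1) the discriminator loss bottoms out at zero, which is why hinge losses tend to stabilize GAN training compared with unbounded objectives.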
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking
Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets ($\sim$136K samples, over 400 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health.
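Pretraining foundation models on unlabeled respiratory audio implies a self-supervised objective. One plausible choice is masked-frame prediction on spectrograms, sketched below; OPERA's actual pretraining recipes may differ, and the function names here are illustrative, not from the paper.

```python
import random

def mask_frames(spec, mask_ratio=0.5, seed=0):
    """Zero out a random subset of time frames of a spectrogram
    (a list of frames, each a list of Mel-bin values). A model would be
    trained to reconstruct the masked frames from the visible ones.
    Masked prediction is one plausible self-supervised objective;
    OPERA's exact recipes are not reproduced here."""
    rng = random.Random(seed)
    n = len(spec)
    idx = set(rng.sample(range(n), int(mask_ratio * n)))
    masked = [[0.0] * len(f) if i in idx else list(f)
              for i, f in enumerate(spec)]
    return masked, sorted(idx)

def masked_mse(pred, target, idx):
    """Mean squared reconstruction error over the masked frames only."""
    errs = [(p - t) ** 2 for i in idx for p, t in zip(pred[i], target[i])]
    return sum(errs) / len(errs)
```

Scoring only the masked frames keeps the model from earning credit for trivially copying the visible input.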
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Frechet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
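The abstract's frequency-weighted L1 loss on Mel-spectrograms can be sketched as below. The linear weight ramp across Mel bins is an assumption for illustration; the paper states only that higher frequencies are emphasized, not the exact weighting.

```python
def freq_weighted_l1(pred_mel, target_mel, alpha=1.0):
    """L1 distance between Mel-spectrograms (lists indexed [bin][frame]),
    with each bin weighted linearly from 1.0 (lowest bin) up to
    1.0 + alpha (highest bin) so that high-frequency errors cost more.
    The linear ramp is a hypothetical choice, not taken from the paper."""
    n_mels = len(pred_mel)
    total, count = 0.0, 0
    for b, (p_row, t_row) in enumerate(zip(pred_mel, target_mel)):
        w = 1.0 + alpha * b / (n_mels - 1)
        for p, t in zip(p_row, t_row):
            total += w * abs(p - t)
            count += 1
    return total / count
```

The intended effect is that an error in a high Mel bin contributes more to the loss than the same error in a low bin, nudging the decoder toward better high-frequency fidelity.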
- Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)
- Europe > United Kingdom > England > Berkshire > Reading (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)