audio quality
The power of sound in a virtual world
In the digital age, sound is proving to be the greatest connector of all, say Erik Vaveris, vice president of product management and CMO at Shure, and Brian Scholl, director of the Perception and Cognition Laboratory at Yale University. In an era where business, education, and even casual conversations occur via screens, sound has become a differentiating factor. We obsess over lighting, camera angles, and virtual backgrounds, but how we sound can be just as critical to credibility, trust, and connection. Both see audio as more than a technical layer: it's a human factor shaping how people perceive intelligence, trustworthiness, and authority in virtual settings.

"If you're willing to take a little bit of time with your audio setup, you can really get across the full power of your message and the full power of who you are to your peers, to your employees, your boss, your suppliers, and of course, your customers," says Vaveris.

Scholl's research shows that poor audio quality can make a speaker seem less persuasive, less hireable, and even less credible. "We know that [poor] sound doesn't reflect the people themselves, but we really just can't stop ourselves from having those impressions," says Scholl. "We all understand intuitively that if we're having difficulty being understood while we're talking, then that's bad. But we sort of think that as long as you can make out the words I'm saying, then that's probably all fine. And this research showed in a somewhat surprising way, to a surprising degree, that this is not so."

For organizations navigating hybrid work, training, and marketing, the stakes have become high. Vaveris points out that the pandemic was a watershed moment for audio technology. As classrooms, boardrooms, and conferences shifted online almost overnight, demand accelerated for advanced noise suppression, echo cancellation, and AI-driven processing tools that make meetings more seamless.
Today, machine learning algorithms can strip away keyboard clicks or reverberation and isolate a speaker's voice in noisy environments. That clarity underpins the accuracy of AI meeting assistants that can step in to transcribe, summarize, and analyze discussions. The implications are rippling across industries. This technology empowers executives and creators alike to produce broadcast-quality content from the comfort of their home office, and it offers companies new ways to build credibility with customers and employees without the costly overhead of traditional production.
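The idea behind the noise suppression described above, estimating a noise floor and subtracting it so the voice stands out, can be sketched in its classical form as spectral subtraction. This is only a rough, non-ML stand-in for the learned suppression the article refers to; the frame sizes, the over-subtraction factor, and the assumption that the opening frames are speech-free are all illustrative choices here:

```python
import numpy as np

def spectral_gate(signal, frame_len=256, hop=128, noise_frames=4, alpha=2.0):
    """Suppress stationary noise by subtracting an estimated noise floor
    from each STFT frame (classical spectral subtraction)."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)
    # Estimate the noise floor from the first few (assumed speech-free) frames.
    noise_floor = mags[:noise_frames].mean(axis=0)
    cleaned = np.maximum(mags - alpha * noise_floor, 0.0)
    # Resynthesize by overlap-add, reusing the original phases.
    out = np.zeros(len(signal))
    for k, frame_spec in enumerate(cleaned * np.exp(1j * phases)):
        start = k * hop
        out[start:start + frame_len] += np.fft.irfft(frame_spec, n=frame_len)
    return out
```

Modern ML suppressors replace the fixed noise-floor estimate with a learned, time-varying mask, which is what lets them remove non-stationary sounds like keyboard clicks.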
The overlooked driver of digital transformation
Clear, reliable audio is no longer optional, say Genevieve Juillard, CEO of IDC, and Chris Schyvinck, president and CEO at Shure. When business leaders talk about digital transformation, their focus often jumps straight to cloud platforms, AI tools, or collaboration software. Yet one of the most fundamental enablers of how organizations now work, and how employees experience that work, is often overlooked: audio.

As Juillard notes, the shift to hybrid collaboration made every space, from corporate boardrooms to kitchen tables, meeting-ready almost overnight. In the scramble, audio quality often lagged, creating what research now shows is more than a nuisance. Poor sound can alter how speakers are perceived, making them seem less credible or even less trustworthy.

"Audio is the gatekeeper of meaning," stresses Juillard. "If people can't hear clearly, they can't understand you. And if they can't understand you, they can't trust you, and they can't act on what you said. And no amount of sharp video can fix that."

For Shure, which has spent a century advancing sound technology, the implications extend far beyond convenience.
HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal
Li, Kexin, Hu, Xiao, Grishchenko, Ilya, Lie, David
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is watermarking it so that it can be easily distinguished from genuine audio. Since those seeking to misuse AI-generated audio may therefore try to remove audio watermarks, studying effective watermark removal techniques is critical to objectively evaluating the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.
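The threat model above, needing only black-box access to the watermarker, can be illustrated with a deliberately toy version: generate (clean, watermarked) pairs yourself, then fit a removal model on them. The toy watermark and the mean-residual "model" below are entirely our own simplifications; the paper trains a dual-path convolutional autoencoder with GAN-style losses instead:

```python
import numpy as np

def toy_watermark(audio):
    """Stand-in for the targeted scheme's black-box embedder; a fixed
    additive narrowband pattern, purely illustrative of a real scheme."""
    t = np.arange(audio.size)
    return audio + 0.05 * np.sin(2 * np.pi * 0.23 * t)

def fit_removal(clean_samples, watermark_fn):
    """Learn a removal model from (clean, watermarked) pairs we generate
    ourselves, which is the only capability HarmonicAttack assumes.
    Here the 'model' is just the mean additive residual."""
    residuals = [watermark_fn(x) - x for x in clean_samples]
    return np.mean(residuals, axis=0)

rng = np.random.default_rng(1)
train = [rng.standard_normal(512) for _ in range(32)]
wm_estimate = fit_removal(train, toy_watermark)

# Strip the watermark from an unseen sample using the learned estimate.
unseen = rng.standard_normal(512)
attacked = toy_watermark(unseen) - wm_estimate
```

A real watermark is signal-dependent rather than a fixed additive pattern, which is why the paper needs a learned autoencoder rather than this closed-form subtraction; the sketch only shows the data-access assumption.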
FoleyBench: A Benchmark For Video-to-Audio Models
Dixit, Satvik, Saito, Koichi, Zhong, Zhi, Mitsufuji, Yuki, Donahue, Chris
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
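The curation criteria the abstract describes, visible sound sources, causal audio-visual correspondence, and the exclusion of speech- and music-dominated clips, can be sketched as a simple filter over per-clip metadata. The field names and thresholds below are our assumptions for illustration, not the benchmark's released schema:

```python
from dataclasses import dataclass

@dataclass
class FoleyClip:
    """Illustrative record for one (video, audio, caption) triplet."""
    video_id: str
    caption: str
    category: str             # e.g. a UCS/AudioSet label
    av_correspondence: float  # audio-visual correspondence score in [0, 1]
    duration_s: float

def keep_for_foley(clip, min_corr=0.5, excluded=("speech", "music")):
    """The curation rule the abstract implies: keep clips whose audio is
    causally tied to on-screen events, and drop speech- or
    music-dominated ones."""
    return clip.av_correspondence >= min_corr and clip.category not in excluded
```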
PitchFlower: A flow-based neural audio codec with pitch controllability
Torres, Diego, Roebel, Axel, Obin, Nicolas
Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFi-GAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
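The training-time perturbation described above, flattening the F0 contour and applying a random shift while keeping the true contour as conditioning, is simple enough to sketch directly. The choice of flattening to the voiced-frame mean and the shift range are our assumptions; the abstract does not specify them:

```python
import numpy as np

def perturb_f0(f0_hz, rng, max_shift_hz=40.0):
    """Flatten the F0 contour and apply one random global shift; return
    the perturbed contour (encoder input) alongside the true contour
    (decoder conditioning). 0.0 marks unvoiced frames."""
    voiced = f0_hz > 0
    flat = np.zeros_like(f0_hz)
    # One constant value across all voiced frames: mean F0 plus a random shift.
    flat[voiced] = f0_hz[voiced].mean() + rng.uniform(-max_shift_hz, max_shift_hz)
    return flat, f0_hz
```

Because the encoder never sees pitch variation, any pitch information reaching the decoder must flow through the conditioning channel, which is what makes the learned representation controllable.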
RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
Kim, Seungmin, Park, Sohee, Kim, Donghyun, Lee, Jisu, Choi, Daeseon
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attacks. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.
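The key architectural difference the abstract emphasizes, perturbing an embedding of the signal and reconstructing audio from it rather than adding noise to the waveform, can be sketched as a pipeline. An rfft stands in for RoVo's learned high-dimensional speech embedding, and the perturbation here is random rather than adversarial, so this shows only the shape of the approach:

```python
import numpy as np

def protect(audio, eps=0.05, seed=0):
    """Embed, perturb the embedding, then reconstruct: the protected
    waveform is decoded from a perturbed embedding, not noised directly.
    eps bounds each perturbed coefficient relative to the mean magnitude."""
    rng = np.random.default_rng(seed)
    emb = np.fft.rfft(audio)                     # toy 'embedding'
    delta = rng.standard_normal(emb.size) + 1j * rng.standard_normal(emb.size)
    delta *= eps * np.abs(emb).mean() / np.abs(delta)  # bound each coefficient
    return np.fft.irfft(emb + delta, n=audio.size)
```

The intuition for the claimed robustness is that an enhancement model operating on the waveform cannot cleanly separate a perturbation that was synthesized into the speech through the decoder, unlike additive waveform noise.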
SoK: How Robust is Audio Watermarking in Generative AI models?
Wen, Yizhu, Innuganti, Ashwin, Ramos, Aaron Bien, Guo, Hanqing, Yan, Qiben
Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at https://sokaudiowm.github.io/.
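The evaluation framework described above is essentially a sweep: apply each removal attack to watermarked audio and check whether the detector still fires. The attacks and parameters below are our illustrative examples of the signal-level category, not any of the paper's 109 configurations:

```python
import numpy as np

# Illustrative signal-level removal attacks of the kind such a framework
# sweeps over (names and parameters are ours).
ATTACKS = {
    "gaussian_noise": lambda x, rng: x + 0.01 * rng.standard_normal(x.size),
    "requantize_8bit": lambda x, rng: np.round(x * 127.0) / 127.0,
    "time_crop_10pct": lambda x, rng: x[x.size // 10:],
}

def run_battery(audio, detector, rng):
    """Run every attack and record whether a user-supplied watermark
    detector still fires on the distorted audio."""
    return {name: bool(detector(attack(audio, rng)))
            for name, attack in ATTACKS.items()}
```

Physical-level attacks (replay over a loudspeaker) and AI-induced ones (regeneration through a neural codec) fit the same callable interface, which is what makes a 109-configuration sweep tractable.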