audio quality
The power of sound in a virtual world
In the digital age, sound is proving to be the greatest connector of all, say Erik Vaveris, vice president of product management and CMO at Shure, and Brian Scholl, director of the Perception and Cognition Laboratory at Yale University. In an era where business, education, and even casual conversations occur via screens, sound has become a differentiating factor. We obsess over lighting, camera angles, and virtual backgrounds, but how we sound can be just as critical to credibility, trust, and connection. Both see audio as more than a technical layer: it's a human factor shaping how people perceive intelligence, trustworthiness, and authority in virtual settings.

"If you're willing to take a little bit of time with your audio setup, you can really get across the full power of your message and the full power of who you are to your peers, to your employees, your boss, your suppliers, and of course, your customers," says Vaveris.

Scholl's research shows that poor audio quality can make a speaker seem less persuasive, less hireable, and even less credible. "We know that [poor] sound doesn't reflect the people themselves, but we really just can't stop ourselves from having those impressions," says Scholl. "We all understand intuitively that if we're having difficulty being understood while we're talking, then that's bad. But we sort of think that as long as you can make out the words I'm saying, then that's probably all fine. And this research showed in a somewhat surprising way, to a surprising degree, that this is not so."

For organizations navigating hybrid work, training, and marketing, the stakes have become high. Vaveris points out that the pandemic was a watershed moment for audio technology. As classrooms, boardrooms, and conferences shifted online almost overnight, demand accelerated for advanced noise suppression, echo cancellation, and AI-driven processing tools that make meetings more seamless.
Today, machine learning algorithms can strip away keyboard clicks or reverberation and isolate a speaker's voice in noisy environments. That clarity underpins the accuracy of AI meeting assistants that can step in to transcribe, summarize, and analyze discussions. The implications are rippling across industries. This technology empowers executives and creators alike to produce broadcast-quality content from the comfort of their home office, and it offers companies new ways to build credibility with customers and employees without the costly overhead of traditional production.
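The idea behind the noise suppression described above, estimating a noise floor and subtracting it so the voice stands out, can be sketched in its classical form as spectral subtraction. This is only a rough, non-ML stand-in for the learned suppression the article refers to; the frame sizes, the over-subtraction factor, and the assumption that the opening frames are speech-free are all illustrative choices here:

```python
import numpy as np

def spectral_gate(signal, frame_len=256, hop=128, noise_frames=4, alpha=2.0):
    """Suppress stationary noise by subtracting an estimated noise floor
    from each STFT frame (classical spectral subtraction)."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)
    # Estimate the noise floor from the first few (assumed speech-free) frames.
    noise_floor = mags[:noise_frames].mean(axis=0)
    cleaned = np.maximum(mags - alpha * noise_floor, 0.0)
    # Resynthesize by overlap-add, reusing the original phases.
    out = np.zeros(len(signal))
    for k, frame_spec in enumerate(cleaned * np.exp(1j * phases)):
        start = k * hop
        out[start:start + frame_len] += np.fft.irfft(frame_spec, n=frame_len)
    return out
```

Modern ML suppressors replace the fixed noise-floor estimate with a learned, time-varying mask, which is what lets them remove non-stationary sounds like keyboard clicks.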
The overlooked driver of digital transformation
Clear, reliable audio is no longer optional, say Genevieve Juillard, CEO of IDC, and Chris Schyvinck, president and CEO at Shure. When business leaders talk about digital transformation, their focus often jumps straight to cloud platforms, AI tools, or collaboration software. Yet one of the most fundamental enablers of how organizations now work, and how employees experience that work, is often overlooked: audio.

As Juillard notes, the shift to hybrid collaboration made every space, from corporate boardrooms to kitchen tables, meeting-ready almost overnight. In the scramble, audio quality often lagged, creating what research now shows is more than a nuisance. Poor sound can alter how speakers are perceived, making them seem less credible or even less trustworthy.

"Audio is the gatekeeper of meaning," stresses Juillard. "If people can't hear clearly, they can't understand you. And if they can't understand you, they can't trust you, and they can't act on what you said. And no amount of sharp video can fix that."

For Shure, which has spent a century advancing sound technology, the implications extend far beyond convenience.
HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal
Li, Kexin, Hu, Xiao, Grishchenko, Ilya, Lie, David
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is watermarking it so that it can be easily distinguished from genuine audio. Since those seeking to misuse AI-generated audio may therefore try to remove audio watermarks, studying effective watermark removal techniques is critical to objectively evaluating the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.
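The threat model above, needing only black-box access to the watermarker, can be illustrated with a deliberately toy version: generate (clean, watermarked) pairs yourself, then fit a removal model on them. The toy watermark and the mean-residual "model" below are entirely our own simplifications; the paper trains a dual-path convolutional autoencoder with GAN-style losses instead:

```python
import numpy as np

def toy_watermark(audio):
    """Stand-in for the targeted scheme's black-box embedder; a fixed
    additive narrowband pattern, purely illustrative of a real scheme."""
    t = np.arange(audio.size)
    return audio + 0.05 * np.sin(2 * np.pi * 0.23 * t)

def fit_removal(clean_samples, watermark_fn):
    """Learn a removal model from (clean, watermarked) pairs we generate
    ourselves, which is the only capability HarmonicAttack assumes.
    Here the 'model' is just the mean additive residual."""
    residuals = [watermark_fn(x) - x for x in clean_samples]
    return np.mean(residuals, axis=0)

rng = np.random.default_rng(1)
train = [rng.standard_normal(512) for _ in range(32)]
wm_estimate = fit_removal(train, toy_watermark)

# Strip the watermark from an unseen sample using the learned estimate.
unseen = rng.standard_normal(512)
attacked = toy_watermark(unseen) - wm_estimate
```

A real watermark is signal-dependent rather than a fixed additive pattern, which is why the paper needs a learned autoencoder rather than this closed-form subtraction; the sketch only shows the data-access assumption.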
FoleyBench: A Benchmark For Video-to-Audio Models
Dixit, Satvik, Saito, Koichi, Zhong, Zhi, Mitsufuji, Yuki, Donahue, Chris
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
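The curation criteria the abstract describes, visible sound sources, causal audio-visual correspondence, and the exclusion of speech- and music-dominated clips, can be sketched as a simple filter over per-clip metadata. The field names and thresholds below are our assumptions for illustration, not the benchmark's released schema:

```python
from dataclasses import dataclass

@dataclass
class FoleyClip:
    """Illustrative record for one (video, audio, caption) triplet."""
    video_id: str
    caption: str
    category: str             # e.g. a UCS/AudioSet label
    av_correspondence: float  # audio-visual correspondence score in [0, 1]
    duration_s: float

def keep_for_foley(clip, min_corr=0.5, excluded=("speech", "music")):
    """The curation rule the abstract implies: keep clips whose audio is
    causally tied to on-screen events, and drop speech- or
    music-dominated ones."""
    return clip.av_correspondence >= min_corr and clip.category not in excluded
```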
PitchFlower: A flow-based neural audio codec with pitch controllability
Torres, Diego, Roebel, Axel, Obin, Nicolas
Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFi-GAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
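The training-time perturbation described above, flattening the F0 contour and applying a random shift while keeping the true contour as conditioning, is simple enough to sketch directly. The choice of flattening to the voiced-frame mean and the shift range are our assumptions; the abstract does not specify them:

```python
import numpy as np

def perturb_f0(f0_hz, rng, max_shift_hz=40.0):
    """Flatten the F0 contour and apply one random global shift; return
    the perturbed contour (encoder input) alongside the true contour
    (decoder conditioning). 0.0 marks unvoiced frames."""
    voiced = f0_hz > 0
    flat = np.zeros_like(f0_hz)
    # One constant value across all voiced frames: mean F0 plus a random shift.
    flat[voiced] = f0_hz[voiced].mean() + rng.uniform(-max_shift_hz, max_shift_hz)
    return flat, f0_hz
```

Because the encoder never sees pitch variation, any pitch information reaching the decoder must flow through the conditioning channel, which is what makes the learned representation controllable.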
RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
Kim, Seungmin, Park, Sohee, Kim, Donghyun, Lee, Jisu, Choi, Daeseon
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attacks. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.
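The key architectural difference the abstract emphasizes, perturbing an embedding of the signal and reconstructing audio from it rather than adding noise to the waveform, can be sketched as a pipeline. An rfft stands in for RoVo's learned high-dimensional speech embedding, and the perturbation here is random rather than adversarial, so this shows only the shape of the approach:

```python
import numpy as np

def protect(audio, eps=0.05, seed=0):
    """Embed, perturb the embedding, then reconstruct: the protected
    waveform is decoded from a perturbed embedding, not noised directly.
    eps bounds each perturbed coefficient relative to the mean magnitude."""
    rng = np.random.default_rng(seed)
    emb = np.fft.rfft(audio)                     # toy 'embedding'
    delta = rng.standard_normal(emb.size) + 1j * rng.standard_normal(emb.size)
    delta *= eps * np.abs(emb).mean() / np.abs(delta)  # bound each coefficient
    return np.fft.irfft(emb + delta, n=audio.size)
```

The intuition for the claimed robustness is that an enhancement model operating on the waveform cannot cleanly separate a perturbation that was synthesized into the speech through the decoder, unlike additive waveform noise.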
SoK: How Robust is Audio Watermarking in Generative AI models?
Wen, Yizhu, Innuganti, Ashwin, Ramos, Aaron Bien, Guo, Hanqing, Yan, Qiben
Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at https://sokaudiowm.github.io/.
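The evaluation framework described above is essentially a sweep: apply each removal attack to watermarked audio and check whether the detector still fires. The attacks and parameters below are our illustrative examples of the signal-level category, not any of the paper's 109 configurations:

```python
import numpy as np

# Illustrative signal-level removal attacks of the kind such a framework
# sweeps over (names and parameters are ours).
ATTACKS = {
    "gaussian_noise": lambda x, rng: x + 0.01 * rng.standard_normal(x.size),
    "requantize_8bit": lambda x, rng: np.round(x * 127.0) / 127.0,
    "time_crop_10pct": lambda x, rng: x[x.size // 10:],
}

def run_battery(audio, detector, rng):
    """Run every attack and record whether a user-supplied watermark
    detector still fires on the distorted audio."""
    return {name: bool(detector(attack(audio, rng)))
            for name, attack in ATTACKS.items()}
```

Physical-level attacks (replay over a loudspeaker) and AI-induced ones (regeneration through a neural codec) fit the same callable interface, which is what makes a 109-configuration sweep tractable.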