ArtiFree: Detecting and Reducing Generative Artifacts in Diffusion-based Speech Enhancement

Chhaglani, Bhawana, Gao, Yang, Richter, Julius, Li, Xilin, Zadissa, Syavosh, Pruthi, Tarun, Lovitt, Andrew

Sep-25-2025–arXiv.org Artificial Intelligence

SGMSE [1] solves a stochastic differential equation with a learned score network, while the Schrödinger Bridge (SB) [3, 4] casts speech enhancement as an optimal transport problem. These approaches often outperform predictive baselines in perceptual quality and robustness [2, 10]. A key limitation, however, is the emergence of generative artifacts. Unlike predictive models, which mainly distort or suppress existing speech, diffusion-based SE can "hallucinate" new content. Artifacts include phonetic errors such as insertions or substitutions, spurious breathing or hissing, robotic tones, and high-frequency attenuation [10]. These effects are most pronounced at low SNR, as shown in Figure 1, where uncertainty drives the model to generate plausible but incorrect phonetic structures, leading to poor ASR performance despite high PESQ or STOI scores [9]. Existing metrics fail to fully capture these errors: intrusive metrics emphasize energy-based distortions at the signal level, while non-intrusive predictors favor naturalness and overrate generative outputs. Complementary measures such as Levenshtein phoneme distance (LPD) and hallucination error rate (HER) have been proposed to address this gap [11, 12].
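To make the idea behind a phoneme-level distance concrete, the following is a minimal sketch of a Levenshtein distance computed over phoneme sequences. It is an illustration under assumptions, not the definition used in [11]: the phoneme sequences are assumed to come from some phoneme recognizer applied to the clean reference and the enhanced output, the function names are hypothetical, and normalizing by the reference length is one plausible choice among several.

```python
from typing import Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (standard dynamic programming)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_phoneme_distance(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by reference length.
    The normalization is an assumption made for this sketch."""
    if len(ref) == 0:
        raise ValueError("reference phoneme sequence must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical phoneme strings: the enhanced output substitutes one
# phoneme (K -> B) and inserts a spurious one (S).
reference = ["DH", "AH", "K", "AE", "T"]
enhanced = ["DH", "AH", "B", "AE", "T", "S"]
distance = levenshtein(reference, enhanced)
normalized = normalized_phoneme_distance(reference, enhanced)
```

A distance of this kind responds to inserted or substituted phonemes even when the waveform sounds natural, which is the type of error the passage above says energy-based and naturalness-based metrics tend to miss.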

artifact, artificial intelligence, machine learning

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.89)
  - Speech > Speech Recognition (0.67)
