ArtiFree: Detecting and Reducing Generative Artifacts in Diffusion-based Speech Enhancement
Chhaglani, Bhawana, Gao, Yang, Richter, Julius, Li, Xilin, Zadissa, Syavosh, Pruthi, Tarun, Lovitt, Andrew
arXiv.org Artificial Intelligence
SGMSE [1] solves a stochastic differential equation with a learned score network, while the Schrödinger Bridge (SB) [3, 4] casts speech enhancement as an optimal transport problem. These approaches often outperform predictive baselines in perceptual quality and robustness [2, 10]. A key limitation, however, is the emergence of generative artifacts. Unlike predictive models, which mainly distort or suppress existing speech, diffusion-based SE can "hallucinate" new content. Artifacts include phonetic errors such as insertions or substitutions, spurious breathing or hissing, robotic tones, and high-frequency attenuation [10]. These effects are most pronounced at low SNR, as shown in Figure 1, where uncertainty drives the model to generate plausible but incorrect phonetic structures, leading to poor ASR performance despite high PESQ or STOI scores [9]. Existing metrics fail to fully capture these errors: intrusive metrics emphasize energy-based distortions at the signal level, while non-intrusive predictors favor naturalness and overrate generative outputs. Complementary measures such as Levenshtein phoneme distance (LPD) and hallucination error rate (HER) have been proposed to address this gap [11, 12].
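To make the idea behind a Levenshtein-based phoneme measure concrete, the following is a minimal sketch of edit distance over phoneme sequences. It is not the LPD definition from [11, 12]: the function names, the ARPAbet-style example symbols, and the normalization by reference length are all assumptions chosen for illustration, and the phoneme sequences would in practice come from a phoneme recognizer that is not shown here.

```python
from typing import Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the phoneme sequence `ref` into `hyp`."""
    # prev[j] holds the distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference phonemes
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def normalized_phoneme_distance(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by reference length (an assumed normalization)."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: the clean reference is "speech", while the
# enhanced output substitutes one phoneme and inserts another.
reference = ["S", "P", "IY", "CH"]
enhanced = ["S", "P", "IY", "K", "S"]
distance = levenshtein(reference, enhanced)
score = normalized_phoneme_distance(reference, enhanced)
```

In this example the enhanced signal may still sound natural, yet the phoneme-level distance exposes one substitution and one insertion, which is the kind of error that signal-level metrics can miss.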
Sep-25-2025