backbone model
- Information Technology > Communications (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Virginia > Arlington County (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (4 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Asia > China > Sichuan Province > Chengdu (0.05)
- Asia > Middle East > Israel > Haifa District > Haifa (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (2 more...)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Asia (0.04)
Dialect Identification Using Resource-Efficient Fine-Tuning Approaches
Lin, Zirui, Gulzar, Haris, Busto, Monnika Roslianna, Masaki, Akiko, Eda, Takeharu, Nakadai, Kazuhiro
Dialect Identification (DI) is a task to recognize different dialects within the same language from a speech signal. DI can help to improve the downstream speech related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computation cost and memory requirement. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to the general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed based on various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin subdialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.
Mitigating Gender Bias in Depression Detection via Counterfactual Inference
Hu, Mingxuan, Ma, Hongbo, Wu, Xinlan, Liu, Ziqi, Liu, Jiaqi, Chen, Yangbin
Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.