Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis
arXiv.org Artificial Intelligence
Code-mixing (CM), where speakers blend languages within a single expression, is prevalent in multilingual societies but poses challenges for natural language processing because of its complexity and the scarcity of data. We propose using a large language model to generate synthetic CM data, which is then used to improve task-specific models for CM sentiment analysis. Our results show that in Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques. In Malayalam-English, however, synthetic data helped only when the baseline was low; with strong natural data, additional synthetic data offered little benefit. Human evaluation confirmed that this approach is a simple, cost-effective way to generate natural-sounding CM sentences, and that it is particularly beneficial for low baselines. Our findings suggest that few-shot prompting of large language models is a promising method for CM data augmentation and has a significant impact on sentiment analysis, an important element in the development of social influence systems.
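The few-shot prompting idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' prompt or pipeline: the function name `build_few_shot_prompt`, the instruction wording, and the two seed sentences are hypothetical, and the call to a language model is deliberately left out.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    target_label: str,
    language_pair: str = "Spanish-English",
) -> str:
    """Assemble a few-shot prompt asking for one new code-mixed sentence.

    `examples` holds (sentence, sentiment_label) pairs taken from natural
    code-mixed data; `target_label` is the sentiment the new sentence should carry.
    """
    # Hypothetical instruction text; the real wording is not given in the abstract.
    lines = [
        f"Write a {language_pair} code-mixed sentence with the given sentiment.",
        "",
    ]
    # Each demonstration shows the label first, then the sentence.
    for sentence, label in examples:
        lines.append(f"Sentiment: {label}")
        lines.append(f"Sentence: {sentence}")
        lines.append("")
    # The final slot is left open for the model to complete.
    lines.append(f"Sentiment: {target_label}")
    lines.append("Sentence:")
    return "\n".join(lines)


# Hypothetical seed examples, invented purely for illustration.
seed_examples = [
    ("Me encanta this song, it's so good", "positive"),
    ("El servicio was terrible, no vuelvo", "negative"),
]

prompt = build_few_shot_prompt(seed_examples, "positive")
print(prompt)
```

The resulting string would be sent to whichever language model is in use, and each completion would be stored together with `target_label` as a synthetic training example for the sentiment classifier.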
Nov-1-2024
- Country:
- Asia (1.00)
- Europe (0.93)
- North America > Canada (0.28)
- North America > United States (0.46)
- Oceania > Australia (0.28)
- Genre:
- Research Report > New Finding (1.00)
- Technology: