Meta learning with language models: Challenges and opportunities in the classification of imbalanced text
Vassilev, Apostol; Jin, Honglan; Hasan, Munawar
arXiv.org Artificial Intelligence
Out-of-policy speech (OOPS) has permeated social media, with serious consequences for both individuals and society. Although it makes up a small fraction of the content generated daily on social media, sifting through the data to quickly identify and remove the toxic content is difficult. The scale of the problem long ago passed the threshold at which automated detection becomes necessary. Yet OOPS detection remains a challenging problem for machine learning (ML) because of the way OOPS manifests itself in datasets: as context-dependent, nuanced, non-colloquial language that may even be syntactically incorrect. Because OOPS content is usually only a small fraction of a dataset, the classes are highly imbalanced between OOPS and in-policy text. Relatedly, few high-quality labeled datasets exist with consistent definitions of OOPS and in-policy content. These difficulties are further exacerbated by significant differences between the distribution of the data a model was trained on and that of the data it sees during deployment. Faced with all of these challenges, ML models applied to natural language processing (NLP) tasks quickly reach a performance ceiling that limits their usefulness for sensitive tasks such as OOPS detection.
Oct-24-2023
- Country:
- North America > United States (0.46)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Health & Medicine (1.00)
- Technology: