Meta learning with language models: Challenges and opportunities in the classification of imbalanced text

Vassilev, Apostol; Jin, Honglan; Hasan, Munawar

arXiv.org Artificial Intelligence 

Out-of-policy speech (OOPS) has permeated social media, with serious consequences for both individuals and society. Although it comprises only a small fraction of the content generated daily on social media, sifting through the data to quickly identify and eliminate the toxic content is difficult. The scale of the problem has long since passed the threshold at which automated detection becomes necessary. Yet it remains a challenging problem for machine learning (ML) because of the way OOPS manifests itself in datasets: context-dependent, nuanced, non-colloquial language that may even be syntactically incorrect. Because OOPS content usually makes up only a small fraction of a dataset, there is a high imbalance between OOPS and in-policy text. Relatedly, few high-quality labeled datasets exist with consistent definitions of OOPS and in-policy content. The difficulties are further exacerbated by significant differences between the distribution of the data a model was trained on and that of the data it sees during deployment. Faced with all of these challenges, ML models applied to natural language processing (NLP) tasks quickly reach a performance ceiling that limits their usefulness for sensitive tasks such as OOPS detection.
