Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers

Alkhalifa, Rabab, Kochkina, Elena, Zubiaga, Arkaitz

Nov-19-2022–arXiv.org Artificial Intelligence

Supervised text classification models are trained on labelled datasets (Sebastiani, 2002). From an experimental perspective, the design and evaluation of classification models typically rely on data from a fixed period of time. Recent research demonstrates that such models, while competitive in their experimental environment, underperform when they need to classify new data that is distant in time from the data observed during training (Alkhalifa and Zubiaga, 2022). This deterioration of performance has been demonstrated for different classification tasks, including topic classification (Rocha, Mourão, Pereira, Gonçalves, and Meira, 2008), sentiment classification (Lukes and Søgaard, 2018), hate speech detection (Florio, Basile, Polignano, Basile, and Patti, 2020), stance detection (Alkhalifa, Kochkina, and Zubiaga, 2021) and political ideology detection (Röttger and Pierrehumbert, 2021). The performance drop can have multiple causes, including, among others, the evolution of language use (Smith, 2004) and the evolution of public opinion (Bonilla and Mo, 2019), and its extent may vary (Alkhalifa et al., 2021). This poses an important challenge and limitation when one plans to keep using a model over a long period of time to classify new, incoming data, as can be the case with a stream of user-generated content (Cheng, Chen, Lee, and Li, 2021).

machine learning, natural language, text classification

Country:
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - California > Santa Clara County
    - Palo Alto (0.04)
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Russia > Volga Federal District
    - Nizhny Novgorod Oblast > Nizhny Novgorod (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East
    - Syria (0.04)
    - Iraq (0.04)
    - Saudi Arabia > Eastern Province
      - Dammam (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Law (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Text Classification (1.00)
    - Text Processing (0.93)
    - Machine Translation (0.67)
    - Information Extraction (0.66)
  - Machine Learning
    - Statistical Learning (1.00)
    - Performance Analysis > Accuracy (1.00)
    - Neural Networks > Deep Learning (1.00)
