Classification of worldwide news articles by perceived quality, 2018-2024
McElroy, Connor; de Oliveira, Thiago E. A.; Brogly, Chris
arXiv.org Artificial Intelligence
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low- and high-quality classes of about 706,000 articles each. Each article was labelled at the website level and described by 194 linguistic features. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC-AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
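The website-level labelling described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the site names, rating values, and function names below are hypothetical, and the tie-handling rule (ratings equal to the median go to the higher class) is an assumption, since the abstract does not say how ties were resolved.

```python
from statistics import median


def median_split(site_ratings):
    """Map each site to 'low' or 'high' by comparing its rating to the median."""
    cutoff = median(site_ratings.values())
    # Assumption: a rating equal to the median is placed in the 'high' class.
    return {
        site: ("high" if rating >= cutoff else "low")
        for site, rating in site_ratings.items()
    }


def label_articles(articles, site_labels):
    """Give each (site, text) article the label of its source website.

    Articles from sites without a rating are skipped.
    """
    return [
        (text, site_labels[site])
        for site, text in articles
        if site in site_labels
    ]


# Hypothetical expert ratings for four source websites.
ratings = {
    "a.example": 0.91,
    "b.example": 0.42,
    "c.example": 0.77,
    "d.example": 0.30,
}
site_labels = median_split(ratings)

# Hypothetical articles as (source site, article text) pairs.
articles = [
    ("a.example", "first article text"),
    ("d.example", "second article text"),
    ("unrated.example", "third article text"),
]
labelled = label_articles(articles, site_labels)
```

Because every article inherits its label from its source site, the labels describe the perceived quality of the outlet rather than an independent judgment of each article.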
Nov-21-2025
- Country:
  - Europe > Germany
    - Berlin (0.04)
  - North America
    - Canada > Ontario
      - Simcoe County > Orillia (0.04)
    - United States > New York
      - New York County > New York City (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology: