A Guide to Misinformation Detection Datasets
Thibault, Camille; Peloquin-Skulski, Gabrielle; Tian, Jacob-Junqi; Laflamme, Florence; Guan, Yuxiang; Rabbany, Reihaneh; Godbout, Jean-François; Pelrine, Kellin
arXiv.org Artificial Intelligence
Misinformation is a complex societal issue, and solutions to mitigate it are difficult to create due to data deficiencies. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all 36 datasets that consist of statements or claims. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality, spurious correlations, or political bias. We further provide state-of-the-art baselines on all these datasets, but show that, regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher-quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
Nov-7-2024
- Country:
- Africa > Nigeria (0.04)
- Asia
- Bangladesh (0.04)
- China (0.04)
- India (0.04)
- Indonesia (0.04)
- Malaysia (0.04)
- Middle East
- Vietnam (0.04)
- Europe
- North America
- Canada > Quebec
- Montreal (0.14)
- Dominican Republic (0.04)
- United States > Michigan (0.04)
- Oceania > New Zealand
- North Island > Auckland Region > Auckland (0.04)
- South America > Brazil (0.04)
- Genre:
- Overview (1.00)
- Research Report (1.00)
- Industry:
- Health & Medicine > Therapeutic Area
- Immunology (1.00)
- Infections and Infectious Diseases (1.00)
- Media > News (1.00)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Natural Language
- Chatbot (0.68)
- Large Language Model (1.00)
- Representation & Reasoning (0.92)
- Communications > Social Media (1.00)
- Data Science > Data Mining (1.00)
- Information Management > Search (1.00)