MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources
Barham, Samuel, May, Chandler, Van Durme, Benjamin
arXiv.org Artificial Intelligence
We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.
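To make the idea of inline source texts with character offsets concrete, the following is a minimal sketch in Python. It is not the MegaWika 2 schema: the class names `Article` and `Citation`, every field name, the helper functions, and the example URL are hypothetical and chosen only for illustration. It shows one way a record could carry an article's text, the offsets at which each citation applies, and the scraped source text when scraping succeeded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    # Hypothetical fields: half-open character range [char_start, char_end)
    # in the article text that this citation supports.
    char_start: int
    char_end: int
    source_url: str
    # Scraped source text stored inline; None if the source was not scraped.
    source_text: Optional[str] = None


@dataclass
class Article:
    # Hypothetical record layout, not the dataset's actual schema.
    title: str
    language: str
    text: str
    citations: List[Citation] = field(default_factory=list)


def cited_span(article: Article, citation: Citation) -> str:
    """Return the slice of article text that the citation is attached to."""
    return article.text[citation.char_start:citation.char_end]


def scraped_citations(article: Article) -> List[Citation]:
    """Keep only the citations whose source text is available inline."""
    return [c for c in article.citations if c.source_text is not None]


# Illustrative data only; the URL and texts are made up.
sentence = "Vienna is the capital of Austria."
text = sentence + " It lies on the Danube."
start = text.index(sentence)

article = Article(
    title="Vienna",
    language="en",
    text=text,
    citations=[
        Citation(start, start + len(sentence), "https://example.org/a",
                 source_text="Vienna is Austria's capital city."),
        Citation(start + len(sentence) + 1, len(text), "https://example.org/b"),
    ],
)

for c in scraped_citations(article):
    print(repr(cited_span(article, c)), "<-", c.source_url)
```

Storing offsets rather than copies of the cited passage keeps each claim tied to its exact position in the article, which is the kind of alignment a fact-checking pipeline would need when pairing a claim with its source text.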
Aug-7-2025
- Country:
  - Asia > India (0.28)
  - Europe > Austria (0.28)
  - North America
    - United States (0.28)
    - Mexico (0.28)
- Genre:
  - Research Report (1.00)
- Industry:
  - Information Technology (0.67)
- Technology: