Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction
Gomes, Juliana Resplande Sant'anna; Filho, Arlindo Rodrigues Galvão
arXiv.org Artificial Intelligence
The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets (corpora) that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and pre-processing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora. The main results demonstrate the methodology's viability, providing enriched corpora and analyses that confirm the utility of claim extraction, the influence of original data characteristics on the process, and the positive impact of enrichment on the performance of classification models (BERTimbau and Gemini 1.5 Flash), especially with fine-tuning. This work contributes valuable resources and insights for advancing SAFC in Portuguese.
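The abstract mentions near-duplicate detection as part of the pre-processing framework but does not say which technique is used. As a rough illustration only, the sketch below assumes one common approach: Jaccard similarity over word trigrams ("shingles"). The function names, the shingle size, and the 0.8 threshold are all hypothetical choices made for this example and are not taken from the dissertation.

```python
from __future__ import annotations

import re
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Texts shorter than n words are represented by a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(
    docs: list[str], threshold: float = 0.8, n: int = 3
) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold.

    The threshold is an assumed value for illustration, not one from the source.
    """
    sets = [shingles(d, n) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


docs = [
    "O governo anunciou novas medidas contra a pandemia hoje",
    "O governo anunciou novas medidas contra a pandemia hoje.",
    "Time vence o campeonato estadual de futebol",
]
pairs = near_duplicate_pairs(docs)
```

Here the first two texts differ only in the final period, so they are flagged as a pair, while the third shares no trigram with either. This version compares every pair of documents, which is quadratic in corpus size; for large corpora an approximate scheme such as MinHash with locality-sensitive hashing is the usual substitute.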
Aug-12-2025
- Country:
- Asia
- China > Hong Kong (0.04)
- Indonesia > Bali (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Macao (0.04)
- Middle East
- Qatar > Ad-Dawhah
- Doha (0.04)
- Syria > Daraa Governorate
- Dar'a (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Bulgaria > Varna Province
- Varna (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Portugal
- Spain
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Galicia
- A Coruña Province > Santiago de Compostela (0.04)
- Madrid (0.04)
- Switzerland (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Michigan > Genesee County
- Flint (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- South America
- Brazil
- Minas Gerais > Belo Horizonte (0.04)
- Rio de Janeiro > Rio de Janeiro (0.04)
- Ceará > Fortaleza (0.04)
- Pernambuco > Recife (0.04)
- Paraná > Curitiba (0.04)
- Federal District > Brasília (0.04)
- São Paulo (0.04)
- Goiás > Goiânia (0.04)
- Rio Grande do Sul > Porto Alegre (0.04)
- Chile > Santiago Metropolitan Region
- Santiago Province > Santiago (0.04)
- Genre:
- Overview (0.67)
- Research Report (0.70)
- Industry:
- Health & Medicine > Therapeutic Area
- Information Technology > Services (1.00)
- Media > News (0.70)
- Technology: