Leveraging LLMs to Create Content Corpora for Niche Domains
Franklin Zhang, Sonya Zhang, Alon Halevy
arXiv.org Artificial Intelligence
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
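The abstract mentions semantic deduplication of the extracted challenges. As a minimal illustration of the general idea (not the paper's actual method, which relies on LLM-enhanced techniques), the sketch below greedily keeps only items whose vector similarity to every previously kept item falls below a threshold. The bag-of-words `embed` function is a stand-in for a real semantic embedding model, and the 0.7 threshold is an arbitrary choice for this toy example.

```python
from collections import Counter
import math

def embed(text):
    # Placeholder embedding: bag-of-words counts. A real pipeline
    # would use a semantic embedding model here (an assumption;
    # the paper does not specify its embedding method).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe(items, threshold=0.7):
    # Greedy semantic deduplication: keep an item only if it is
    # not too similar to any previously kept item.
    kept, vecs = [], []
    for item in items:
        v = embed(item)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(item)
            vecs.append(v)
    return kept

challenges = [
    "Drink 8 glasses of water every day for 30 days",
    "Drink eight glasses of water daily for 30 days",
    "Write one page of a journal each morning",
]
print(dedupe(challenges))
```

With these toy inputs, the two near-identical water challenges collapse into one, while the journaling challenge is kept as distinct.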
Aug-1-2025