ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity
Jayawardena, Lasal, Yapa, Prasan
arXiv.org Artificial Intelligence
Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using large language models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation, presenting one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
Apr-18-2024
- Country:
  - Africa > Sudan (0.04)
  - Asia
    - China > Hong Kong (0.04)
    - Japan
      - Hokkaidō > Hokkaidō Prefecture
        - Sapporo (0.04)
      - Honshū
        - Kansai > Kyoto Prefecture
          - Kyoto (0.04)
        - Kantō > Tokyo Metropolis Prefecture
          - Tokyo (0.14)
    - Middle East > UAE
      - Abu Dhabi Emirate > Abu Dhabi (0.04)
    - Sri Lanka (0.04)
  - Europe
    - Belgium (0.04)
    - Bulgaria > Sofia City Province
      - Sofia (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Hungary > Budapest
      - Budapest (0.04)
    - United Kingdom > Scotland
      - City of Aberdeen > Aberdeen (0.04)
  - North America
    - Dominican Republic (0.04)
    - United States
      - California > San Diego County
        - La Jolla (0.04)
      - Georgia > Fulton County
        - Atlanta (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
  - Oceania > Australia
- Genre:
  - Research Report > New Finding (1.00)
- Technology: