A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
Nguyen, Sang Quang, Van Nguyen, Kiet
arXiv.org Artificial Intelligence
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments with methods such as back-translation and EDA, baseline models such as BART and T5, and large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
Feb-10-2025
- Country:
  - North America > United States
    - Texas (0.04)
    - Maryland (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Europe
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
  - Asia > Vietnam
    - Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Leisure & Entertainment (0.67)
  - Media (0.46)
- Technology: