A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

Feb-10-2025–arXiv.org Artificial Intelligence

This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Feb-10-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Texas (0.04)
  - Maryland (0.04)
  - California > San Diego County
    - San Diego (0.04)
- Europe
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
- Asia > Vietnam
  - Hồ Chí Minh City > Hồ Chí Minh City (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Leisure & Entertainment (0.67)
- Media (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found