Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining
Le, Van-Hoang; Nguyen, Duc-Vu; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy
arXiv.org Artificial Intelligence
Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
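The abstract names the Exist@m metric and semi-hard negatives without defining either. As a rough illustration only, the sketch below assumes that Exist@m is the fraction of queries with at least one relevant document among the top m retrieved candidates, and that a semi-hard negative is a non-relevant document scoring just below the weakest positive, within a margin. Both definitions, the function names, and the `margin` and `k` parameters are assumptions made for this example, not details taken from the paper.

```python
from typing import Dict, List, Set


def exist_at_m(ranked: Dict[str, List[str]],
               relevant: Dict[str, Set[str]],
               m: int) -> float:
    """Assumed definition: share of queries whose top-m list
    contains at least one relevant document."""
    if not ranked:
        return 0.0
    hits = sum(
        1 for query, docs in ranked.items()
        if set(docs[:m]) & relevant.get(query, set())
    )
    return hits / len(ranked)


def semi_hard_negatives(scores: Dict[str, float],
                        positives: Set[str],
                        margin: float = 0.1,
                        k: int = 2) -> List[str]:
    """Assumed definition: negatives scoring below the weakest
    positive but no more than `margin` below it, hardest first."""
    pos_scores = [scores[d] for d in positives if d in scores]
    if not pos_scores:
        return []
    anchor = min(pos_scores)  # weakest positive score
    candidates = [
        (score, doc) for doc, score in scores.items()
        if doc not in positives and anchor - margin <= score < anchor
    ]
    candidates.sort(reverse=True)  # highest-scoring negatives first
    return [doc for _, doc in candidates[:k]]


# Toy data (hypothetical document ids and similarity scores).
ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"c"}, "q2": {"z"}}
scores = {"p": 0.8, "n1": 0.9, "n2": 0.75, "n3": 0.72, "n4": 0.3}

coverage = exist_at_m(ranked, relevant, m=3)
negatives = semi_hard_negatives(scores, positives={"p"})
```

Under these assumptions, a negative that outscores the positive (`n1`) is skipped because it may be an unlabeled relevant document, and a very low-scoring one (`n4`) is skipped because it carries little training signal.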
Jul-22-2025
- Country:
  - Asia
    - China > Hong Kong (0.04)
    - Thailand > Bangkok > Bangkok (0.04)
    - Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.05)
  - Europe
    - Belgium > Brussels-Capital Region > Brussels (0.04)
    - Ireland > Leinster > County Dublin > Dublin (0.04)
  - South America > Colombia > Meta Department > Villavicencio (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Law (1.00)
- Technology: