LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering
arXiv.org Artificial Intelligence
We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with corresponding legal article contexts, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding OpenAI's text-embedding-3-large outperforms Mistral's mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
Oct-16-2024
- Country:
  - Asia
    - China (0.14)
    - Pakistan
      - Balochistan (0.04)
      - Islamabad Capital Territory > Islamabad (0.04)
      - Khyber Pakhtunkhwa (0.04)
      - Punjab > Lahore Division
        - Lahore (0.04)
  - Europe > Switzerland (0.04)
  - North America > United States
    - Michigan (0.04)
    - New York > New York County
      - New York City (0.04)
    - Texas > Travis County
      - Austin (0.04)
- Genre:
  - Research Report (0.50)
- Industry:
  - Government > Regional Government
    - Asia Government > Pakistan Government (1.00)
  - Law (1.00)
- Technology: