LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

Oct-16-2024–arXiv.org Artificial Intelligence

We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with its corresponding legal article context, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding that OpenAI's text-embedding-3-large outperforms Mistral's mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
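The abstract does not say how retrieval performance was scored. As a rough illustration only, one common way to compare embedding models on question-to-context retrieval is top-k accuracy under cosine similarity. The sketch below assumes that setup; the function names and the two-dimensional toy vectors are hypothetical and are not taken from the paper or from any embedding API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(question_vecs, context_vecs, gold_ids, k=1):
    """Fraction of questions whose gold context is among the k most similar contexts."""
    hits = 0
    for q_vec, gold in zip(question_vecs, gold_ids):
        # Rank context indices from most to least similar to the question.
        ranked = sorted(
            range(len(context_vecs)),
            key=lambda i: cosine(q_vec, context_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_ids)


# Toy stand-ins for embeddings (assumption: real vectors would come from an embedding model).
context_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
gold_ids = [0, 1, 2]  # index of the article context that answers each question

top1 = top_k_accuracy(question_vecs, context_vecs, gold_ids, k=1)
top2 = top_k_accuracy(question_vecs, context_vecs, gold_ids, k=2)
```

In this toy setup the third question is closest to the wrong context at rank 1 but recovers its gold context at rank 2, which is the kind of difference a top-k comparison between two embedding models would surface.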

large language model, machine learning, natural language

Country:
- Europe > Switzerland (0.04)
- North America > United States
  - Michigan (0.04)
  - Texas > Travis County
    - Austin (0.04)
  - New York > New York County
    - New York City (0.04)
- Asia
  - China (0.14)
  - Pakistan
    - Khyber Pakhtunkhwa (0.04)
    - Islamabad Capital Territory > Islamabad (0.04)
    - Balochistan (0.04)
    - Punjab > Lahore Division
      - Lahore (0.04)

Genre:
- Research Report (0.50)

Industry:
- Law (1.00)
- Government > Regional Government
  - Asia Government > Pakistan Government (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.91)
