The Massive Legal Embedding Benchmark (MLEB)

Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec

Oct-23-2025, arXiv.org Artificial Intelligence

We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology for building MLEB and creating the new constituent datasets, and release our code, results, and data openly to support reproducible evaluation.

information retrieval, large language model, natural language

Country:
- Oceania > Australia (0.35)
- Europe > Ireland (0.35)
- Asia > Singapore (0.34)
- North America > United States (0.29)

Genre:
- Research Report (0.50)

Industry:
- Law (1.00)
- Government > Regional Government (1.00)

Technology:
- Information Technology > Artificial Intelligence > Natural Language
  - Large Language Model (0.66)
  - Information Retrieval (0.58)