AITopics | legalbench

legalbench

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neural Information Processing Systems | Dec-26-2025, 07:22:13 GMT

The advent of large language models (LLMs) and their adoption by the legal community have given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning--which distinguish between its many forms--correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

collaboratively, legalbench, measuring legal reasoning, (4 more...)

Neural Information Processing Systems

Industry: Law (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Enguehard, Joseph, Van Ermengem, Morgane, Atkinson, Kate, Cha, Sujeong, Chowdhury, Arijit Ghosh, Ramaswamy, Prashanth Kallur, Roghair, Jeremy, Marlowe, Hannah R, Negreanu, Carina Suzana, Boxall, Kitty, Mincu, Diana

arXiv.org Artificial Intelligence | Oct-9-2025

Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.07243

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

LegalBench.PT: A Benchmark for Portuguese Law

Canaverde, Beatriz, Pires, Telmo Pessoa, Ribeiro, Leonor Melo, Martins, André F. T.

arXiv.org Artificial Intelligence | Feb-22-2025

The recent application of LLMs to the legal field has spurred the creation of benchmarks across various jurisdictions and languages. However, no benchmark has yet been specifically designed for the Portuguese legal system. In this work, we present LegalBench.PT, the first comprehensive legal benchmark covering key areas of Portuguese law. To develop LegalBench.PT, we first collect long-form questions and answers from real law exams, and then use GPT-4o to convert them into multiple-choice, true/false, and matching formats. Once generated, the questions are filtered and processed to improve the quality of the dataset. To ensure accuracy and relevance, we validate our approach by having a legal professional review a sample of the generated questions. Although the questions are synthetically generated, we show that their basis in human-created exams, together with our rigorous filtering and processing methods, results in a reliable benchmark for assessing LLMs' legal knowledge and reasoning abilities. Finally, we evaluate the performance of leading LLMs on LegalBench.PT and investigate potential biases in GPT-4o's responses. We also assess the performance of Portuguese lawyers on a sample of questions to establish a baseline for model comparison and validate the benchmark.

assumption, benchmark, legalbench, (16 more...)

arXiv.org Artificial Intelligence

2502.16357

Country:

North America > United States (0.68)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Singapore (0.04)
(10 more...)

Genre: Research Report (0.82)

Industry:

Law > Statutes (1.00)
Law > Business Law (1.00)
Law > Environmental Law (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neural Information Processing Systems | Jan-19-2025, 13:39:08 GMT

collaboratively, legalbench, measuring legal reasoning, (2 more...)

Neural Information Processing Systems

Industry: Law (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

Chlapanis, Odysseas S., Galanis, Dimitrios, Androutsopoulos, Ion

arXiv.org Artificial Intelligence | Oct-17-2024

We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2410.13352

Country:

North America > United States (0.28)
Europe > Croatia (0.14)
Europe > Greece (0.04)
(8 more...)

Genre: Research Report (0.82)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.94)
Law > International Law (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

Niklaus, Joel, Zheng, Lucia, McCarthy, Arya D., Hahn, Christopher, Rosen, Brian M., Henderson, Peter, Ho, Daniel E., Honke, Garrett, Liang, Percy, Manning, Christopher

arXiv.org Artificial Intelligence | Apr-2-2024

Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including improving Flan-T5 XL by 8 points or 16% over the baseline. However, the effect does not generalize across all tasks, training regimes, model sizes, and other factors. LawInstruct is a resource for accelerating the development of models with stronger information processing and decision making capabilities in the legal domain.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2404.02127

Country:

Europe > Germany (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland (0.05)
(22 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry:

Education > Curriculum > Subject-Specific Education (1.00)
Law > Litigation (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

Guha, Neel, Ho, Daniel E., Nyarko, Julian, Ré, Christopher

arXiv.org Artificial Intelligence | Sep-13-2022

Advances in language modeling are changing how American lawyers and administrators envision the practice of law [13]. In transactional settings, computational language tools are already being used in document review [18], and have shown promise for more sophisticated tasks like due diligence [4]. In administrative and civil settings [12, 14], many have identified the potential for computational tools to improve the accessibility of legal services [10, 26, 27, 30], thereby alleviating the United States' long-standing access-to-justice crisis [8]. Unsurprisingly, the high-risk nature of these tools--and their position in society--has inspired calls for better, law-specific evaluation and auditing regimes [11]. The potential for impactful computational legal language tools has been magnified by the development of language Foundation Models (FM)--large-scale models trained on massive corpora of text [2].

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2209.0612

Country:

North America > United States > California (0.05)
North America > United States > New York (0.04)
North America > United States > Texas (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.50)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)
