Goto

Collaborating Authors

 legalbench


LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neural Information Processing Systems

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning--which distinguish between its many forms--correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.


LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Enguehard, Joseph, Van Ermengem, Morgane, Atkinson, Kate, Cha, Sujeong, Chowdhury, Arijit Ghosh, Ramaswamy, Prashanth Kallur, Roghair, Jeremy, Marlowe, Hannah R, Negreanu, Carina Suzana, Boxall, Kitty, Mincu, Diana

arXiv.org Artificial Intelligence

Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.


LegalBench.PT: A Benchmark for Portuguese Law

Canaverde, Beatriz, Pires, Telmo Pessoa, Ribeiro, Leonor Melo, Martins, André F. T.

arXiv.org Artificial Intelligence

The recent application of LLMs to the legal field has spurred the creation of benchmarks across various jurisdictions and languages. However, no benchmark has yet been specifically designed for the Portuguese legal system. In this work, we present LegalBench.PT, the first comprehensive legal benchmark covering key areas of Portuguese law. To develop LegalBench.PT, we first collect long-form questions and answers from real law exams, and then use GPT-4o to convert them into multiple-choice, true/false, and matching formats. Once generated, the questions are filtered and processed to improve the quality of the dataset. To ensure accuracy and relevance, we validate our approach by having a legal professional review a sample of the generated questions. Although the questions are synthetically generated, we show that their basis in human-created exams and our rigorous filtering and processing methods applied result in a reliable benchmark for assessing LLMs' legal knowledge and reasoning abilities. Finally, we evaluate the performance of leading LLMs on LegalBench.PT and investigate potential biases in GPT-4o's responses. We also assess the performance of Portuguese lawyers on a sample of questions to establish a baseline for model comparison and validate the benchmark.


LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neural Information Processing Systems

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning--which distinguish between its many forms--correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary.


LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

Chlapanis, Odysseas S., Galanis, Dimitrios, Androutsopoulos, Ion

arXiv.org Artificial Intelligence

We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.


FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

Niklaus, Joel, Zheng, Lucia, McCarthy, Arya D., Hahn, Christopher, Rosen, Brian M., Henderson, Peter, Ho, Daniel E., Honke, Garrett, Liang, Percy, Manning, Christopher

arXiv.org Artificial Intelligence

Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including improving Flan-T5 XL by 8 points or 16\% over the baseline. However, the effect does not generalize across all tasks, training regimes, model sizes, and other factors. LawInstruct is a resource for accelerating the development of models with stronger information processing and decision making capabilities in the legal domain.


LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

Guha, Neel, Ho, Daniel E., Nyarko, Julian, Ré, Christopher

arXiv.org Artificial Intelligence

Advances in language modeling are changing how American lawyers and administrators envision the practice of law [13]. In transactional settings, computational language tools are already being used in document review [18], and have illustrated promise for more sophisticated tasks like due diligence [4]. In administrative and civil settings [12, 14], many have identified the potential for computational tools to improve the accessibility of legal services [10, 26, 27, 30], thereby alleviating the United States' long standing access-to-justice crisis [8]. Unsurprisingly, the high risk nature of these tools--and their position in society--has inspired calls for better, law specific evaluation and auditing regimes [11]. The potential for impactful computational legal language tools has been magnified by the development of language Foundation Models (FM)--large scale models trained on massive corpora of text [2].