Iyer, Varun
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
Guha, Neel, Nyarko, Julian, Ho, Daniel E., Ré, Christopher, Chilton, Adam, Narayana, Aditya, Chohlas-Wood, Alex, Peters, Austin, Waldon, Brandon, Rockmore, Daniel N., Zambrano, Diego, Talisman, Dmitry, Hoque, Enam, Surani, Faiz, Fagan, Frank, Sarfaty, Galit, Dickinson, Gregory M., Porat, Haggai, Hegland, Jason, Wu, Jessica, Nudell, Joe, Niklaus, Joel, Nay, John, Choi, Jonathan H., Tobia, Kevin, Hagan, Margaret, Ma, Megan, Livermore, Michael, Rasumov-Rahe, Nikon, Holzenberger, Nils, Kolt, Noam, Henderson, Peter, Rehaag, Sean, Goel, Sharad, Gao, Shang, Williams, Spencer, Gandhi, Sunny, Zur, Tom, Iyer, Varun, Li, Zehua
The advent of large language models (LLMs) and their adoption by the legal community have given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process in which we collected tasks designed and hand-crafted by legal professionals. Because these subject-matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
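Since the benchmark is evaluated task-by-task with few-shot prompting, a minimal sketch of such an evaluation loop is given below. It assumes the benchmark's public Hugging Face release ("nguha/legalbench") with "text"/"answer" columns and uses a hypothetical query_llm function as a stand-in for whichever model is under test -- a sketch of the evaluation pattern, not the paper's exact harness.

```python
# A minimal sketch of a few-shot LegalBench evaluation loop. Assumptions:
# the "nguha/legalbench" Hugging Face release, "text"/"answer" columns,
# and a hypothetical query_llm stand-in for the model being evaluated.
from datasets import load_dataset

# Each task is a separate configuration; "abercrombie" asks the model to
# classify the distinctiveness category of a trademark.
task = load_dataset("nguha/legalbench", "abercrombie")
demos_split, test_split = task["train"], task["test"]

def query_llm(prompt: str) -> str:
    """Hypothetical: send the prompt to whichever LLM is under test."""
    raise NotImplementedError

# Few-shot prompt: prepend the expert-written demonstrations, then score
# exact-match accuracy over the test split.
demos = "\n\n".join(f"Text: {ex['text']}\nAnswer: {ex['answer']}" for ex in demos_split)
correct = 0
for ex in test_split:
    prediction = query_llm(f"{demos}\n\nText: {ex['text']}\nAnswer:")
    correct += prediction.strip().lower() == ex["answer"].strip().lower()
print(f"Accuracy: {correct / len(test_split):.3f}")
```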
ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation
Huang, Kuan-Hao, Iyer, Varun, Hsu, I-Hung, Kumar, Anoop, Chang, Kai-Wei, Galstyan, Aram
Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., from machine back-translation) usually suffer from a lack of syntactic diversity -- the generated paraphrases are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve performance on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
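The core mechanism -- parsing a sentence into an AMR graph and generating text back from it -- can be sketched with the third-party amrlib library (an assumed toolkit; the paper does not prescribe one). Because the AMR graph abstracts away surface syntax, the regenerated sentence can differ in structure from the source while preserving its meaning.

```python
# A minimal sketch of AMR back-translation using amrlib (assumed toolkit,
# not the authors' pipeline); pretrained parse/generate models must be
# downloaded separately per amrlib's documentation.
import amrlib

stog = amrlib.load_stog_model()  # sentence-to-graph (AMR parsing) model
gtos = amrlib.load_gtos_model()  # graph-to-sentence (AMR generation) model

source = "The quick brown fox jumped over the lazy dog."
graphs = stog.parse_sents([source])     # sentence -> PENMAN-encoded AMR graph
paraphrases, _ = gtos.generate(graphs)  # AMR graph -> regenerated sentence

# The regenerated sentence is a meaning-preserving, possibly
# syntactically different, paraphrase of the source.
print(paraphrases[0])
```

ParaAMR builds on this round trip at scale, generating from varied arrangements of each graph rather than a single linearization, which is where the added syntactic diversity comes from.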