Blair-Stanek, Andrew
LLMs Provide Unstable Answers to Legal Questions
Blair-Stanek, Andrew, Van Durme, Benjamin
An LLM is stable if it reaches the same conclusion when asked the identical question multiple times. We find leading LLMs like gpt-4o, claude-3.5, and gemini-1.5 are unstable when providing answers to hard legal questions, even when made as deterministic as possible by setting temperature to 0. We curate and release a novel dataset of 500 legal questions distilled from real cases, each involving two parties, with facts, competing legal arguments, and the question of which party should prevail. When provided the exact same question, we observe that LLMs sometimes say one party should win and other times say the other party should win. This instability has implications for the increasing number of legal AI products, legal processes, and lawyers relying on these LLMs.
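A minimal sketch of the kind of stability check this abstract describes, assuming the OpenAI Python SDK (>=1.0) with an API key set; the model name, question wording, and run count below are illustrative assumptions, not the paper's actual protocol.

```python
"""Repeatedly ask an LLM the identical question at temperature 0 and tally the answers."""
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_once(question: str, model: str = "gpt-4o") -> str:
    """Ask the model a single question, as deterministically as the API allows."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # nominally deterministic decoding
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()


def stability(question: str, runs: int = 10) -> Counter:
    """Send the identical question `runs` times and count each distinct answer.

    A stable model would put all of its mass on one answer; any spread across
    answers is the instability the abstract refers to.
    """
    return Counter(ask_once(question) for _ in range(runs))


if __name__ == "__main__":
    q = ("Given the facts and competing arguments below, answer with exactly one word, "
         "'Petitioner' or 'Respondent': which party should prevail? ...")
    print(stability(q))
```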
OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?
Blair-Stanek, Andrew, Holzenberger, Nils, Van Durme, Benjamin
The presenter pasted in what he called "about 16 pages' worth of tax code." These seven sentences about Alice, Bob, and Charlie come word-for-word from a handcrafted data set we developed at Johns Hopkins University and published in 2020 for training and measuring AI models for reasoning over statutory language. Every word, punctuation mark, and number in the taxpayer facts comes exactly from our tax_case_9 -- even the percent sign at the start of the line. (The entire livestream, "GPT-4 Developer Livestream," is available from OpenAI on YouTube; the tax law example starts at minute 19:11. In our data set, go to the directory "Cases" to find the file tax_case_9.pl, which is written in the programming language Prolog. This work has been supported by the U.S. National Science Foundation under grant No. 2204926.)

Where did the "about 16 pages' worth of tax code" come from? Again, from our 2020 data set. SARA has two parts: statutes and cases. Tax_case_9 is one of the handcrafted cases in SARA, and the statutes consist of nine sections of the Internal Revenue Code. From minute 20:07 to 20:40 of the livestream, we see some of the tax sections pasted into GPT-4. These are SARA's heavily edited versions of the IRC, pared down to simplify them and remove ambiguity. If you put all the SARA statutes into a single file, it will be about 16 pages long (depending on the font). For example, at 20:23, we see part of section 63(c) with the paragraphs jumping from (3) to (5); in SARA, we edited out (4). At 20:26, we see part of section 63(c)(6) with only subparagraphs (A), (B), and (D); in SARA, we edited out (C). At 20:40, we see parts of section 3306(b) with the paragraphs jumping from (2) to (7); in SARA, we edited out paragraphs (3) through (6). At 20:39, we see sections 3301 and 3306 regarding the federal unemployment tax; while these two sections are irrelevant to Alice and Bob's tax liability in tax_case_9, they are two of the nine SARA statutes. The author Holzenberger did all the handcrafting and hand editing.

One of our edits was paring section 1 down to only sections 1(a) through (d), which contain the Clinton-era tax rates. We cut section 1(j), which contains the reduced Tax Cuts and Jobs Act rates for 2018-2025. This editing explains why GPT-4 got the wrong answer on the livestream for Alice and Bob's 2018 taxes. We did not, however, edit out the TCJA standard deduction increase, under which the standard deduction for 2018 was $24,000.

The presenter then gives directions to GPT-4: "Now calculate their total liability." GPT-4 gives detailed step-by-step calculations and concludes that "Alice and Bob's total tax liability for 2018 is . . . ." We empirically verified that using the SARA version of the IRC causes GPT-4 to get the wrong answer. You can download our data set and compare it with the livestream's recording on YouTube. First, we pasted into GPT-4 all nine SARA statutes, plus our facts about Alice, Bob, and Charlie. Then we used the same "Now calculate their total liability" command.
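A rough sketch of the replication described above: concatenate the SARA statutes and the tax_case_9 facts, then issue the livestream's "Now calculate their total liability" command. The local directory layout (a `sara/` checkout with a `statutes/` folder and `Cases/tax_case_9.pl`), the fact-extraction step, and the use of the OpenAI SDK are assumptions for illustration, not the authors' actual script.

```python
"""Paste the nine SARA statutes plus the tax_case_9 facts into GPT-4 and ask for total liability."""
from pathlib import Path

from openai import OpenAI

SARA_DIR = Path("sara")  # hypothetical local download of the 2020 data set


def load_statutes() -> str:
    # Concatenate SARA's edited statute sections (about 16 pages in total).
    return "\n\n".join(p.read_text() for p in sorted((SARA_DIR / "statutes").glob("*")) if p.is_file())


def load_facts() -> str:
    # tax_case_9.pl is a Prolog file; a real script would pull out only the
    # natural-language taxpayer facts about Alice, Bob, and Charlie.
    return (SARA_DIR / "Cases" / "tax_case_9.pl").read_text()


def ask_gpt4(prompt: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    prompt = load_statutes() + "\n\n" + load_facts() + "\n\nNow calculate their total liability."
    print(ask_gpt4(prompt))
```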
InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction
Mondal, Ishani, Yuan, Michelle, N, Anandhavelu, Garimella, Aparna, Ferraro, Francis, Blair-Stanek, Andrew, Van Durme, Benjamin, Boyd-Graber, Jordan
Learning template-based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain templates; however, real-world IE settings do not have pre-defined schemas, and the templates must be figured out as you go. To quickly bootstrap templates in a real-world setting, we need to induce template slots from documents with zero or minimal supervision. Since the purpose of question answering intersects with the goal of information extraction, we use automatic question generation to induce template slots from the documents and investigate how a tiny amount of proxy human supervision on the fly (termed InteractiveIE) can further boost performance. Extensive experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging trends of performance improvement using InteractiveIE over AI-only baselines.
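A toy illustration of the loop this abstract describes: induce candidate slots from automatically generated questions, then let a human correct them on the fly. This is not the InteractiveIE pipeline; the question generator is stubbed out with canned examples and the "human supervision" is a simple interactive rename prompt, purely to make the idea concrete.

```python
"""Induce template slots from generated questions, with a tiny human-in-the-loop correction step."""
from collections import defaultdict


def generate_questions(document: str) -> list[tuple[str, str]]:
    """Stub for an automatic question-generation model.

    Returns (question, answer-span) pairs; a real system would produce these
    with a trained QG model or an LLM prompted over the document.
    """
    return [
        ("Who filed the complaint?", "Acme Corp."),
        ("When was the contract signed?", "March 3, 2019"),
        ("What amount is in dispute?", "$1.2 million"),
    ]


def induce_slots(document: str) -> dict[str, list[str]]:
    """Turn each generated question into a candidate slot name (here, naively
    the normalized question string itself) and collect its answers as fillers."""
    slots: dict[str, list[str]] = defaultdict(list)
    for question, answer in generate_questions(document):
        slot = question.rstrip("?").lower().replace(" ", "_")
        slots[slot].append(answer)
    return slots


def human_feedback(slots: dict[str, list[str]]) -> dict[str, list[str]]:
    """Tiny proxy for on-the-fly supervision: the user may rename each induced
    slot (press Enter to keep it), the kind of cheap signal the abstract says
    can boost performance."""
    revised: dict[str, list[str]] = {}
    for slot, fillers in slots.items():
        new_name = input(f"Slot '{slot}' with fillers {fillers} -- rename to (Enter keeps): ").strip()
        revised[new_name or slot] = fillers
    return revised


if __name__ == "__main__":
    doc = "Acme Corp. filed the complaint over a contract signed March 3, 2019 ..."
    print(human_feedback(induce_slots(doc)))
```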
BLT: Can Large Language Models Handle Basic Legal Text?
Blair-Stanek, Andrew, Holzenberger, Nils, Van Durme, Benjamin
We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.
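A sketch of a BLT-style probe in the spirit of the task described above, not the released benchmark's actual code or data: build a numbered transcript, ask for the text at one line, and score by exact match. The synthetic transcript text and the `ask_model` stub are assumptions; the stub can be replaced with any LLM call.

```python
"""Probe whether a model can quote the text at a requested line of a numbered transcript."""
import random


def make_transcript(n_lines: int = 50) -> list[str]:
    # Synthetic deposition-like lines; the real benchmark uses legal text.
    return [f"Witness statement number {i}, regarding exhibit {chr(65 + i % 26)}." for i in range(1, n_lines + 1)]


def make_item(lines: list[str]) -> tuple[str, str]:
    """Build one (prompt, gold answer) pair asking for a random line."""
    target = random.randrange(1, len(lines) + 1)
    numbered = "\n".join(f"{i}  {text}" for i, text in enumerate(lines, start=1))
    prompt = f"{numbered}\n\nQuote exactly the text of line {target}, and nothing else."
    return prompt, lines[target - 1]


def score(ask_model, n_items: int = 20) -> float:
    """Fraction of items where the model's reply exactly matches the gold line."""
    lines = make_transcript()
    hits = 0
    for _ in range(n_items):
        prompt, gold = make_item(lines)
        hits += ask_model(prompt).strip() == gold
    return hits / n_items


if __name__ == "__main__":
    # Trivial stand-in "model" that always answers with line 1; swap in a real LLM call.
    print(score(lambda prompt: "Witness statement number 1, regarding exhibit B."))
```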
Can GPT-3 Perform Statutory Reasoning?
Blair-Stanek, Andrew, Holzenberger, Nils, Van Durme, Benjamin
Statutory reasoning is the task of reasoning with facts and statutes, which are rules written in natural language by a legislature. It is a basic legal skill. In this paper we explore the capabilities of the most capable GPT-3 model, text-davinci-003, on an established statutory-reasoning dataset called SARA. We consider a variety of approaches, including dynamic few-shot prompting, chain-of-thought prompting, and zero-shot prompting. While we achieve results with GPT-3 that are better than the previous best published results, we also identify several types of clear errors it makes. We investigate why these errors happen. We discover that GPT-3 has imperfect prior knowledge of the actual U.S. statutes on which SARA is based. More importantly, we create simple synthetic statutes, which GPT-3 is guaranteed not to have seen during training. We find GPT-3 performs poorly at answering straightforward questions about these simple synthetic statutes.
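A toy generator in the spirit of the simple synthetic statutes mentioned above; the paper's actual templates and vocabulary are not reproduced here, and the nonce terms and thresholds below are invented for illustration. The point is only that the statute is guaranteed unseen during training, so answering requires reading it.

```python
"""Generate a simple synthetic statute plus a yes/no case with a gold label."""
import random

NONCE_TERMS = ["blarg", "frimple", "quozzle", "dratch"]


def make_statute_and_case() -> tuple[str, str, bool]:
    term = random.choice(NONCE_TERMS)
    threshold = random.choice([3, 5, 8])
    amount = random.randint(1, 10)
    statute = (f"Section 101. Any person who owns more than {threshold} {term}s "
               f"shall pay a {term} duty.")
    case = f"Alice owns {amount} {term}s. Does Alice have to pay the {term} duty?"
    answer = amount > threshold  # gold label for the yes/no question
    return statute, case, answer


if __name__ == "__main__":
    statute, case, answer = make_statute_and_case()
    prompt = f"{statute}\n\n{case} Answer Yes or No."
    print(prompt, "\nGold:", "Yes" if answer else "No")
```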