
Collaborating Authors

 Van Durme, Benjamin


Dated Data: Tracing Knowledge Cutoffs in Large Language Models

arXiv.org Artificial Intelligence

Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff, the date at which their training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align with their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the designer-reported cutoff and applies separately to sub-resources and topics. We propose a simple approach to estimate the effective cutoff of an LLM at the resource level by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis of open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases in CommonCrawl data due to non-trivial amounts of old data in new dumps, and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed, and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
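To make the probing idea concrete, here is a minimal sketch of resource-level cutoff estimation under our own simplifying assumptions, not the paper's implementation: given dated versions of the same resource (e.g., monthly Wikipedia snapshots) and a hypothetical scoring callable `model_logprob`, the effective cutoff is taken as the version the model scores highest.

```python
# Minimal sketch, not the paper's code. Assumes `model_logprob(text)` returns
# the model's total log-probability for a text span (e.g., from a local HF
# model), and `versions` maps a date string to the text of the same resource
# (such as one Wikipedia article) as it appeared on that date.

def effective_cutoff(versions: dict[str, str], model_logprob) -> str:
    """Return the version date whose text the model scores highest."""
    scores = {
        date: model_logprob(text) / max(len(text), 1)  # length-normalized
        for date, text in versions.items()
    }
    return max(scores, key=scores.get)
```

Aggregating this per-resource estimate over many documents from the same source would then give the resource-level temporal alignment the abstract describes.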


A Closer Look at Claim Decomposition

arXiv.org Artificial Intelligence

As generated text becomes more commonplace, it is increasingly important to evaluate how well-supported such text is by external knowledge sources. Many approaches for evaluating textual support rely on some method for decomposing text into its individual subclaims which are scored against a trusted reference. We investigate how various methods of claim decomposition -- especially LLM-based methods -- affect the result of an evaluation approach such as the recently proposed FActScore, finding that it is sensitive to the decomposition method used. This sensitivity arises because such metrics attribute overall textual support to the model that generated the text even though error can also come from the metric's decomposition step. To measure decomposition quality, we introduce an adaptation of FActScore, which we call DecompScore. We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell's theory of logical atomism and neo-Davidsonian semantics and demonstrate its improved decomposition quality over previous methods.
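As a rough illustration of the kind of pipeline being evaluated, the sketch below computes a FActScore-style support fraction; `decompose` and `is_supported` are hypothetical stand-ins for the LLM decomposition step and the reference-checking step, not the paper's code. DecompScore applies the same machinery to judge the quality of the decomposition itself rather than of the generated text.

```python
# Minimal sketch of a FActScore-style metric. `decompose` (an LLM prompted to
# split text into atomic subclaims) and `is_supported` (a check against a
# trusted reference) are assumed interfaces, not the paper's implementation.

def factscore(text: str, reference: str, decompose, is_supported) -> float:
    """Fraction of decomposed subclaims supported by the reference."""
    subclaims = decompose(text)
    if not subclaims:
        return 0.0
    supported = sum(is_supported(claim, reference) for claim in subclaims)
    return supported / len(subclaims)
```

The abstract's point is visible here: the score depends on `decompose`, so two decomposition methods can assign the same generation different support fractions.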


TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning

arXiv.org Artificial Intelligence

It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, suffer degraded performance on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.
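A minimal data structure conveys the shape of such an output; the field names below are illustrative, not the TV-TREES schema. Each internal node is a conclusion supported by its children, and leaves are simple premises directly entailed by the video's dialogue or frames.

```python
# Illustrative structure for a multimodal entailment tree; field names are our
# own, not the TV-TREES schema.
from dataclasses import dataclass, field

@dataclass
class EntailmentNode:
    claim: str                                          # conclusion at this node
    evidence: list[str] = field(default_factory=list)   # dialogue/visual premises
    children: list["EntailmentNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Leaves are simple premises directly entailed by the video content.
        return not self.children
```

Evaluating "multimodal entailment tree generation" then amounts to checking, node by node, whether each claim really follows from its children and cited evidence.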


LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

arXiv.org Artificial Intelligence

Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms behind successful tool-use behaviors in biological systems: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
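The abstract's three mechanisms suggest a simple exploration loop. The sketch below is a schematic reading of STE under our own assumptions; `llm`, `tool`, and the prompt strings are hypothetical placeholders, not the paper's actual interfaces.

```python
# Schematic STE loop; `llm` (a text-in, text-out callable) and `tool` (with
# assumed attributes .name and .execute) are placeholders. Multi-turn
# refinement within an episode (short-term memory) is compressed to a single
# reflection step here for brevity.

def ste_explore(llm, tool, n_episodes: int) -> list:
    long_term_memory = []                  # experiences kept across episodes
    for _ in range(n_episodes):
        # "Imagination": invent a plausible usage scenario, conditioned on
        # recent episodes so exploration stays broad rather than repetitive.
        scenario = llm(f"Imagine a user query needing {tool.name}, "
                       f"unlike these: {long_term_memory[-3:]}")
        call = llm(f"Write a call to {tool.name} that answers: {scenario}")
        feedback = tool.execute(call)      # trial and error: real execution
        reflection = llm(f"Given feedback {feedback!r}, was the call correct?")
        long_term_memory.append((scenario, call, feedback, reflection))
    return long_term_memory                # later used for ICL or fine-tuning
```

The collected trajectories would then serve either as in-context demonstrations or as fine-tuning data, matching the two settings the abstract evaluates.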


Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

arXiv.org Artificial Intelligence

Contemporary language models enable new opportunities for structured reasoning with text, such as the construction and evaluation of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment datasets, and evaluate its impact on LLM-based textual inference. We find that our resulting dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in a modern neuro-symbolic reasoning engine significantly improves results (both accuracy and proof quality) over other entailment classifier baselines, illustrating the practical benefit of this advance for textual inference.
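To show where such a classifier plugs in, here is a minimal sketch of a single proof-step check in a decompositional reasoning engine; `entail_prob` is an assumed interface to an RDTE-trained entailment model, and joining premises with "and" is our own simplification of how a step would be posed to it.

```python
# Sketch of a step-wise entailment check; `entail_prob(premise, hypothesis)`
# is an assumed interface returning the probability that the premise text
# entails the hypothesis, e.g., from a classifier distilled on RDTE.

def verify_step(premises: list[str], conclusion: str, entail_prob,
                threshold: float = 0.5) -> bool:
    """Accept a proof step iff the premises jointly entail the conclusion."""
    premise_text = " and ".join(premises)
    return entail_prob(premise_text, conclusion) >= threshold
```

A neuro-symbolic engine would apply this check at every node of a candidate entailment tree, so both accuracy and proof quality hinge on the classifier's calibration.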


OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?

arXiv.org Artificial Intelligence

The presenter pasted in what he called "about 16 pages' worth of tax code." The seven sentences about Alice, Bob, and Charlie come word-for-word from a handcrafted data set we developed at Johns Hopkins University and published in 2020 for training and measuring AI models that reason over statutory language. Every word, punctuation mark, and number in the taxpayer facts comes exactly from our tax_case_9 -- even the percent sign at the start of the line. The entire livestream is available from OpenAI as the "GPT-4 Developer" demo; the tax law example starts at minute 19:11. Go to the directory "Cases" to find the file tax_case_9.pl, which is written in the programming language Prolog.

Where did the "about 16 pages' worth of tax code" come from? Again, from our 2020 data set, SARA. SARA has two parts: statutes and cases; tax_case_9 is one of the handcrafted cases. The statutes consist of nine sections of the IRC, heavily edited to pare them down and remove ambiguity. If you put all the SARA statutes into a single file, it will be about 16 pages long (depending on the font). From minute 20:07 to 20:40 of the livestream, we see some of the tax sections pasted into GPT-4, and these are SARA's heavily edited versions of the IRC. For example, at 20:23 we see part of section 63(c) with the paragraphs jumping from (3) to (5); in SARA, we edited out (4). At 20:26 we see part of section 63(c)(6) with only subparagraphs (A), (B), and (D); in SARA, we edited out (C). At 20:40 we see parts of section 3306(b) with the paragraphs jumping from (2) to (7); in SARA, we edited out paragraphs (3) through (6). At 20:39 we see sections 3301 and 3306, regarding the federal unemployment tax; while these two sections are irrelevant to Alice and Bob's tax liability in tax_case_9, they are two of the nine SARA statutes. The author Holzenberger did all the handcrafting and hand editing.

One of our edits was paring section 1 down to only sections 1(a) through (d), which contain the Clinton-era tax rates. We cut section 1(j), which contains the reduced Tax Cuts and Jobs Act rates for 2018-2025. We did not, however, edit out the TCJA standard deduction increase; in SARA, the standard deduction for 2018 was $24,000. This editing explains why GPT-4 got the wrong answer on the livestream for Alice and Bob's 2018 taxes.

We empirically verified that using the SARA version of the IRC causes GPT-4 to get the wrong answer. You can download our data set and compare it with the livestream's recording on YouTube. First, we pasted into GPT-4 all nine SARA statutes, plus our facts about Alice, Bob, and Charlie. Then we used the same directions the presenter gave GPT-4: "Now calculate their total liability." GPT-4 gives detailed step-by-step calculations and concludes that "Alice and Bob's total tax liability for 2018 is ..."

This work has been supported by the U.S. National Science Foundation under grant No. 2204926.


Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

arXiv.org Artificial Intelligence

Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even top-performing 13B LLM-based translation models like ALMA do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning for LLMs in the MT task, emphasizing the quality issues present in the reference data despite it being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating translations that are adequate but not perfect. Applying CPO to ALMA models with only 22K parallel sentences, and updating only 12M parameters, yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on the WMT'21, WMT'22, and WMT'23 test datasets.
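For intuition, the sketch below implements a CPO-style objective from per-sequence log-probabilities: a DPO-like contrastive term without a reference model, plus a likelihood term on the preferred translation. The `beta` value and the exact normalization are assumptions, not the paper's precise recipe.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a CPO-style loss. Inputs are the policy model's sequence
# log-probabilities for the preferred ("chosen") and dispreferred ("rejected")
# translations of each source sentence; beta is an assumed hyperparameter.

def cpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Contrastive term: push the model to rank the better translation above
    # the merely adequate one (DPO-like, but with no reference model).
    prefer = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    # Likelihood term: keep the model close to the preferred outputs.
    nll = -logp_chosen.mean()
    return prefer + nll
```

Compared with plain SFT, the contrastive term is what lets the model learn from negative examples, i.e., translations it should specifically avoid.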


Streaming Sequence Transduction through Dynamic Compression

arXiv.org Artificial Intelligence

We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
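The core compression step can be illustrated in a few lines: pool each segment of the input stream into a single anchor vector, shrinking the sequence roughly 12x. This is a toy sketch, not the model; segment boundaries are passed in as a stub here, whereas STAR selects them dynamically, and mean-pooling stands in for its learned anchor representations.

```python
import torch

# Toy illustration of anchor-based stream compression. `frames` is a (T, d)
# tensor of input features; `boundaries` lists segment end indices (a stub
# for STAR's dynamic segmentation) and is assumed non-empty.

def compress(frames: torch.Tensor, boundaries: list[int]) -> torch.Tensor:
    anchors, start = [], 0
    for end in boundaries:
        anchors.append(frames[start:end].mean(dim=0))  # one anchor per segment
        start = end
    return torch.stack(anchors)                        # (num_segments, d)
```

Downstream attention then operates over the short anchor sequence instead of the raw stream, which is where the latency and memory savings come from.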


MultiMUC: Multilingual Template Filling on MUC-4

arXiv.org Artificial Intelligence

We introduce MultiMUC, the first multilingual parallel corpus for template filling, comprising translations of the classic MUC-4 template filling benchmark into five languages: Arabic, Chinese, Farsi, Korean, and Russian. We obtain automatic translations from a strong multilingual machine translation system and manually project the original English annotations into each target language. For all languages, we also provide human translations for sentences in the dev and test splits that contain annotated template arguments. Finally, we present baselines on MultiMUC both with state-of-the-art template filling models and with ChatGPT.
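Annotation projection of this kind typically maps each English argument span through word alignments into the target sentence. The sketch below is a simplified version of that step, not the MultiMUC pipeline; the `(src_idx, tgt_idx)` alignment format is an assumption, and real pipelines also handle unaligned tokens and fragmented spans.

```python
# Simplified annotation projection via word alignments. `span` is a [start,
# end) token range in the English source; `alignments` is an assumed list of
# (src_idx, tgt_idx) token-index pairs from an aligner.

def project_span(span: tuple[int, int], alignments: list[tuple[int, int]]):
    tgt = [t for s, t in alignments if span[0] <= s < span[1]]
    if not tgt:
        return None                 # no aligned tokens: span cannot project
    return (min(tgt), max(tgt) + 1) # smallest target window covering the span
```

Manual projection, as used for the MultiMUC dev and test annotations, corrects exactly the cases where this automatic heuristic fails.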


Reframing Tax Law Entailment as Analogical Reasoning

arXiv.org Artificial Intelligence

Statutory reasoning refers to the application of legislative provisions to a series of case facts described in natural language. We re-frame statutory reasoning as an analogy task, where each instance of the analogy task combines two instances of statutory reasoning. This increases the dataset size by two orders of magnitude and introduces an element of interpretability. We show that this task is roughly as difficult for Natural Language Processing models as the original task. Finally, we return to statutory reasoning, solving it with a combination of a retrieval mechanism and analogy models, and showing some improvement over comparable prior work.
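One plausible way to realize the quadratic blow-up is to pair up statutory-reasoning instances, as sketched below; the field layout and the analogy label (whether two cases resolve the same way) are our own guesses at the construction, not the paper's specification.

```python
from itertools import combinations

# Hypothetical construction of analogy instances from statutory-reasoning
# cases. Each case dict is assumed to hold a statute, its facts, and a
# boolean entailment label; the "analogous" label is our own simplification.

def make_analogy_instances(cases: list[dict]) -> list[dict]:
    pairs = []
    for a, b in combinations(cases, 2):
        pairs.append({
            "source": a,
            "target": b,
            # Analogy label: do the two cases resolve the same way?
            "analogous": a["entailed"] == b["entailed"],
        })
    return pairs
```

With a few hundred source cases, pairing yields tens of thousands of instances, consistent with the claimed two-orders-of-magnitude increase in dataset size.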