Porada, Ian
Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
Porada, Ian, Cheung, Jackie Chi Kit
Challenge sets such as the Winograd Schema Challenge (WSC) are used to benchmark systems' ability to resolve ambiguities in natural language. If one assumes, as in existing work, that solving a given challenge set is at least as difficult as solving some more general task, then high performance on the challenge set should indicate high performance on the general task overall. However, we show empirically that this assumption of difficulty does not always hold. In particular, we demonstrate that despite the strong performance of prompted language models (LMs) on the WSC and its variants, these same modeling techniques perform relatively poorly at resolving certain pronominal ambiguities attested in OntoNotes and related datasets that are perceived to be easier. Motivated by these findings, we propose a method for ensembling a prompted LM with a supervised, task-specific system that is overall more accurate at resolving pronominal coreference across datasets. Finally, we emphasize that datasets involving the same linguistic phenomenon draw on distinct but overlapping capabilities, and that evaluating on any one dataset alone does not provide a complete picture of a system's overall capability.
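As a rough illustration of the ensembling idea described in this abstract, the sketch below linearly mixes normalized antecedent scores from a prompted LM and a supervised coreference scorer. The scorer interfaces, weights, and toy example are hypothetical placeholders under assumed names, not the paper's actual implementation.

```python
from typing import Callable, Dict, List

# A scorer maps (context, pronoun, candidate antecedents) to a score per candidate.
Scorer = Callable[[str, str, List[str]], Dict[str, float]]

def resolve_pronoun(context: str, pronoun: str, candidates: List[str],
                    lm_scorer: Scorer, supervised_scorer: Scorer,
                    lm_weight: float = 0.5) -> str:
    """Pick an antecedent by linearly mixing normalized scores from both systems."""
    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        total = sum(scores.values()) or 1.0
        return {c: s / total for c, s in scores.items()}

    lm_scores = normalize(lm_scorer(context, pronoun, candidates))
    sup_scores = normalize(supervised_scorer(context, pronoun, candidates))
    combined = {c: lm_weight * lm_scores[c] + (1 - lm_weight) * sup_scores[c]
                for c in candidates}
    return max(combined, key=combined.get)

# Toy scorers standing in for a prompted LM and a supervised model.
lm = lambda ctx, pron, cands: {"the trophy": 0.7, "the suitcase": 0.3}
sup = lambda ctx, pron, cands: {"the trophy": 0.4, "the suitcase": 0.6}
print(resolve_pronoun(
    "The trophy doesn't fit in the suitcase because it is too big.",
    "it", ["the trophy", "the suitcase"], lm, sup))  # -> "the trophy"
```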
A Controlled Reevaluation of Coreference Resolution Models
Porada, Ian, Zou, Xiyuan, Cheung, Jackie Chi Kit
All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or to other factors, such as the task-specific architecture, is difficult or impossible to determine due to the lack of a standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models and control for certain design decisions, including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in terms of both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes best to out-of-domain textual genres. We conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported over the past five years.
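For a concrete picture of this kind of controlled setup, the sketch below pairs each CR architecture with the same set of pretrained LMs so that differences in F1 can be attributed to the architecture rather than to the underlying LM. The architecture names, checkpoints, and the `train_and_eval` stub are illustrative placeholders, not the paper's actual experimental code.

```python
from itertools import product

# Illustrative placeholders, not the paper's actual models or checkpoints.
architectures = ["span_based_cr", "word_level_cr", "seq2seq_cr"]
language_models = ["encoder-base", "encoder-large"]

def train_and_eval(arch: str, lm: str) -> float:
    """Stub: train CR architecture `arch` on top of `lm` and return test F1."""
    return 0.0  # replaced by an actual training and evaluation run

# Full grid: every architecture is trained with every LM, so the LM is held
# fixed within each column of the comparison.
results = {(arch, lm): train_and_eval(arch, lm)
           for arch, lm in product(architectures, language_models)}

for lm in language_models:
    ranking = sorted(architectures, key=lambda a: results[(a, lm)], reverse=True)
    print(f"{lm}: {ranking}")
```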
Investigating Failures to Generalize for Coreference Resolution Models
Porada, Ian, Olteanu, Alexandra, Suleman, Kaheer, Trischler, Adam, Cheung, Jackie Chi Kit
Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized (i.e., how the theoretical concept of coreference is operationalized in the dataset) due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with these differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models, and future work developing models can explicitly account for those types of coreference that are empirically associated with poor generalization.
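The kind of per-phenomenon breakdown described in this abstract can be sketched as follows; the data format, category labels, and example links are purely illustrative, not the paper's annotation scheme.

```python
from collections import defaultdict

# Gold coreference links tagged with an illustrative phenomenon category.
gold_links = [
    (("people", "they"), "generic_mention"),
    (("steel", "the metal"), "compound_modifier"),
    (("Ada Lovelace", "a mathematician"), "copula_predicate"),
]
# Links recovered by a hypothetical model.
predicted_pairs = {("people", "they"), ("Ada Lovelace", "a mathematician")}

recall_by_category = defaultdict(lambda: [0, 0])  # category -> [recovered, total]
for pair, category in gold_links:
    recall_by_category[category][1] += 1
    recall_by_category[category][0] += pair in predicted_pairs

for category, (recovered, total) in recall_by_category.items():
    print(f"{category}: recall {recovered}/{total}")
```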
Faster and More Accurate Learning with Meta Trace Adaptation
Zhao, Mingde, Porada, Ian
How to assemble compound targets is an open problem for achieving good learning performance in reinforcement learning (RL). TD(λ), which weights the n-step returns with a geometric sequence controlled by a single parameter, stands out among compound update methods for its efficient incremental updates and its interesting mathematical properties. Empirical studies show that different values of λ yield different performance, and it is expected that adapting λ appropriately during learning boosts performance in terms of both convergence speed and accuracy. The goal of this paper is to find a method that optimizes the overall target error for all states. We first derive a new per-state meta-objective for optimizing the bias-variance tradeoff and show that the meta-objective proposed in existing work [1] is a special case of the newly proposed objective. We then propose a trust-region-style method to tackle the difficulties of optimizing the meta-objective and prove its equivalence to optimizing the overall target error under appropriate assumptions. In experiments, the proposed method, MTA, generally achieves significantly better empirical performance than the existing method and baselines.
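For reference, the geometric weighting of n-step returns that TD(λ) uses is the standard λ-return; the formulation below follows common textbook notation rather than this paper's specific derivation.

```latex
% The TD(\lambda) compound target: a geometric mixture of n-step returns,
% where G_{t:t+n} is the n-step return bootstrapped from the value estimate \hat{v}.
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n},
\qquad
G_{t:t+n} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} \hat{v}(S_{t+n}).
```

Adapting λ amounts to choosing how quickly this mixture decays from long, low-bias returns toward the short, low-variance one-step return.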