
Collaborating Authors

 Schneider, Nathan


Are UD Treebanks Getting More Consistent? A Report Card for English UD

arXiv.org Artificial Intelligence

Recent Universal Dependencies (UD) releases encompass not only over 100 languages, but also over 200 treebanks, meaning several languages now have multiple treebanks with rich morphosyntactic and other annotations. Multiple treebanks are especially common for high-resource languages such as English, which currently has data in 9 different repositories, totaling over 762,000 tokens (as of UD v2.11). While this abundance of resources is of course positive, it opens questions about consistency across multiple UD treebanks of the same language. We therefore consider it timely to ask whether even the largest, most actively developed UD treebanks for English are actually compatible; if not, to what extent, and are they inching closer together or drifting apart from version to version? Regardless of the answer to these questions, is it a good idea to train jointly on EWT and GUM, and if so, given constant revisions to the data, since what UD version?
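To make the joint-training question concrete, the sketch below pools the EWT and GUM training splits into a single CoNLL-U file. It is an illustration only, not the paper's setup: the local directory paths are assumptions, and it uses the `conllu` Python package (`pip install conllu`).

```python
# Minimal sketch: pool the EWT and GUM training splits into one CoNLL-U
# training set, assuming local copies of the UD treebank repositories.
from conllu import parse_incr

# Hypothetical local paths; adjust to wherever the UD release is unpacked.
TRAIN_FILES = [
    "UD_English-EWT/en_ewt-ud-train.conllu",
    "UD_English-GUM/en_gum-ud-train.conllu",
]

def load_sentences(path):
    """Yield parsed sentences (TokenLists) from one CoNLL-U file."""
    with open(path, encoding="utf-8") as f:
        yield from parse_incr(f)

combined = []
for path in TRAIN_FILES:
    sentences = list(load_sentences(path))
    tokens = sum(len(sent) for sent in sentences)
    print(f"{path}: {len(sentences)} sentences, {tokens} tokens")
    combined.extend(sentences)

# The pooled data can be serialized back out or fed to a parser's training loop.
with open("en_ewt_gum-ud-train.conllu", "w", encoding="utf-8") as out:
    for sent in combined:
        out.write(sent.serialize())
```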


Sentence-level Feedback Generation for English Language Learners: Does Data Augmentation Help?

arXiv.org Artificial Intelligence

In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how they affect the performance of our system. We present our results for the task along with extensive analysis of the generated comments, with the aim of aiding future studies in feedback comment generation for English language learners.
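As a rough illustration of the task input, the sketch below turns a sentence plus an error span into an LLM prompt. The prompt wording and the `generate_comment` stub are assumptions for illustration, not the paper's actual prompts or models.

```python
# Illustrative sketch only: building an LLM prompt from a learner sentence
# and the character offsets of an error span, as the task provides them.

def build_prompt(sentence: str, start: int, end: int) -> str:
    """Mark the error span inside the sentence and ask for a feedback comment."""
    marked = sentence[:start] + "<<" + sentence[start:end] + ">>" + sentence[end:]
    return (
        "The following learner sentence contains an error between << and >>.\n"
        f"Sentence: {marked}\n"
        "Write a short feedback comment explaining the error to the learner."
    )

def generate_comment(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is being evaluated."""
    raise NotImplementedError("plug in an LLM client here")

if __name__ == "__main__":
    sent = "I am looking forward to hear from you."
    span = "to hear"  # hypothetical error span for this example
    start = sent.index(span)
    print(build_prompt(sent, start, start + len(span)))
```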


Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation

arXiv.org Artificial Intelligence

Identifying semantically equivalent sentences is important for many cross-lingual and monolingual NLP tasks. Current approaches to semantic equivalence take a loose, sentence-level approach to "equivalence," despite previous evidence that fine-grained differences and implicit content have an effect on human understanding (Roth and Anthonio, 2021) and system performance (Briakou and Carpuat, 2021). In this work, we introduce a novel, more sensitive method of characterizing semantic equivalence that leverages Abstract Meaning Representation graph structures. We develop an approach that can be used with either gold or automatic AMR annotations, and demonstrate that our solution is in fact finer-grained than existing corpus filtering methods and more accurate at predicting strictly equivalent sentences than existing semantic similarity metrics. We suggest that our finer-grained measure of semantic equivalence could limit the workload in the task of human post-edited machine translation and in human evaluation of sentence similarity.
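For a sense of what "working at the level of AMR graph structure" looks like, the sketch below compares two sentences' AMR graphs by the overlap of their triples, using the `penman` library (`pip install penman`). This is a deliberately crude proxy, not the paper's metric: variables are collapsed to their concepts rather than aligned as in Smatch, and the example AMR strings are toy inputs.

```python
# Rough sketch: compare two AMR graphs via Jaccard overlap of concept-level triples.
import penman

AMR_A = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
AMR_B = "(w / want-01 :ARG0 (b / boy) :ARG1 (l / leave-11 :ARG0 b))"

def concept_triples(amr_string):
    """Decode an AMR and return its triples with variables replaced by concepts."""
    graph = penman.decode(amr_string)
    concepts = {var: concept for var, _, concept in graph.instances()}
    triples = set()
    for source, role, target in graph.triples:
        triples.add((concepts.get(source, source), role, concepts.get(target, target)))
    return triples

a, b = concept_triples(AMR_A), concept_triples(AMR_B)
overlap = len(a & b) / len(a | b)  # Jaccard over concept-level triples
print(f"shared triples: {sorted(a & b)}")
print(f"triple overlap: {overlap:.2f}")
```

Triple-level comparison of this kind exposes exactly which pieces of meaning two sentences share and where they diverge, which is the granularity a sentence-level similarity score cannot provide.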


Making Heads and Tails of Models with Marginal Calibration for Sparse Tagsets

arXiv.org Artificial Intelligence

For interpreting the behavior of a probabilistic model, it is useful to measure a model's calibration, i.e., the extent to which it produces reliable confidence scores. We address the open problem of calibration for tagging models with sparse tagsets, and recommend strategies to measure and reduce calibration error (CE) in such models. We show that several post-hoc recalibration techniques all reduce calibration error across the marginal distribution for two existing sequence taggers. Moreover, we propose tag frequency grouping (TFG) as a way to measure calibration error in different frequency bands. Further, recalibrating each group separately promotes a more equitable reduction of calibration error across the tag frequency spectrum.
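The sketch below illustrates the idea behind tag frequency grouping: partition tags into frequency bands by their training-set counts and report a binned expected calibration error (ECE) within each band. The band boundaries, bin count, and toy data are assumptions for illustration, and the simple confidence-vs-accuracy ECE here is a simplification, not the paper's exact measure.

```python
# Simplified sketch of tag frequency grouping (TFG): compute calibration error
# separately for rare, medium, and frequent tags.
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Binned expected calibration error over (confidence, correctness) pairs."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            err += mask.sum() / total * abs(correct[mask].mean() - confidences[mask].mean())
    return err

def frequency_groups(tag_counts, boundaries=(100, 1000)):
    """Map each tag to a frequency band: 0 = rare, 1 = medium, 2 = frequent."""
    return {tag: sum(count >= b for b in boundaries) for tag, count in tag_counts.items()}

# Toy inputs: training-set tag counts and per-token predictions on held-out data.
tag_counts = {"NOUN": 5000, "VERB": 3000, "INTJ": 40}
groups = frequency_groups(tag_counts)
pred_tags = ["NOUN", "VERB", "INTJ", "NOUN", "INTJ"]
gold_tags = ["NOUN", "NOUN", "INTJ", "NOUN", "NOUN"]
confs = [0.95, 0.80, 0.60, 0.90, 0.70]

for band in sorted(set(groups.values())):
    idx = [i for i, t in enumerate(pred_tags) if groups[t] == band]
    if idx:
        band_ece = ece([confs[i] for i in idx],
                       [pred_tags[i] == gold_tags[i] for i in idx])
        print(f"frequency band {band}: ECE over {len(idx)} tokens = {band_ece:.3f}")
```

Reporting error per band rather than only in aggregate makes it visible when a tagger is well calibrated on frequent tags but poorly calibrated on the long tail.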