The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

Bourgois, Antoine, Poibeau, Thierry

arXiv.org Artificial Intelligence

While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness for inferring the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
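The abstract does not specify how gender is inferred from the resolved chains; a minimal sketch of one plausible approach is to count gendered French subject pronouns among a character's mentions. The pronoun sets, the sample chain, and the majority-vote rule are all illustrative assumptions, not the authors' method:

```python
# Infer a fictional character's likely grammatical gender from the
# pronouns in its coreference chain (hypothetical illustration).
FEMININE = {"elle", "elles"}
MASCULINE = {"il", "ils"}

def infer_gender(chain):
    """chain: list of lowercased mention strings for one entity."""
    fem = sum(1 for m in chain if m in FEMININE)
    masc = sum(1 for m in chain if m in MASCULINE)
    if fem > masc:
        return "feminine"
    if masc > fem:
        return "masculine"
    return "unknown"

print(infer_gender(["emma", "elle", "elle", "la jeune femme"]))  # feminine
```

In practice one would also exploit gendered nominal mentions ("la jeune femme") and agreement morphology, but a pronoun majority vote already conveys the idea.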


Coreference as an indicator of context scope in multimodal narrative

Ilinykh, Nikolai, Lappin, Shalom, Sayeed, Asad, Loáiciga, Sharid

arXiv.org Artificial Intelligence

We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality.


Data-driven Coreference-based Ontology Building

Ashury-Tahan, Shir, Cohen, Amir David Nissan, Cohen, Nadav, Louzoun, Yoram, Goldberg, Yoav

arXiv.org Artificial Intelligence

While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a Creative Commons license, along with the code.
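The chain-to-graph construction step can be sketched in plain Python: two phrases are connected if they co-occur in a chain, with edge weight equal to the number of chains in which they co-occur. The sample chains are invented, and the later betweenness-centrality analysis described in the abstract is omitted here:

```python
from collections import Counter
from itertools import combinations

def chains_to_graph(chains):
    """Build a weighted co-occurrence graph over phrases.

    chains: iterable of coreference chains, each a list of phrase strings.
    Returns a Counter mapping sorted phrase pairs to co-occurrence counts.
    """
    edges = Counter()
    for chain in chains:
        # Deduplicate phrases within a chain, then connect every pair.
        for a, b in combinations(sorted(set(chain)), 2):
            edges[(a, b)] += 1
    return edges

chains = [
    ["aspirin", "the drug", "it"],
    ["ibuprofen", "the drug"],
    ["aspirin", "the drug"],
]
graph = chains_to_graph(chains)
print(graph[("aspirin", "the drug")])  # 2: co-occur in two chains
```

On the resulting graph, a library such as NetworkX could then compute betweenness centrality to separate hierarchy edges from identity and noise edges, as the paper describes.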


Generating Visual Stories with Grounded and Coreferent Characters

Liu, Danyang, Lapata, Mirella, Keller, Frank

arXiv.org Artificial Intelligence

Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story's themes. Existing visual storytelling methods focus on the plot and the events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce the new task of character-centric story generation and present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to a larger extent than baselines and state-of-the-art systems.


Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD

Pražák, Ondřej, Konopík, Miloslav

arXiv.org Artificial Intelligence

Coreference resolution is the task of identifying language expressions that refer to the same real-world entity (antecedent) within a text. These coreferential expressions can sometimes appear within a single sentence, but often they are spread across multiple sentences. In some challenging cases, it is necessary to consider the entire document to determine whether two expressions refer to the same entity. The task can be divided into two main subtasks: identifying entity mentions and grouping these mentions based on the real-world entities they refer to. Coreference resolution is closely related to anaphora resolution, as discussed in [2]. Historically, coreference resolution was a standard preprocessing step in various natural language processing (NLP) tasks, such as machine translation, summarization, and information extraction. Although recent large language models have achieved state-of-the-art results in coreference resolution, they are expensive to train and deploy, and traditional (discriminative) approaches remain competitive. Expressing this task in natural language is challenging, and to the best of our knowledge, there have been no successful attempts to utilize large chatbots (like ChatGPT-4) to achieve superior results. Coreference resolution becomes particularly challenging in low-resource languages. One strategy to address this challenge is to train a multilingual model on datasets from multiple languages, thereby transferring knowledge from resource-rich languages to those with fewer resources.
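The two subtasks named above, mention identification and mention grouping, can be illustrated with a deliberately naive toy pipeline. The regex-based mention detector and the manual pronoun map are oversimplifications standing in for real models:

```python
import re

def find_mentions(text):
    # Subtask 1: mention identification. A toy heuristic: capitalized
    # tokens plus a few pronouns (real systems use trained span detectors).
    return re.findall(r"\b(?:[A-Z][a-z]+|she|he|it)\b", text)

def cluster_mentions(mentions, pronoun_links):
    # Subtask 2: grouping mentions by entity. Exact string match, plus a
    # hand-written map that stands in for actual pronoun resolution.
    clusters = {}
    for m in mentions:
        key = pronoun_links.get(m, m)
        clusters.setdefault(key, []).append(m)
    return clusters

text = "Marie opened the door. Marie smiled, and she waved."
mentions = find_mentions(text)
clusters = cluster_mentions(mentions, {"she": "Marie"})
print(clusters["Marie"])  # ['Marie', 'Marie', 'she']
```

The hard part of the task lies precisely in replacing the hand-written `pronoun_links` map with a model that decides, from context, which entity each mention refers to.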


How to Evaluate Coreference in Literary Texts?

Duron-Tejedor, Ana-Isabel, Amsili, Pascal, Poibeau, Thierry

arXiv.org Artificial Intelligence

In this short paper, we examine the main metrics used to evaluate textual coreference and we detail some of their limitations. We show that a unique score cannot represent the full complexity of the problem at stake, and is thus uninformative, or even misleading. We propose a new way of evaluating coreference, taking into account the context (in our case, the analysis of fiction, especially novels). More specifically, we propose to distinguish long coreference chains (corresponding to main characters) from short ones (corresponding to secondary characters) and singletons (isolated elements). This way, we hope to get more interpretable and thus more informative results through evaluation.
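The proposed three-way partition can be sketched as follows. The length threshold and the per-partition metric (simple mention recall here) are assumptions for illustration; the paper's actual metrics may differ:

```python
def partition_chains(chains, long_threshold=10):
    """Split gold chains into long chains, short chains, and singletons."""
    longs = [c for c in chains if len(c) >= long_threshold]
    shorts = [c for c in chains if 1 < len(c) < long_threshold]
    singletons = [c for c in chains if len(c) == 1]
    return longs, shorts, singletons

def mention_recall(gold_chains, predicted_mentions):
    """Fraction of gold mentions recovered by the system, per partition."""
    gold = [m for chain in gold_chains for m in chain]
    if not gold:
        return None
    found = sum(1 for m in gold if m in predicted_mentions)
    return found / len(gold)

gold = [["Emma"] * 12, ["the maid", "she"], ["a passerby"]]
longs, shorts, singletons = partition_chains(gold)
pred = {"Emma", "she"}
print(mention_recall(longs, pred))   # 1.0: main character fully covered
print(mention_recall(shorts, pred))  # 0.5: secondary character half covered
```

Reporting one score per partition, rather than a single aggregate, makes it visible when a system handles main characters well but drops secondary characters or singletons.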


Towards Harmful Erotic Content Detection through Coreference-Driven Contextual Analysis

Okulska, Inez, Wiśnios, Emilia

arXiv.org Artificial Intelligence

Adult content detection still poses a great challenge for automation. Existing classifiers primarily focus on distinguishing between erotic and non-erotic texts. However, they often lack the nuance needed to assess potential harm. Unfortunately, content of this nature falls beyond the reach of generative models due to its potentially harmful nature. Ethical restrictions prohibit large language models (LLMs) from analyzing and classifying harmful erotics, let alone generating them to create synthetic datasets for other neural models. In such instances where data is scarce and challenging, a thorough analysis of the structure of such texts, rather than a large model, may offer a viable solution. This is especially true given that harmful erotic narratives, despite appearing similar to harmless ones, usually reveal their harmful nature first through contextual information hidden in the non-sexual parts of the narrative. This paper introduces a hybrid neural and rule-based context-aware system that leverages coreference resolution to identify harmful contextual cues in erotic content. Collaborating with professional moderators, we compiled a dataset and developed a classifier capable of distinguishing harmful from non-harmful erotic content. Our hybrid model, tested on Polish text, demonstrates a promising accuracy of 84% and a recall of 80%. Models based on RoBERTa and Longformer without explicit usage of coreference chains achieved significantly weaker results, underscoring the importance of coreference resolution in detecting such nuanced content as harmful erotics. This approach also offers the potential for enhanced visual explainability, supporting moderators in evaluating predictions and taking necessary actions to address harmful content.


DialogRE^C+: An Extension of DialogRE to Investigate How Much Coreference Helps Relation Extraction in Dialogs

Xiong, Yiyun, Dai, Mengwei, Li, Fei, Fei, Hao, Li, Bobo, Wu, Shengqiong, Ji, Donghong, Teng, Chong

arXiv.org Artificial Intelligence

Dialogue relation extraction (DRE), which identifies the relations between argument pairs in dialogue text, suffers from the frequent occurrence of personal pronouns, i.e., entity and speaker coreference. This work introduces a new benchmark dataset, DialogRE^C+, introducing coreference resolution into the DRE scenario. With the aid of high-quality coreference knowledge, the reasoning of argument relations is expected to be enhanced. In the DialogRE^C+ dataset, we manually annotate a total of 5,068 coreference chains over 36,369 argument mentions based on the existing DialogRE data, where four different coreference chain types, namely speaker, person, location and organization chains, are explicitly marked. We further develop 4 coreference-enhanced graph-based DRE models, which learn effective coreference representations for improving the DRE task. We also train a coreference resolution model based on our annotations and evaluate the effect of automatically extracted coreference chains, demonstrating the practicality of our dataset and its potential for other domains and tasks.


BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations

Rohan, Shadman, Hossain, Mojammel, Rashid, Mohammad Mamun Or, Mohammed, Nabeel

arXiv.org Artificial Intelligence

Coreference resolution is a well-studied problem in NLP. While widely studied for English and other resource-rich languages, coreference resolution in Bengali remains largely unexplored due to the absence of relevant datasets. Bengali, being a low-resource language, exhibits greater morphological richness compared to English. In this article, we introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains. This relatively small dataset contains 5,200 mention annotations forming 502 mention clusters within 48,569 tokens. We describe the process of creating this dataset and report the performance of multiple models trained using BenCoref. We expect that our work provides some valuable insights on the variations in coreference phenomena across several domains in Bengali and encourages the development of additional resources for Bengali. Furthermore, we found poor cross-lingual performance in the zero-shot setting from English, highlighting the need for more language-specific resources for this task.


Parallel Data Helps Neural Entity Coreference Resolution

Tang, Gongbo, Hardmeier, Christian

arXiv.org Artificial Intelligence

Coreference resolution is the task of finding expressions that refer to the same entity in a text. Coreference models are generally trained on monolingual annotated data, but annotating coreference is expensive and challenging. Hardmeier et al. (2013) have shown that parallel data contains latent anaphoric knowledge, but this knowledge has not yet been exploited in end-to-end neural models. In this paper, we propose a simple yet effective model to exploit coreference knowledge from parallel data. In addition to the conventional modules learning coreference from annotations, we introduce an unsupervised module to capture cross-lingual coreference knowledge. Our proposed cross-lingual model achieves consistent improvements, up to 1.74 percentage points, on the OntoNotes 5.0 English dataset using 9 different synthetic parallel datasets. These experimental results confirm that parallel data can provide additional coreference knowledge which is beneficial to coreference resolution tasks.