

92650b2e92217715fe312e6fa7b90d82-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their thoughtful feedback and helpful suggestions. We address specific points below. Dwork (2012) defines an algorithm to be fair if it gives similar predictions to similar individuals. The formalization of this definition was extended into Counterfactual Fairness (Kusner, 2017). [...] XLNet, which are consistent with the results from GPT-2.


Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

Zhuang, Jiaxi, Li, Kangning, Hou, Jue, Xu, Mingjun, Gao, Zhifeng, Cai, Hengxing

arXiv.org Artificial Intelligence

Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding the end-to-end GPT-4o baseline by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app. The code and data are provided in the supplementary materials.


C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Ma, Chengqian, Tao, Wei, Guo, Yiwen

arXiv.org Artificial Intelligence

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context dependency, including omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.




EventFull: Complete and Consistent Event Relation Annotation

Eirew, Alon, Nachshoni, Eviatar, Slobodkin, Aviv, Dagan, Ido

arXiv.org Artificial Intelligence

Identifying the semantic relations between events mentioned in a text, notably temporal, causal and coreference relations, has been a fundamental goal in NLP. Substantial efforts have been devoted to developing various datasets that capture some or all of these relations (O'Gorman et al., 2016; Hong et al., 2016; Wang et al., 2022). These datasets were then leveraged to develop and to evaluate corresponding models for detecting event-event relations (Hu et al., 2023; Guan et al., 2024). The output of such models has been utilized in a range of downstream applications, with recent examples including event forecasting (Ma et al., 2023), misinformation detection (Lei and Huang, 2023), and treatment timeline extraction (Yao et al., 2024), among others. MEANTIME (Minard et al., 2016) and EventStoryLine (Caselli and Vossen, 2017) restrict event pairs to a span of two consecutive sentences. This limitation inherently prevents testing and training models on longer-range relations. Other datasets, such as TimeBank (Pustejovsky et al., 2003b) and MAVEN-ERE (Wang et al., 2022), did not publish a systematic annotation execution protocol that guarantees actual complete annotation, and were subsequently criticized for being incomplete in their relation annotation (Pustejovsky and Stubbs, 2011; Rogers et al., 2024). Further, some researchers aimed to avoid the cost of manual annotation altogether and employed fully- or partly-automatic dataset creation methods (Mirza et al., ...


Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution

Porada, Ian, Cheung, Jackie Chi Kit

arXiv.org Artificial Intelligence

Challenge sets such as the Winograd Schema Challenge (WSC) are used to benchmark systems' ability to resolve ambiguities in natural language. If one assumes as in existing work that solving a given challenge set is at least as difficult as solving some more general task, then high performance on the challenge set should indicate high performance on the general task overall. However, we show empirically that this assumption of difficulty does not always hold. In particular, we demonstrate that despite the strong performance of prompted language models (LMs) on the WSC and its variants, these same modeling techniques perform relatively poorly at resolving certain pronominal ambiguities attested in OntoNotes and related datasets that are perceived to be easier. Motivated by these findings, we propose a method for ensembling a prompted LM with a supervised, task-specific system that is overall more accurate at resolving pronominal coreference across datasets. Finally, we emphasize that datasets involving the same linguistic phenomenon draw on distinct, but overlapping, capabilities, and evaluating on any one dataset alone does not provide a complete picture of a system's overall capability.
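The abstract describes ensembling a prompted LM with a supervised coreference system only at a high level; one simple policy such an ensemble could use is confidence-based routing. The sketch below is purely illustrative: the two resolver functions and the threshold are hypothetical stand-ins, not the authors' actual components.

```python
# Hypothetical confidence-routing ensemble for pronominal coreference.
# Both resolvers are toy stand-ins for a supervised system and a prompted LM.

def supervised_resolver(pronoun, candidates):
    """Stand-in for a task-specific system: returns (antecedent, confidence)."""
    return candidates[0], 0.9  # toy behavior: always picks the first candidate

def prompted_lm_resolver(pronoun, candidates):
    """Stand-in for a prompted LM: returns its preferred antecedent."""
    return candidates[-1]  # toy behavior: always picks the last candidate

def ensemble(pronoun, candidates, threshold=0.8):
    """Trust the supervised system when it is confident; otherwise defer
    to the prompted LM (one simple ensembling policy among many)."""
    antecedent, confidence = supervised_resolver(pronoun, candidates)
    if confidence >= threshold:
        return antecedent
    return prompted_lm_resolver(pronoun, candidates)

print(ensemble("it", ["the trophy", "the suitcase"]))  # → "the trophy"
```

With a stricter threshold than the supervised system's confidence, the same call routes to the LM instead, which is the mechanism that lets each component handle the ambiguities it resolves best.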


Unifying the Scope of Bridging Anaphora Types in English: Bridging Annotations in ARRAU and GUM

Levine, Lauren, Zeldes, Amir

arXiv.org Artificial Intelligence

Comparing bridging annotations across coreference resources is difficult, largely due to a lack of standardization across definitions and annotation schemas and narrow coverage of disparate text domains across resources. To alleviate domain coverage issues and consolidate schemas, we compare guidelines and use interpretable predictive models to examine the bridging instances annotated in the GUM, GENTLE and ARRAU corpora. Examining these cases, we find that there is a large difference in types of phenomena annotated as bridging. Beyond theoretical results, we release a harmonized, subcategorized version of the test sets of GUM, GENTLE and the ARRAU Wall Street Journal data to promote meaningful and reliable evaluation of bridging resolution across domains.


Inferring Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning

Forer, Lior, Hope, Tom

arXiv.org Artificial Intelligence

We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. LLMs can struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings we achieve large gains in performance. We provide analysis of generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific concepts.
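The re-ranking idea above follows a standard retrieve-then-re-rank pattern: score all candidate pairs with a cheap function, then apply the expensive definition-based comparison only to a shortlist. The sketch below is a minimal illustration of that pattern under toy scoring functions; the real system's retrievers and LLM judgments are of course far richer.

```python
# Retrieve-then-re-rank sketch for cross-document concept linking.
# Both scorers are toy stand-ins: cheap_score for a fast similarity
# (e.g. embedding cosine), expensive_score for an LLM comparing
# definition-augmented mentions.

def cheap_score(a, b):
    """Fast proxy similarity: count of shared words."""
    return len(set(a.split()) & set(b.split()))

def expensive_score(a, b):
    """Stand-in for a costly relational judgment over two mentions."""
    return cheap_score(a, b) + (a[0] == b[0])

def rerank(query, corpus, k=2):
    """Score all n candidates cheaply, keep the top-k, then re-rank only
    those with the expensive model: O(n) cheap calls, O(k) expensive ones."""
    shortlist = sorted(corpus, key=lambda c: cheap_score(query, c), reverse=True)[:k]
    return max(shortlist, key=lambda c: expensive_score(query, c))

corpus = ["graph neural network", "neural coreference model", "random forest"]
print(rerank("neural network pruning", corpus))  # → "graph neural network"
```

Capping the expensive calls at k per query is what avoids the combinatorial explosion of comparing every mention against every other mention across papers.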


The Self-Contained Negation Test Set

Kletz, David, Amsili, Pascal, Candito, Marie

arXiv.org Artificial Intelligence

Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs' predictions as a function of the polarity of inputs, in English. Crucially, this test uses "self-contained" inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating the experiments of Gubelmann and Handschuh (2022), we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.
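The core metric of such a minimal-pair test can be stated compactly: over pairs differing only in negation, count how often the token forbidden by the negated context is still the model's top-1 prediction. The sketch below captures that evaluation logic; the sentences and probability tables are toy stand-ins for real masked-LM outputs (e.g. from roberta-large), not data from the paper.

```python
# Minimal-pair evaluation logic for a self-contained negation test.
# Each pair holds the model's token distributions for an affirmative
# context and its negated twin, plus the target token that the negated
# context semantically forbids.

def top1(dist):
    """Return the highest-probability token in a {token: prob} dict."""
    return max(dist, key=dist.get)

def score_pairs(pairs):
    """Fraction of negated contexts whose top-1 prediction is still the
    semantically forbidden target token (lower is better)."""
    violations = sum(1 for aff, neg, target in pairs if top1(neg) == target)
    return violations / len(pairs)

pairs = [
    # "A robin is a [MASK]." vs "A robin is not a [MASK]."
    ({"bird": 0.7, "fish": 0.1}, {"bird": 0.6, "fish": 0.2}, "bird"),
    # "Water is a [MASK]." vs "Water is not a [MASK]."
    ({"liquid": 0.8, "solid": 0.1}, {"solid": 0.5, "liquid": 0.3}, "liquid"),
]
print(score_pairs(pairs))  # → 0.5: one of two negated contexts ignores the negation
```

A model that treats negation properly should drive this violation rate toward zero, which is exactly the expectation the tested BERT and RoBERTa variants largely fail to meet.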


Are LLMs Good Annotators for Discourse-level Event Relation Extraction?

Wei, Kangda, Gautam, Aayush, Huang, Ruihong

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated proficiency in a wide array of natural language processing tasks. However, their effectiveness on discourse-level event relation extraction (ERE) tasks remains unexplored. In this paper, we assess the effectiveness of LLMs in addressing discourse-level ERE tasks characterized by lengthy documents and intricate relations encompassing coreference, temporal, causal, and subevent types. Evaluation is conducted using a commercial model, GPT-3.5, and an open-source model, LLaMA-2. Our study reveals a notable underperformance of LLMs compared to the baseline established through supervised learning. Although Supervised Fine-Tuning (SFT) can improve LLMs' performance, it does not scale well compared to the smaller supervised baseline model. Our quantitative and qualitative analysis shows that LLMs have several weaknesses when applied to extracting event relations, including a tendency to fabricate event mentions, and failures to capture transitivity rules among relations, detect long-distance relations, or comprehend contexts with dense event mentions.
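The transitivity failures mentioned above have a simple formal shape: for temporal relations, if A is BEFORE B and B is BEFORE C, a consistent annotation must also label A BEFORE C. A minimal sketch of checking a model's output against this constraint, with toy event names rather than the paper's data:

```python
# Toy check of the temporal transitivity constraint that the analysis
# finds LLM annotators often violate.

def transitive_closure(before_pairs):
    """All BEFORE relations implied by transitivity of the given pairs."""
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def transitivity_violations(predicted, anchor_pairs):
    """BEFORE pairs implied by transitivity but missing from the output."""
    return transitive_closure(anchor_pairs) - set(predicted)

anchors = {("earthquake", "rescue"), ("rescue", "rebuild")}
predicted = {("earthquake", "rescue"), ("rescue", "rebuild")}  # implied pair missing
print(transitivity_violations(predicted, anchors))  # → {('earthquake', 'rebuild')}
```

Counting such missing implied pairs gives a direct, model-agnostic measure of how consistently an annotator, human or LLM, applies the relation logic across a document.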