AITopics | Information Extraction

FSUIE: A Novel Fuzzy Span Mechanism for Universal Information Extraction

Peng, Tianshuo, Li, Zuchao, Zhang, Lefei, Du, Bo, Zhao, Hai

arXiv.org Artificial Intelligence Jun-19-2023

Universal Information Extraction (UIE) has been introduced as a unified framework for various Information Extraction (IE) tasks and has achieved widespread success. Despite this, UIE models have limitations. For example, they rely heavily on span boundaries in the data during training, which does not reflect the reality of span annotation challenges. Slight adjustments to positions can also meet requirements. Additionally, UIE models lack attention to the limited span length feature in IE. To address these deficiencies, we propose the Fuzzy Span Universal Information Extraction (FSUIE) framework. Specifically, our contribution consists of two concepts: fuzzy span loss and fuzzy span attention. Our experimental results on a series of main IE tasks show significant improvement compared to the baseline, especially in terms of fast convergence and strong performance with small amounts of data and training epochs. These results demonstrate the effectiveness and generalization of FSUIE in different tasks, settings, and scenarios.

artificial intelligence, computational linguistic, natural language, (14 more...)

arXiv.org Artificial Intelligence

2306.14913

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Asia > China > Hubei Province > Wuhan (0.04)
(12 more...)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Leveraging ChatGPT As Text Annotation Tool For Sentiment Analysis

Belal, Mohammad, She, James, Wong, Simon

arXiv.org Artificial Intelligence Jun-18-2023

Sentiment analysis is a well-known natural language processing task that involves identifying the emotional tone or polarity of a given piece of text. With the growth of social media and other online platforms, sentiment analysis has become increasingly crucial for businesses and organizations seeking to monitor and comprehend customer feedback as well as opinions. Supervised learning algorithms have been popularly employed for this task, but they require human-annotated text to create the classifier. To overcome this challenge, lexicon-based tools have been used. A drawback of lexicon-based algorithms is their reliance on pre-defined sentiment lexicons, which may not capture the full range of sentiments in natural language. ChatGPT is a new product of OpenAI and has emerged as the most popular AI product. It can answer questions on various topics and tasks. This study explores the use of ChatGPT as a tool for data labeling for different sentiment analysis tasks. It is evaluated on two distinct sentiment analysis datasets with varying purposes. The results demonstrate that ChatGPT outperforms other lexicon-based unsupervised methods with significant improvements in overall accuracy. Specifically, compared to the best-performing lexicon-based algorithms, ChatGPT achieves a remarkable increase in accuracy of 20% for the tweets dataset and approximately 25% for the Amazon reviews dataset. These findings highlight the exceptional performance of ChatGPT in sentiment analysis tasks, surpassing existing lexicon-based approaches by a significant margin. The evidence suggests it can be used for annotation on different sentiment analysis events and tasks.
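
The lexicon limitation described in this abstract can be illustrated with a minimal sketch. The tiny `LEXICON`, the `lexicon_label` function, and the `accuracy` helper below are hypothetical and are not taken from the paper; they only show how a pre-defined word list falls back to "neutral" on wording it does not cover, and how overall accuracy against human labels would be computed.

```python
# Toy sentiment lexicon (an assumption for illustration, not a real resource).
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}


def lexicon_label(text: str) -> str:
    """Label text by summing the polarity of known words; unknown words count as 0."""
    tokens = (tok.strip(".,!?") for tok in text.lower().split())
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the human-annotated labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


texts = ["I love this phone!", "Terrible battery.", "This movie slaps"]
gold = ["positive", "negative", "positive"]
predicted = [lexicon_label(t) for t in texts]
```

Here the third text is labeled "neutral" because "slaps" is not in the toy lexicon, which is the kind of coverage gap the abstract attributes to lexicon-based tools.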

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.17177

Country:

Asia > Middle East > Qatar (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.70)

Industry: Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
(2 more...)

Universal Information Extraction with Meta-Pretrained Self-Retrieval

Cong, Xin, Yu, Bowen, Fang, Mengcheng, Liu, Tingwen, Yu, Haiyang, Hu, Zhongkai, Huang, Fei, Li, Yongbin, Wang, Bin

arXiv.org Artificial Intelligence Jun-17-2023

Universal Information Extraction (Universal IE) aims to solve different extraction tasks in a uniform text-to-structure generation manner. Such a generation procedure tends to struggle when there exist complex information structures to be extracted. Retrieving knowledge from external knowledge bases may help models to overcome this problem, but it is impossible to construct a knowledge base suitable for various IE tasks. Inspired by the fact that a large amount of knowledge is stored in pretrained language models (PLMs) and can be retrieved explicitly, in this paper we propose MetaRetriever to retrieve task-specific knowledge from PLMs to enhance universal IE. As different IE tasks need different knowledge, we further propose a Meta-Pretraining Algorithm which allows MetaRetriever to quickly achieve maximum task-specific retrieval performance when fine-tuning on downstream IE tasks. Experimental results show that MetaRetriever achieves the new state-of-the-art on 4 IE tasks and 12 datasets under fully-supervised, low-resource and few-shot scenarios.

artificial intelligence, metaretriever, natural language, (16 more...)

arXiv.org Artificial Intelligence

2306.10444

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
North America > Canada > Ontario (0.04)
(14 more...)

Genre: Research Report > New Finding (0.87)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Opinion Tree Parsing for Aspect-based Sentiment Analysis

Bao, Xiaoyi, Jiang, Xiaotong, Wang, Zhongqing, Zhang, Yue, Zhou, Guodong

arXiv.org Artificial Intelligence Jun-15-2023

Extracting sentiment elements using pre-trained generative models has recently led to large improvements in aspect-based sentiment analysis benchmarks. However, these models always need large-scale computing resources, and they also ignore explicit modeling of structure between sentiment elements. To address these challenges, we propose an opinion tree parsing model, aiming to parse all the sentiment elements from an opinion tree, which is much faster, and can explicitly reveal a more comprehensive and complete aspect-level sentiment structure. In particular, we first introduce a novel context-free opinion grammar to normalize the opinion tree structure. We then employ a neural chart-based opinion tree parser to fully explore the correlations among sentiment elements and parse them into an opinion tree structure. Extensive experiments show the superiority of our proposed model and the capacity of the opinion tree parser with the proposed context-free opinion grammar. More importantly, the results also prove that our model is much faster than previous models.

artificial intelligence, computational linguistic, natural language, (13 more...)

arXiv.org Artificial Intelligence

2306.08925

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
North America > Dominican Republic (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.86)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.72)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.72)

WebIE: Faithful and Robust Information Extraction on the Web

Whitehouse, Chenxi, Vania, Clara, Aji, Alham Fikri, Christodoulopoulos, Christos, Pierleoni, Andrea

arXiv.org Artificial Intelligence Jun-15-2023

Extracting structured and grounded fact triples from raw text is a fundamental task in Information Extraction (IE). Existing IE datasets are typically collected from Wikipedia articles, using hyperlinks to link entities to the Wikidata knowledge base. However, models trained only on Wikipedia have limitations when applied to web domains, which often contain noisy text or text that does not have any factual information. We present WebIE, the first large-scale, entity-linked closed IE dataset consisting of 1.6M sentences automatically collected from the English Common Crawl corpus. WebIE also includes negative examples, i.e. sentences without fact triples, to better reflect the data on the web. We annotate ~21K triples from WebIE through crowdsourcing and introduce mWebIE, a translation of the annotated set in four other languages: French, Spanish, Portuguese, and Hindi. We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find models trained on WebIE show better generalisability. We also propose three training strategies that use entity linking as an auxiliary task. Our experiments show that adding Entity-Linking objectives improves the faithfulness of our generative IE models.

data mining, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.14293

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
(20 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.93)
Law (0.67)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(3 more...)

AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian

Çano, Erion

arXiv.org Artificial Intelligence Jun-14-2023

Lack of available resources such as text corpora for low-resource languages seriously hinders research on natural language processing and computational linguistics. This paper presents AlbMoRe, a corpus of 800 sentiment annotated movie reviews in Albanian. Each text is labeled as positive or negative and can be used for sentiment analysis research. Preliminary results based on traditional machine learning classifiers trained with the AlbMoRe samples are also reported. They can serve as comparison baselines for future research experiments.

artificial intelligence, computational linguistic, natural language, (15 more...)

arXiv.org Artificial Intelligence

2306.08526

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York > New York County > New York City (0.05)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.05)
(6 more...)

Genre: Research Report (0.52)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.76)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.76)

Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts

Leavy, Susan, Meaney, Gerardine, Wade, Karen, Greene, Derek

arXiv.org Artificial Intelligence Jun-13-2023

The increasing availability of digital collections of historical and contemporary literature presents a wealth of possibilities for new research in the humanities. The scale and diversity of such collections, however, present particular challenges in identifying and extracting relevant content. This paper presents Curatr, an online platform for the exploration and curation of literature with machine learning-supported semantic search, designed within the context of digital humanities scholarship. The platform provides a text mining workflow that combines neural word embeddings with expert domain knowledge to enable the generation of thematic lexicons, allowing researchers to curate relevant sub-corpora from a large corpus of 18th and 19th century digitised texts.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-030-36599-8_31

2306.0802

Country:

Asia > Middle East > Israel (0.05)
Europe > France (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (0.68)
Health & Medicine > Epidemiology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.68)

Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark

Augustyniak, Łukasz, Woźniak, Szymon, Gruza, Marcin, Gramacki, Piotr, Rajda, Krzysztof, Morzy, Mikołaj, Kajdanowicz, Tomasz

arXiv.org Artificial Intelligence Jun-13-2023

Despite impressive advancements in multilingual corpora collection and model training, developing large-scale deployments of multilingual models still presents a significant challenge. This is particularly true for language tasks that are culture-dependent. One such example is the area of multilingual sentiment analysis, where affective markers can be subtle and deeply ensconced in culture. This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature based on strict quality criteria. The corpus covers 27 languages representing 6 language families. Datasets can be queried using several linguistic and functional features. In addition, we present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2306.07902

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Qatar (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
(21 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Weakly supervised information extraction from inscrutable handwritten document images

Paul, Sujoy, Madan, Gagan, Mishra, Akankshya, Hegde, Narayan, Kumar, Pradeep, Aggarwal, Gaurav

arXiv.org Artificial Intelligence Jun-11-2023

State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured, handwritten documents remain a challenge. Adapting existing models to domain-specific training data is quite expensive because of two factors: 1) limited availability of domain-specific documents (such as handwritten prescriptions, lab notes, etc.), and 2) annotation becomes even more challenging, as one needs domain-specific knowledge to decode inscrutable handwritten document images. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consists of images along with the list of medicine names in each image, but not their location in the image. We solve the problem by first identifying the regions of interest, i.e., medicine lines, from just weak labels and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs > 2.5x better in medicine name extraction from prescriptions.

data mining, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2306.06823

Country: Asia > India (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.61)

Instruction Tuning for Few-Shot Aspect-Based Sentiment Analysis

Varia, Siddharth, Wang, Shuai, Halder, Kishaloy, Vacareanu, Robert, Ballesteros, Miguel, Benajiba, Yassine, John, Neha Anna, Anubhai, Rishita, Muresan, Smaranda, Roth, Dan

arXiv.org Artificial Intelligence Jun-11-2023

Aspect-based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task which involves four elements from user-generated texts: aspect term, aspect category, opinion term, and sentiment polarity. Most computational approaches focus on some of the ABSA sub-tasks, such as tuple (aspect term, sentiment polarity) or triplet (aspect term, opinion term, sentiment polarity) extraction, using either pipeline or joint modeling approaches. Recently, generative approaches have been proposed to extract all four elements as (one or more) quadruplets from text as a single task. In this work, we take a step further and propose a unified framework for solving ABSA and the associated sub-tasks to improve performance in few-shot scenarios. To this end, we fine-tune a T5 model with instructional prompts in a multi-task learning fashion covering all the sub-tasks, as well as the entire quadruple prediction task. In experiments with multiple benchmark datasets, we show that the proposed multi-task prompting approach brings a performance boost (by absolute 8.29 F1) in the few-shot learning setting.
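
The relationship between the quadruple and the tuple and triplet sub-tasks named in this abstract can be sketched as a small data structure. This is an illustration only: the `Quad` class, the `INSTRUCTIONS` wording, the `build_prompt` helper, and the example values are all hypothetical and are not the prompts or code used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """The four ABSA elements listed in the abstract."""
    aspect_term: str
    aspect_category: str
    opinion_term: str
    polarity: str

    def as_pair(self) -> tuple[str, str]:
        # Tuple sub-task: (aspect term, sentiment polarity).
        return (self.aspect_term, self.polarity)

    def as_triplet(self) -> tuple[str, str, str]:
        # Triplet sub-task: (aspect term, opinion term, sentiment polarity).
        return (self.aspect_term, self.opinion_term, self.polarity)


# Hypothetical instruction wording, one per task; the paper's real prompts may differ.
INSTRUCTIONS = {
    "pair": "Extract (aspect term, sentiment polarity) pairs.",
    "triplet": "Extract (aspect term, opinion term, sentiment polarity) triplets.",
    "quad": "Extract (aspect term, aspect category, opinion term, sentiment polarity) quadruples.",
}


def build_prompt(task: str, text: str) -> str:
    """Prefix the input text with the instruction for the chosen sub-task."""
    return f"{INSTRUCTIONS[task]}\nText: {text}"


example = Quad("battery life", "battery", "lasts all day", "positive")
```

Because every sub-task target is a projection of the same quadruple, one annotated example can yield a training instance per task, which is one plausible reading of the multi-task setup the abstract describes.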

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2210.06629

Country:

North America > United States > Arizona > Pima County > Tucson (0.14)
North America > United States > Colorado > Denver County > Denver (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
