

FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Pignat, Johann, Vucetic, Milena, Gaudet-Blavignac, Christophe, Zaghir, Jamil, Stettler, Amandine, Amrein, Fanny, Bonjour, Jonatan, Goldman, Jean-Philippe, Michielin, Olivier, Lovis, Christian, Bjelogrlic, Mina

arXiv.org Artificial Intelligence

Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as the reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: the 1301 texts were manually annotated for entity spans by two domain experts, and a total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
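The abstract does not detail how the automated matching step works; a minimal sketch of dictionary-based ICD-O normalisation with case and accent folding might look like the following (the lexicon entries are illustrative placeholders, though 8140/3 and C34.9 are real ICD-O codes):

```python
import unicodedata

# Hypothetical mini-lexicon: the real FRACCO corpus maps 2549 morphology
# expressions to 399 codes and 3143 topography expressions to 272 codes.
ICDO_LEXICON = {
    "adenocarcinome": "8140/3",         # morphology
    "carcinome epidermoide": "8070/3",  # morphology
    "poumon": "C34.9",                  # topography
}

def normalise(expression):
    """Fold case and strip accents before the dictionary lookup,
    so 'Adénocarcinome' matches the unaccented lexicon key."""
    folded = unicodedata.normalize("NFD", expression.lower())
    folded = "".join(ch for ch in folded if unicodedata.category(ch) != "Mn")
    return ICDO_LEXICON.get(folded)
```

In practice such a lookup would only cover exact surface forms, which is why the corpus combines automated matching with manual validation.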


SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

Mariotti, Luca, Guidetti, Veronica, Mandreoli, Federica

arXiv.org Artificial Intelligence

The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE's minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
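The abstract does not give the exact form of the Bayesian kNN head; as a rough sketch of the multi-label kNN idea (the Beta-prior smoothing, function names, and relation labels here are assumptions, not the paper's formulation):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def knn_multilabel(query, bank, k=3, prior=0.5, strength=2.0):
    """Score each relation label by the fraction of the k nearest
    embeddings in `bank` that carry it, smoothed towards `prior`
    with a pseudo-count of `strength` neighbours."""
    neighbours = sorted(bank, key=lambda item: -cosine(query, item[0]))[:k]
    counts = Counter(lab for _, labels in neighbours for lab in labels)
    return {lab: (counts[lab] + prior * strength) / (k + strength)
            for lab in counts}
```

Because the classifier is non-parametric, swapping the PLM that produces the embeddings requires no retraining, only rebuilding the neighbour bank.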


MONOVAB : An Annotated Corpus for Bangla Multi-label Emotion Detection

Banshal, Sumit Kumar, Das, Sajal, Shammi, Shumaiya Akter, Chakraborty, Narayan Ranjan

arXiv.org Artificial Intelligence

In recent years, Sentiment Analysis (SA) and Emotion Recognition (ER) have become increasingly popular for the Bangla language, the seventh most spoken language in the world. However, the language is structurally complicated, which makes accurate emotion extraction arduous. Several distinct approaches, such as the extraction of positive and negative sentiments as well as multiclass emotions, have been implemented in this field of study. Nevertheless, multi-label extraction, which involves identifying several feelings in a single piece of text, remains almost untouched for this language. Therefore, this study demonstrates a thorough method for constructing an annotated corpus from data scraped from Facebook to bridge the gaps in this subject area. To make the annotation more fruitful, a context-based approach has been used. Bidirectional Encoder Representations from Transformers (BERT), a well-known transformer methodology, has shown the best results of all the methods implemented. Finally, a web application has been developed to demonstrate the performance of the pre-trained top-performing model (BERT) for multi-label ER in Bangla.


CEREC: A Corpus for Entity Resolution in Email Conversations

Dakle, Parag Pravin, Moldovan, Dan I.

arXiv.org Artificial Intelligence

We present the first large-scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6001 email threads from the Enron Email Corpus, containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments evaluate different features and the performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 59.2 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.


The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews

Tutubalina, Elena, Alimova, Ilseyar, Miftahutdinov, Zulfat, Sakhovskiy, Andrey, Malykh, Valentin, Nikolenko, Sergey

arXiv.org Artificial Intelligence

The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products, for the detection of health-related named entities and of product effectiveness. The corpus consists of two parts, a raw one and a labelled one. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Sentences are labelled for the presence or absence of health-related issues, and sentences containing them are additionally labelled at the expression level to identify fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present baseline models for named entity recognition (NER) and multi-label sentence classification on this corpus. Our RuDR-BERT model achieved a macro F1 score of 74.85% on the NER task. For the sentence classification task, our model achieves a macro F1 score of 68.82%, a 7.47% gain over a BERT model trained on Russian data. We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC
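For reference, the macro F1 scores reported above average per-label F1 without weighting by label frequency; a small self-contained implementation over label sets (label names here are illustrative, not the corpus's actual tag set):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores.
    `gold` and `pred` are lists of label sets, one per sentence."""
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging makes rare labels count as much as frequent ones, which matters for imbalanced multi-label corpora like this one.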


An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols

Kulkarni, Chaitanya, Xu, Wei, Ritter, Alan, Machiraju, Raghu

arXiv.org Artificial Intelligence

We describe an effort to annotate a corpus of natural language instructions consisting of 622 wet lab protocols to facilitate automatic or semi-automatic conversion of protocols into a machine-readable format and benefit biological research. Experimental results demonstrate the utility of our corpus for developing machine learning approaches to shallow semantic parsing of instructional texts. We make our annotated Wet Lab Protocol Corpus available to the research community.


fekr/postagga

#artificialintelligence

"But if thought corrupts language, language can also corrupt thought." You can use postagga to process annotated text samples into full-fledged parsers capable of understanding "free speech" input as structured data. Ah, and you'll be able to do this easily. The models are included under the models folder. We also ship two light models as vars defined in namespaces, one for French and one for English, since artifact size is a concern for JavaScript.


A Word Embedding and a Josa Vector for Korean Unsupervised Semantic Role Induction

Nam, Kyeong-Min (Hallym University) | Kim, Yu-Seop (Hallym University)

AAAI Conferences

We propose an unsupervised semantic role labeling method for Korean, an agglutinative language whose complicated suffix structures carry much syntactic information. First, we construct an argument embedding and an indicator vector for the suffix, such as a Josa. We then construct an argument tuple by concatenating these two vectors. Role induction is performed by clustering the argument tuples. The method achieves up to a 70.16% F1-score and 75.85% accuracy.
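The argument-tuple construction can be sketched as a simple vector concatenation (the Josa inventory, its romanisation, and the embedding values here are illustrative assumptions, not the paper's setup):

```python
# Hypothetical inventory of Korean case-marking particles (Josa):
# nominative, accusative, locative, instrumental.
JOSAS = ["i/ga", "eul/reul", "e", "euro"]

def josa_onehot(josa):
    """One-hot indicator vector over the Josa inventory."""
    return [1.0 if josa == j else 0.0 for j in JOSAS]

def argument_tuple(embedding, josa):
    """Concatenate the argument's word embedding with its Josa
    indicator vector; the resulting tuples are what get clustered."""
    return embedding + josa_onehot(josa)
```

Clustering these tuples (e.g. with k-means) then groups arguments that share both distributional context and case marking, which is the paper's route to role induction without labeled data.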


Rapid Adaptation of POS Tagging for Domain Specific Uses

Miller, John E., Bloodgood, Michael, Torii, Manabu, Vijay-Shanker, K.

arXiv.org Machine Learning

Part-of-speech (POS) tagging is a fundamental component for performing natural language tasks such as parsing, information extraction, and question answering. When POS taggers are trained in one domain and applied in significantly different domains, their performance can degrade dramatically. We present a methodology for rapid adaptation of POS taggers to new domains. Our technique is unsupervised in that a manually annotated corpus for the new domain is not necessary. We use suffix information gathered from large amounts of raw text, as well as orthographic information, to increase the lexical coverage. We present an experiment in the biological domain where our POS tagger achieves results comparable to POS taggers specifically trained for this domain. Many machine-learning and statistical techniques employed for POS tagging train a model on an annotated corpus, such as the Penn Treebank (Marcus et al., 1993). Most state-of-the-art POS taggers use two main sources of information: 1) information about neighboring tags, and 2) information about the word itself. Methods using both sources of information for tagging include Hidden Markov Modeling, Maximum Entropy modeling, and Transformation-Based Learning (Brill, 1995).
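The suffix-gathering step can be sketched as counting word-final character n-grams over raw domain text; frequent suffixes then back off an unknown-word model. This is a minimal illustration, not the paper's exact procedure, and the thresholds are arbitrary:

```python
from collections import Counter

def suffix_counts(words, max_len=3, min_freq=2):
    """Count word-final character n-grams (length 1..max_len) in raw
    domain text and keep only those seen at least min_freq times.
    In a biological corpus, suffixes like '-ase' signal nouns."""
    counts = Counter()
    for w in words:
        for n in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-n:]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}
```

A tagger can associate each surviving suffix with the tag distribution of the known words bearing it, extending lexical coverage to out-of-vocabulary domain terms.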


Studying Properties of Czech Complex Sentences from an Annotated Corpus

Kubon, Vladislav (Charles University in Prague) | Lopatkova, Marketa (Charles University in Prague)

AAAI Conferences

The paper deals with the problem of an analysis of complex sentences in Czech on the basis of manually annotated data. The availability of a specialized corpus explicitly describing mutual relationships between segments and clauses in Czech complex sentences, together with the availability of a thoroughly syntactically annotated corpus, the Prague Dependency Treebank, provide a solid background for linguistic investigation. The paper presents quantitative, linguistic and structural observations which provide a number of clues for building an algorithm for analyzing a structure of complex sentences in the future.