term extraction
Crossing Domains without Labels: Distant Supervision for Term Extraction
Senger, Elena, Campbell, Yuri, van der Goot, Rob, Plank, Barbara
Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document and corpus levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, often needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.
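The pseudo-labelling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `query_llm` is a hypothetical stand-in for the actual black-box GPT-4o call, and the span-alignment logic is an assumption about how extracted terms would be turned into fine-tuning data.

```python
# Sketch of distant supervision via pseudo-labelling: a black-box LLM
# proposes terms for each document, and its answers become silver training
# labels for fine-tuning. `query_llm` is a hypothetical placeholder.

def query_llm(document: str) -> list[str]:
    # Placeholder: a real implementation would call the LLM API with an
    # extraction prompt and parse the returned term list.
    return [t for t in ["term extraction", "ontology"] if t in document]

def build_silver_dataset(documents: list[str]) -> list[dict]:
    """Pair each document with the LLM's pseudo-labelled term spans."""
    dataset = []
    for doc in documents:
        terms = query_llm(doc)
        # Keep only terms that literally occur in the text, so the
        # character offsets needed for fine-tuning are well defined.
        spans = [(doc.index(t), doc.index(t) + len(t), t)
                 for t in terms if t in doc]
        dataset.append({"text": doc, "spans": spans})
    return dataset

docs = ["We study term extraction for ontology construction."]
silver = build_silver_dataset(docs)
```

Grounding the labels in literal character spans is what makes the silver data usable for token-level fine-tuning later.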
Methods for Recognizing Nested Terms
Rozhkov, Igor, Loukachevitch, Natalia
Terms are defined as words or phrases that denote concepts of a specific domain, and knowing them is important for domain analysis, machine translation, or domain-specific information retrieval. Various approaches have been proposed for automatic term extraction. However, automatic methods do not yet achieve the quality of manual term analysis. During recent years, machine learning methods have been intensively studied (Loukachevitch, 2012; Charalampakis et al., 2016; Nadif and Role, 2021). The application of machine learning improves the quality of term extraction, but requires creating training datasets. In addition, the transfer of a trained model from one domain to another usually leads to degradation of the performance of term extraction. Currently, language models (Xie et al., 2022; Liu et al., 2020) are being tested in automatic term extraction.
Extracting domain-specific terms using contextual word embeddings
Repar, Andraž, Lavrač, Nada, Pollak, Senja
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection, and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state-of-the-art, which proves that contextual word embeddings are valuable for improving term extraction.
1. Introduction
Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word (SWU) or multi-word units (MWU) of knowledge, which are relevant for a particular domain. Since manual identification of terms is costly and time-consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields, such as translation, dictionary creation, ontology generation and others, remains a difficult task.
Benchmarking terminology building capabilities of ChatGPT on an English-Russian Fashion Corpus
Bezobrazova, Anastasiia, Seghiri, Miriam, Orasan, Constantin
This paper compares the accuracy of the terms extracted using SketchEngine, TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions produced by ChatGPT for these terms. The research is carried out on a comparable corpus of fashion magazines written in English and Russian collected from the web. A gold standard for the fashion terminology was also developed by identifying web pages that can be harvested automatically and contain definitions of terms from the fashion domain in English and Russian. This gold standard was used to evaluate the quality of the extracted terms and of the definitions produced. Our evaluation shows that TBXTools and SketchEngine, while capable of high recall, suffer from reduced precision as the number of terms increases, which affects their overall performance. Conversely, ChatGPT demonstrates superior performance, maintaining or improving precision as more terms are considered. Analysis of the definitions produced by ChatGPT for 60 commonly used terms in English and Russian shows that ChatGPT maintains a reasonable level of accuracy and fidelity across languages, but sometimes the definitions in both languages miss crucial specifics and include unnecessary deviations. Our research reveals that no single tool excels universally; each has strengths suited to particular aspects of terminology extraction and application.
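The precision-versus-term-count comparison described above amounts to a precision-at-k evaluation against the gold standard. The sketch below illustrates that metric with invented example terms; the data is not from the paper.

```python
# Precision@k against a gold-standard term list: as more ranked candidate
# terms are considered, precision is recomputed over the top-k. Terms here
# are illustrative only.

def precision_at_k(ranked_terms: list[str], gold: set[str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are gold terms."""
    top = ranked_terms[:k]
    hits = sum(1 for t in top if t in gold)
    return hits / k

gold = {"haute couture", "hemline", "capsule wardrobe"}
ranked = ["hemline", "haute couture", "season", "capsule wardrobe", "style"]

p_at_2 = precision_at_k(ranked, gold, 2)
p_at_5 = precision_at_k(ranked, gold, 5)
```

A tool whose precision@k degrades quickly as k grows matches the behaviour reported for TBXTools and SketchEngine, while a flat or rising curve matches ChatGPT's.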
AskBeacon -- Performing genomic data exchange and analytics with natural language
Wickramarachchi, Anuradha, Tonni, Shakila, Majumdar, Sonali, Karimi, Sarvnaz, Kõks, Sulev, Hosking, Brendan, Rambla, Jordi, Twine, Natalie A., Jain, Yatish, Bauer, Denis C.
For the two investigated workflows, there are significant differences in the prediction of variant terms and additional phenotypic filtering terms. An intuitive comparison between the parallel and multistep extraction models is that, in the parallel workflow, the models' instructions are rather simple: the model is asked to predict only variant-specific fields (variants extractor template) and other fields (filter extractor template), without regard for whether those fields are present in the Beacon schema. Not all terms extracted by this extractor chain are valid for Beacon, so a validator template is additionally required to filter out the terms that are not related to Beacon. In contrast, in the multistep workflow, both the variant and phenotypic terms are extracted only when they match the Beacon schema, without the need for a validation prompt. Thus, although these models predict fewer terms, the extracted terms are aligned with the schema, with less hallucination than in the parallel workflow, as seen in the previous section.
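The validation step that the parallel workflow needs can be sketched as a simple schema filter. The field names below are a simplified, hypothetical stand-in for the real Beacon schema, and `patientMood` is an invented example of a hallucinated field.

```python
# Sketch of validating extracted terms against a (simplified, hypothetical)
# Beacon-style schema: any field the extractors produced that the schema
# does not define is dropped.

BEACON_FIELDS = {"referenceName", "start", "end", "alternateBases", "assemblyId"}

def filter_to_schema(extracted: dict) -> dict:
    """Keep only extracted fields that exist in the schema."""
    return {k: v for k, v in extracted.items() if k in BEACON_FIELDS}

parallel_output = {
    "referenceName": "chr1",
    "start": 55039974,
    "alternateBases": "T",
    "patientMood": "anxious",  # hallucinated field, not in the schema
}
valid = filter_to_schema(parallel_output)
```

The multistep workflow avoids this post-hoc filter by constraining extraction to schema fields from the start, which is why it hallucinates less.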
CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature
Delaunay, Julien, Tran, Hanh Thi Hong, González-Gallardo, Carlos-Emiliano, Bordea, Georgeta, Ducos, Mathilde, Sidere, Nicolas, Doucet, Antoine, Pollak, Senja, De Viron, Olivier
The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and an F1 of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
PyABSA: A Modularized Framework for Reproducible Aspect-based Sentiment Analysis
Yang, Heng, Zhang, Chen, Li, Ke
The advancement of aspect-based sentiment analysis (ABSA) has highlighted the lack of a user-friendly framework that can greatly lower the difficulty of reproducing state-of-the-art ABSA performance, especially for beginners. To meet this demand, we present PyABSA, a modularized framework built on PyTorch for reproducible ABSA. To facilitate ABSA research, PyABSA supports several ABSA subtasks, including aspect term extraction, aspect sentiment classification, and end-to-end aspect-based sentiment analysis. Concretely, PyABSA integrates 29 models and 26 datasets. With just a few lines of code, the result of a model on a specific dataset can be reproduced. With a modularized design, PyABSA can also be flexibly extended to accommodate new models, datasets, and other related tasks. Besides, PyABSA highlights its data augmentation and annotation features, which significantly address data scarcity. All are welcome to have a try at https://github.com/yangheng95/PyABSA.
Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings
Färber, Michael, Popovic, Nicholas
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.
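The suggestion mechanism behind such a tool can be sketched as nearest-neighbour search in embedding space. The vectors below are toy values standing in for real word embeddings or ConceptNet-derived vectors, and the vocabulary is invented for illustration.

```python
# Toy sketch of embedding-based related-term suggestion: terms closest to a
# seed term by cosine similarity are proposed to the user for confirmation
# or rejection.
import numpy as np

vocab = {
    "battery":   np.array([0.9, 0.1, 0.0]),
    "capacitor": np.array([0.8, 0.2, 0.1]),
    "poetry":    np.array([0.0, 0.1, 0.9]),
}

def suggest(seed: str, k: int = 1) -> list[str]:
    """Return the k terms most similar to `seed` (excluding itself)."""
    s = vocab[seed]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(s, v), t) for t, v in vocab.items() if t != seed]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

nearest = suggest("battery")
```

An ensemble, as in Vocab-Expander, would average or vote over suggestions from several embedding spaces rather than using a single one.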
The Recent Advances in Automatic Term Extraction: A survey
Tran, Hanh Thi Hong, Martinc, Matej, Caporusso, Jaya, Doucet, Antoine, Pollak, Senja
Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.
Ensembling Transformers for Cross-domain Automatic Term Extraction
Tran, Hanh Thi Hong, Martinc, Matej, Pelicon, Andraz, Doucet, Antoine, Pollak, Senja
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformer-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by conducting the intersection or union on the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models for all languages except Dutch and French, when the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best-performing models, we achieve significant improvements.
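The ensembling strategy described above reduces to set operations over each model's predicted term list. The term lists below are illustrative, not from the ACTER or RSDO5 experiments.

```python
# Sketch of ensembling two term extractors: union of the predicted term
# sets trades precision for recall, intersection does the opposite.

model_a = {"wind turbine", "rotor blade", "grid", "the energy"}  # noisier model
model_b = {"wind turbine", "rotor blade", "nacelle"}

union_terms = model_a | model_b          # favours recall
intersection_terms = model_a & model_b   # favours precision
```

Which combination wins in practice depends on whether the downstream use penalises missed terms or spurious ones more heavily.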