Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Chun, Yongchan, Kim, Minhyuk, Kim, Dongjun, Park, Chanjun, Lim, Heuiseok
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
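The core idea, retrieving few-shot demonstrations by syntactic rather than semantic similarity, can be illustrated with a minimal sketch. The helper names (`syntactic_similarity`, `retrieve_demonstrations`) and the use of POS-tag sequence matching via `difflib` are illustrative assumptions, not the paper's actual implementation; part-of-speech tagging itself is assumed to be done by an external tagger.

```python
from difflib import SequenceMatcher

def syntactic_similarity(tags_a, tags_b):
    """Similarity (0..1) of two POS-tag sequences via matching subsequence blocks."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()

def retrieve_demonstrations(query_tags, pool, k=2):
    """Pick the k pool sentences whose POS-tag sequences best match the query.

    `pool` is a list of (sentence, pos_tags) pairs.  Note that ranking ignores
    the words entirely -- only the syntactic shape of each sentence matters.
    """
    ranked = sorted(pool,
                    key=lambda item: syntactic_similarity(query_tags, item[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Toy example: the query's DET-ADJ-NOUN-NOUN pattern is shared with the third
# candidate even though their topics (heart failure vs. wind energy) differ.
query = ["DET", "ADJ", "NOUN", "NOUN", "VERB"]
pool = [
    ("The wind turbine operates offshore.", ["DET", "NOUN", "NOUN", "VERB", "ADV"]),
    ("Researchers collaborated internationally.", ["NOUN", "VERB", "ADV"]),
    ("A large heart failure cohort emerged.", ["DET", "ADJ", "NOUN", "NOUN", "NOUN", "VERB"]),
]
demos = retrieve_demonstrations(query, pool, k=2)
print(demos)
```

Retrieving by syntactic shape is what makes the strategy domain-agnostic: a demonstration from an unrelated domain can still show the model where term boundaries fall in a sentence with the same structure.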
Methods for Recognizing Nested Terms
Rozhkov, Igor, Loukachevitch, Natalia
Terms are defined as words or phrases that denote concepts of a specific domain, and knowing them is important for domain analysis, machine translation, or domain-specific information retrieval. Various approaches have been proposed for automatic term extraction. However, automatic methods do not yet achieve the quality of manual term analysis. During recent years, machine learning methods have been intensively studied (Loukachevitch, 2012; Charalampakis et al., 2016; Nadif and Role, 2021). The application of machine learning improves the quality of term extraction, but requires creating training datasets. In addition, the transfer of a trained model from one domain to another usually leads to degradation of the performance of term extraction. Currently, language models (Xie et al., 2022; Liu et al., 2020) are tested in automatic term extraction.
CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature
Delaunay, Julien, Tran, Hanh Thi Hong, González-Gallardo, Carlos-Emiliano, Bordea, Georgeta, Ducos, Mathilde, Sidere, Nicolas, Doucet, Antoine, Pollak, Senja, De Viron, Olivier
The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automatic term extraction and approximately 70% for extracting terms together with their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
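Transformer-based ATE systems like the one described above are typically trained as token-level sequence labelers; the terms themselves are then recovered by decoding the label sequence. The sketch below shows that decoding step for a standard BIO scheme. The function name and the `B-TERM`/`I-TERM`/`O` label set are assumptions for illustration, not the paper's exact tag inventory.

```python
def bio_to_terms(tokens, labels):
    """Collect term spans from token-level BIO labels (B-TERM / I-TERM / O)."""
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-TERM":
            if current:                      # close the previous term
                terms.append(" ".join(current))
            current = [token]                # start a new term
        elif label == "I-TERM" and current:
            current.append(token)            # extend the open term
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:                              # flush a term ending the sentence
        terms.append(" ".join(current))
    return terms

tokens = ["Coastal", "erosion", "threatens", "salt", "marsh", "habitats", "."]
labels = ["B-TERM", "I-TERM", "O", "B-TERM", "I-TERM", "I-TERM", "O"]
extracted = bio_to_terms(tokens, labels)
print(extracted)  # ['Coastal erosion', 'salt marsh habitats']
```

Term classification (the ATC task) would simply attach a role label, such as an ARDI category, to each decoded span.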
Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings
Färber, Michael, Popovic, Nicholas
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.
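At its core, embedding-based vocabulary expansion ranks candidate terms by their similarity to the user's seed terms. The following is a minimal sketch of that idea using hand-made toy vectors and plain cosine similarity; the function names and the scoring rule (best similarity to any seed) are assumptions for illustration and do not reflect Vocab-Expander's actual ensemble of web-text and ConceptNet embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def suggest_terms(seed_terms, embeddings, k=2):
    """Rank candidate terms by their best cosine similarity to any seed term."""
    candidates = [t for t in embeddings if t not in seed_terms]
    scored = [(max(cosine(embeddings[c], embeddings[s]) for s in seed_terms), c)
              for c in candidates]
    return [term for _, term in sorted(scored, reverse=True)[:k]]

# Hand-made 3-d vectors standing in for real word embeddings.
embeddings = {
    "battery":   [0.9, 0.1, 0.0],
    "capacitor": [0.8, 0.2, 0.1],
    "storage":   [0.7, 0.3, 0.0],
    "poetry":    [0.0, 0.1, 0.9],
}
suggestions = suggest_terms(["battery"], embeddings, k=2)
print(suggestions)  # ['capacitor', 'storage']
```

In the real system, a user would then confirm or reject each suggestion through the interface, growing the vocabulary iteratively.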
The Recent Advances in Automatic Term Extraction: A Survey
Tran, Hanh Thi Hong, Martinc, Matej, Caporusso, Jaya, Doucet, Antoine, Pollak, Senja
Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.
Ensembling Transformers for Cross-domain Automatic Term Extraction
Tran, Hanh Thi Hong, Martinc, Matej, Pelicon, Andraz, Doucet, Antoine, Pollak, Senja
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformer-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by taking the intersection or union of the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from related work leveraging multilingual models, for all languages except Dutch and French when the term extraction task excludes named entity terms. Furthermore, by combining the outputs of the two best-performing models, we achieve significant improvements.
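The ensembling strategy described above, combining the term sets output by different models via intersection or union, reduces to plain set operations. A minimal sketch (the function name and toy term lists are illustrative, not the authors' code):

```python
def ensemble_terms(predictions, mode="union"):
    """Combine term lists predicted by different models.

    `mode="union"` keeps any term predicted by at least one model (higher
    recall); `mode="intersection"` keeps only terms all models agree on
    (higher precision).
    """
    sets = [set(p) for p in predictions]
    if mode == "union":
        combined = set.union(*sets)
    elif mode == "intersection":
        combined = set.intersection(*sets)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(combined)

model_a = ["wind energy", "turbine", "rotor blade"]
model_b = ["wind energy", "rotor blade", "grid"]
print(ensemble_terms([model_a, model_b], mode="intersection"))
# ['rotor blade', 'wind energy']
print(ensemble_terms([model_a, model_b], mode="union"))
# ['grid', 'rotor blade', 'turbine', 'wind energy']
```

The precision/recall trade-off between the two modes is what makes the choice of operation depend on the downstream use of the extracted terms.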