term extraction
Crossing Domains without Labels: Distant Supervision for Term Extraction
Senger, Elena, Campbell, Yuri, van der Goot, Rob, Plank, Barbara
Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document and corpus levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, often needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.
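The pseudo-labelling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `query_llm` is a hypothetical stand-in for the actual black-box GPT-4o call, and the span-alignment logic is an assumption about how extracted terms would be turned into fine-tuning data.

```python
# Sketch of distant supervision via pseudo-labelling: a black-box LLM
# proposes terms for each document, and its answers become silver training
# labels for fine-tuning. `query_llm` is a hypothetical placeholder.

def query_llm(document: str) -> list[str]:
    # Placeholder: a real implementation would call the LLM API with an
    # extraction prompt and parse the returned term list.
    return [t for t in ["term extraction", "ontology"] if t in document]

def build_silver_dataset(documents: list[str]) -> list[dict]:
    """Pair each document with the LLM's pseudo-labelled term spans."""
    dataset = []
    for doc in documents:
        terms = query_llm(doc)
        # Keep only terms that literally occur in the text, so the
        # character offsets needed for fine-tuning are well defined.
        spans = [(doc.index(t), doc.index(t) + len(t), t)
                 for t in terms if t in doc]
        dataset.append({"text": doc, "spans": spans})
    return dataset

docs = ["We study term extraction for ontology construction."]
silver = build_silver_dataset(docs)
```

Grounding the labels in literal character spans is what makes the silver data usable for token-level fine-tuning later.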
Methods for Recognizing Nested Terms
Rozhkov, Igor, Loukachevitch, Natalia
Terms are defined as words or phrases that denote concepts of a specific domain, and knowing them is important for domain analysis, machine translation, or domain-specific information retrieval. Various approaches have been proposed for automatic term extraction. However, automatic methods do not yet achieve the quality of manual term analysis. During recent years, machine learning methods have been intensively studied (Loukachevitch, 2012; Charalampakis et al., 2016; Nadif and Role, 2021). The application of machine learning improves the quality of term extraction, but requires creating training datasets. In addition, the transfer of a trained model from one domain to another usually leads to degradation of the performance of term extraction. Currently, language models (Xie et al., 2022; Liu et al., 2020) are being tested in automatic term extraction.
Extracting domain-specific terms using contextual word embeddings
Repar, Andraž, Lavrač, Nada, Pollak, Senja
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection, and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state-of-the-art, which proves that contextual word embeddings are valuable for improving term extraction.
1. Introduction
Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word (SWU) or multi-word units (MWU) of knowledge, which are relevant for a particular domain. Since manual identification of terms is costly and time-consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields, such as translation, dictionary creation, ontology generation and others, remains a difficult task.
Benchmarking terminology building capabilities of ChatGPT on an English-Russian Fashion Corpus
Bezobrazova, Anastasiia, Seghiri, Miriam, Orasan, Constantin
This paper compares the accuracy of the terms extracted using SketchEngine, TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions produced by ChatGPT for these terms. The research is carried out on a comparable corpus of fashion magazines written in English and Russian collected from the web. A gold standard for the fashion terminology was also developed by identifying web pages that can be harvested automatically and contain definitions of terms from the fashion domain in English and Russian. This gold standard was used to evaluate the quality of the extracted terms and of the definitions produced. Our evaluation shows that TBXTools and SketchEngine, while capable of high recall, suffer from reduced precision as the number of terms increases, which affects their overall performance. Conversely, ChatGPT demonstrates superior performance, maintaining or improving precision as more terms are considered. Analysis of the definitions produced by ChatGPT for 60 commonly used terms in English and Russian shows that ChatGPT maintains a reasonable level of accuracy and fidelity across languages, but sometimes the definitions in both languages miss crucial specifics and include unnecessary deviations. Our research reveals that no single tool excels universally; each has strengths suited to particular aspects of terminology extraction and application.
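The precision-versus-term-count comparison described above amounts to a precision-at-k evaluation against the gold standard. The sketch below illustrates that metric with invented example terms; the data is not from the paper.

```python
# Precision@k against a gold-standard term list: as more ranked candidate
# terms are considered, precision is recomputed over the top-k. Terms here
# are illustrative only.

def precision_at_k(ranked_terms: list[str], gold: set[str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are gold terms."""
    top = ranked_terms[:k]
    hits = sum(1 for t in top if t in gold)
    return hits / k

gold = {"haute couture", "hemline", "capsule wardrobe"}
ranked = ["hemline", "haute couture", "season", "capsule wardrobe", "style"]

p_at_2 = precision_at_k(ranked, gold, 2)
p_at_5 = precision_at_k(ranked, gold, 5)
```

A tool whose precision@k degrades quickly as k grows matches the behaviour reported for TBXTools and SketchEngine, while a flat or rising curve matches ChatGPT's.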
AskBeacon -- Performing genomic data exchange and analytics with natural language
Wickramarachchi, Anuradha, Tonni, Shakila, Majumdar, Sonali, Karimi, Sarvnaz, Kõks, Sulev, Hosking, Brendan, Rambla, Jordi, Twine, Natalie A., Jain, Yatish, Bauer, Denis C.
For the two investigated workflows, there are significant differences in the prediction of variant terms and additional phenotypic filtering terms. An intuitive comparison between the parallel and multistep extraction models is that, in the parallel workflow, the models' instructions are rather simple: the model is asked to predict only variant-specific fields (variants extractor template) and other fields (filter extractor template), without regard for whether those fields are present in the Beacon schema. Not all terms extracted by this extractor chain are valid for Beacon, so a validator template is additionally required to filter out the terms that are not related to Beacon. In contrast, in the multistep workflow, both the variant and phenotypic terms are extracted only when they match the Beacon schema, without the need for a validation prompt. Thus, although these models predict fewer terms, the extracted terms are aligned with the schema, with less hallucination than in the parallel workflow, as seen in the previous section.
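The validation step that the parallel workflow needs can be sketched as a simple schema filter. The field names below are a simplified, hypothetical stand-in for the real Beacon schema, and `patientMood` is an invented example of a hallucinated field.

```python
# Sketch of validating extracted terms against a (simplified, hypothetical)
# Beacon-style schema: any field the extractors produced that the schema
# does not define is dropped.

BEACON_FIELDS = {"referenceName", "start", "end", "alternateBases", "assemblyId"}

def filter_to_schema(extracted: dict) -> dict:
    """Keep only extracted fields that exist in the schema."""
    return {k: v for k, v in extracted.items() if k in BEACON_FIELDS}

parallel_output = {
    "referenceName": "chr1",
    "start": 55039974,
    "alternateBases": "T",
    "patientMood": "anxious",  # hallucinated field, not in the schema
}
valid = filter_to_schema(parallel_output)
```

The multistep workflow avoids this post-hoc filter by constraining extraction to schema fields from the start, which is why it hallucinates less.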
CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature
Delaunay, Julien, Tran, Hanh Thi Hong, González-Gallardo, Carlos-Emiliano, Bordea, Georgeta, Ducos, Mathilde, Sidere, Nicolas, Doucet, Antoine, Pollak, Senja, De Viron, Olivier
The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and an F1 of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
PyABSA: A Modularized Framework for Reproducible Aspect-based Sentiment Analysis
Yang, Heng, Zhang, Chen, Li, Ke
The advancement of aspect-based sentiment analysis (ABSA) has highlighted the lack of a user-friendly framework that can greatly lower the difficulty of reproducing state-of-the-art ABSA performance, especially for beginners. To meet this demand, we present PyABSA, a modularized framework built on PyTorch for reproducible ABSA. To facilitate ABSA research, PyABSA supports several ABSA subtasks, including aspect term extraction, aspect sentiment classification, and end-to-end aspect-based sentiment analysis. Concretely, PyABSA integrates 29 models and 26 datasets. With just a few lines of code, the result of a model on a specific dataset can be reproduced. With a modularized design, PyABSA can also be flexibly extended to accommodate new models, datasets, and other related tasks. Besides, PyABSA highlights its data augmentation and annotation features, which significantly address data scarcity. All are welcome to have a try at https://github.com/yangheng95/PyABSA.
Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings
Färber, Michael, Popovic, Nicholas
In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education.
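The suggestion mechanism behind such a tool can be sketched as nearest-neighbour search in embedding space. The vectors below are toy values standing in for real word embeddings or ConceptNet-derived vectors, and the vocabulary is invented for illustration.

```python
# Toy sketch of embedding-based related-term suggestion: terms closest to a
# seed term by cosine similarity are proposed to the user for confirmation
# or rejection.
import numpy as np

vocab = {
    "battery":   np.array([0.9, 0.1, 0.0]),
    "capacitor": np.array([0.8, 0.2, 0.1]),
    "poetry":    np.array([0.0, 0.1, 0.9]),
}

def suggest(seed: str, k: int = 1) -> list[str]:
    """Return the k terms most similar to `seed` (excluding itself)."""
    s = vocab[seed]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cos(s, v), t) for t, v in vocab.items() if t != seed]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

nearest = suggest("battery")
```

An ensemble, as in Vocab-Expander, would average or vote over suggestions from several embedding spaces rather than using a single one.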
The Recent Advances in Automatic Term Extraction: A survey
Tran, Hanh Thi Hong, Martinc, Matej, Caporusso, Jaya, Doucet, Antoine, Pollak, Senja
Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.
Ensembling Transformers for Cross-domain Automatic Term Extraction
Tran, Hanh Thi Hong, Martinc, Matej, Pelicon, Andraz, Doucet, Antoine, Pollak, Senja
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformer-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by conducting the intersection or union on the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models for all languages except Dutch and French, when the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best-performing models, we achieve significant improvements.
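The ensembling strategy described above reduces to set operations over each model's predicted term list. The term lists below are illustrative, not from the ACTER or RSDO5 experiments.

```python
# Sketch of ensembling two term extractors: union of the predicted term
# sets trades precision for recall, intersection does the opposite.

model_a = {"wind turbine", "rotor blade", "grid", "the energy"}  # noisier model
model_b = {"wind turbine", "rotor blade", "nacelle"}

union_terms = model_a | model_b          # favours recall
intersection_terms = model_a & model_b   # favours precision
```

Which combination wins in practice depends on whether the downstream use penalises missed terms or spurious ones more heavily.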