A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem

Truică, Ciprian-Octavian, Istrate, Neculai-Ovidiu, Apostol, Elena-Simona

May-24-2023, arXiv.org Artificial Intelligence

Automatic Term Recognition (ATR) is used to extract the domain-specific terms that make up the terminology of a domain. A term can be defined as a linguistic structure or a concept; it is composed of one or more words that carry a meaning specific to a domain. With the exponential growth of technical and scientific articles, new domain-specific terms appear daily as named entities (e.g., Apache Spark), idioms (e.g., Big Data), multi-word expressions (e.g., recurrent neural networks), or through semantic change and shifts (e.g., local neighborhood). Methods that can automatically recognize and extract these domain-specific terms are useful for both scientists and professionals to improve existing systems (e.g., WordNet [4], OntoLex-FRaC [3]) that deal with linguistics, terminology, and machine-readable technologies. ATR methods [6, 7, 5, 8, 9, 11] consist of two main phases. The first phase extracts a list of candidate terms, which scoring metrics later rank by their importance to a given domain. To extract this list, words are tagged with their part of speech (PoS), and candidate multi-word terms are extracted using language-dependent linguistic filters [10]. The second phase is specific to each method and involves computing a domain-relevance score from different term statistics, e.g., frequency, context, number of similar terms, etc. Because ATR methods are often applied to large volumes of textual data, the extraction and recognition of domain-specific terms can be improved by developing the methods on top of distributed ecosystems such as Apache Hadoop and Apache Spark.
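
The two phases described above can be illustrated with a minimal single-machine sketch in Python. It is not the architecture proposed in the paper: the PoS-tagged input is hard-coded and hypothetical, the linguistic filter (a maximal run of adjectives and nouns ending in a noun) is an assumed stand-in for a language-dependent filter, and raw frequency is used as a placeholder for a method-specific domain-relevance score. The function names `extract_candidates` and `score_candidates` are illustrative only.

```python
from collections import Counter

# Hypothetical, pre-tagged input: each sentence is a list of (word, PoS) pairs.
# A real pipeline would obtain these tags from a PoS tagger.
TAGGED_SENTENCES = [
    [("recurrent", "ADJ"), ("neural", "ADJ"), ("networks", "NOUN"),
     ("learn", "VERB"), ("sequences", "NOUN")],
    [("we", "PRON"), ("train", "VERB"), ("recurrent", "ADJ"),
     ("neural", "ADJ"), ("networks", "NOUN")],
    [("big", "ADJ"), ("data", "NOUN"), ("needs", "VERB"),
     ("distributed", "ADJ"), ("processing", "NOUN")],
]


def extract_candidates(tagged_sentence, min_len=2):
    """Phase 1: collect maximal ADJ/NOUN runs that end in a noun.

    Assumed filter, not the one from the paper. Runs shorter than
    min_len words are dropped, so only multi-word candidates remain.
    """
    candidates = []
    run = []
    # A sentinel token closes any run still open at the end of the sentence.
    for word, tag in tagged_sentence + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # The run has ended: trim trailing adjectives so it ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= min_len:
            candidates.append(" ".join(w.lower() for w, _ in run))
        run = []
    return candidates


def score_candidates(tagged_sentences):
    """Phase 2 (placeholder): rank candidates by raw corpus frequency."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(extract_candidates(sentence))
    return counts.most_common()


ranked = score_candidates(TAGGED_SENTENCES)
for term, freq in ranked:
    print(f"{freq}\t{term}")
```

In this sketch, candidate extraction works on one sentence at a time and scoring is a sum of counts, which is one reason such a pipeline might map naturally onto a map-and-aggregate style of distributed processing; how the paper actually distributes the work is not specified in this abstract.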

data mining, machine learning, natural language


Country:
- Europe (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.34)
    - Natural Language > Text Processing (0.70)
  - Data Science > Data Mining
    - Big Data (0.70)
