A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem
Truică, Ciprian-Octavian, Istrate, Neculai-Ovidiu, Apostol, Elena-Simona
arXiv.org Artificial Intelligence
Automatic Term Recognition (ATR) is used to extract the domain-specific terms that make up the terminology of a domain. A term can be defined as a linguistic structure or a concept, composed of one or more words, with a meaning specific to a domain. With the exponential growth of technical and scientific articles, new domain-specific terms appear daily as named entities (e.g., Apache Spark), idioms (e.g., Big Data), multi-word expressions (e.g., recurrent neural networks), or through semantic change and shifts (e.g., local neighborhood). Methods that can automatically recognize and extract these domain-specific terms are useful for both scientists and professionals to improve existing systems (e.g., WordNet [4], OntoLex-FRaC [3]) that deal with linguistics, terminology, and machine-readable technologies. ATR methods [6, 7, 5, 8, 9, 11] consist of two main phases. The first phase extracts a list of candidate terms, which scoring metrics later rank by their importance to a given domain. To extract this list, words are tagged with their part of speech (PoS), and candidate multi-word terms are extracted using language-dependent linguistic filters [10]. The second phase is specific to each method and involves computing a domain-relevance score from different term statistics, e.g., frequency, context, number of similar terms, etc. Users often need to process large volumes of textual data when employing ATR methods, so the extraction and recognition of domain-specific terms can be improved by developing the methods on top of distributed ecosystems such as Apache Hadoop and Apache Spark.
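The two phases described above can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the hand-tagged toy corpus, the Penn-Treebank-style tags, the "(adjective or noun)+ noun" filter, and the length-weighted frequency score are all assumptions chosen for illustration, and the function names are hypothetical. The candidate counting step is a per-sentence map followed by a sum per key, which is the kind of computation that would plausibly map onto a distributed engine.

```python
from collections import Counter
from math import log2

# Assumption: sentences are already PoS-tagged (Penn-Treebank-style tags).
# A real pipeline would obtain these tags from a PoS tagger.
TAGGED_SENTENCES = [
    [("recurrent", "JJ"), ("neural", "JJ"), ("networks", "NNS"),
     ("process", "VBP"), ("sequences", "NNS")],
    [("we", "PRP"), ("train", "VBP"), ("recurrent", "JJ"),
     ("neural", "JJ"), ("networks", "NNS")],
    [("neural", "JJ"), ("networks", "NNS"), ("need", "VBP"),
     ("big", "JJ"), ("data", "NNS")],
]


def is_candidate(tags):
    """Illustrative linguistic filter: (adjective | noun)+ noun, two or more words."""
    return (
        len(tags) >= 2
        and tags[-1].startswith("NN")
        and all(t.startswith(("NN", "JJ")) for t in tags[:-1])
    )


def extract_candidates(sentences, max_len=4):
    """Phase 1: count every contiguous span that passes the PoS filter."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent)):
            for j in range(i + 2, min(i + max_len, len(sent)) + 1):
                span = sent[i:j]
                if is_candidate([tag for _, tag in span]):
                    counts[tuple(tok.lower() for tok, _ in span)] += 1
    return counts


def rank_terms(counts):
    """Phase 2: rank candidates by a toy score, log2(term length) * frequency."""
    scored = [(" ".join(term), log2(len(term)) * freq) for term, freq in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for term, score in rank_terms(extract_candidates(TAGGED_SENTENCES)):
        print(f"{score:.2f}  {term}")
```

On this toy corpus the filter keeps three candidates, and the longer term "recurrent neural networks" ranks above the more frequent "neural networks" because the score rewards length. Published ATR methods use richer statistics than this, such as context and nested-term information.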
May-24-2023