Natural Language Computing
Testing Different Log Bases for Vector Model Weighting Technique
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are first indexed, and the words in them are assigned weights using the TF-IDF weighting technique, the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents: it is computed by dividing the total number of documents in the system by the number of documents containing the term and taking the logarithm of the quotient. By default, the logarithm is calculated in base 10. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for the vector model weighting technique highlights how the performance of the system changes with different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which were assembled explicitly for experiments on information retrieval systems.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
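As a minimal sketch of the IDF weighting with a configurable log base described in the abstract above, the following Python snippet computes TF-IDF over a toy corpus; the corpus contents and the sweep of bases are illustrative assumptions, not the paper's data or code.

```python
import math

def tf_idf(term, doc, corpus, log_base=10.0):
    """TF-IDF of `term` in `doc`, with the IDF logarithm taken in `log_base`."""
    tf = doc.count(term)                              # raw term frequency
    df = sum(1 for d in corpus if term in d)          # documents containing the term
    # For bases between 0 and 1 the log of a quotient > 1 is negative,
    # which effectively inverts the IDF contribution.
    idf = math.log(len(corpus) / df, log_base) if df else 0.0
    return tf * idf

# Toy corpus (illustrative only): each document is a list of tokens.
corpus = [["heart", "disease", "study"],
          ["lung", "disease", "trial"],
          ["heart", "rate", "monitoring"]]

for base in (0.1, 2.0, 10.0, 100.0):   # within the range of bases tested in the paper
    print(base, tf_idf("heart", corpus[0], corpus, log_base=base))
```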
A Study on the Appropriate Size of the Mongolian General Corpus
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps function and the Type-Token Ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, the economy, society, culture, sports, and world news; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps function from this sample corpus. Next, using the estimated function, we observed changes in the number of types and in the TTR while increasing the number of tokens by one million at a time. We found that the TTR hardly changed once the number of tokens reached 39 to 42 million. We therefore conclude that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Mongolia > Ulaanbaatar (0.04)
- (4 more...)
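A minimal sketch of the procedure described above, assuming Heaps' law V(N) = K·N^β: fit K and β on a small sample, then project the number of types and the TTR while growing the token count by one million at a time. The sample counts below are made up for illustration, not the study's measurements.

```python
import numpy as np

# Heaps' law: V(N) = K * N**beta, where N is tokens and V is types.
def fit_heaps(tokens, types):
    """Estimate K and beta by a least-squares fit in log-log space."""
    beta, log_k = np.polyfit(np.log(tokens), np.log(types), 1)
    return np.exp(log_k), beta

# Illustrative measurements from growing samples of a corpus (not the study's data).
tokens = np.array([100_000, 300_000, 600_000, 906_064])
types  = np.array([18_000, 38_000, 60_000, 78_000])

K, beta = fit_heaps(tokens, types)

# Project types and TTR while increasing the corpus by one million tokens at a time,
# as the study does, and watch where the TTR curve flattens out.
for n in range(1_000_000, 45_000_000, 1_000_000):
    v = K * n ** beta
    print(f"{n:>11,d} tokens  ~{v:>10,.0f} types  TTR={v / n:.4f}")
```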
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
This paper discusses the impact of the Internet on modern trading and the importance of the data generated by these transactions for organizations seeking to improve their marketing efforts. It uses the example of Divar, an online marketplace for buying and selling products and services in Iran, and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors provide background on data mining, the Persian language, and the two language models; examine the dataset's composition and statistical features; and give details of their fine-tuning and training configurations for both approaches. They present the results of their analysis and highlight the strengths and weaknesses of the two language models when applied to Persian text data. The paper offers valuable insights into the challenges and opportunities of working with low-resource languages such as Persian, and into the potential of advanced language models like BERT for analyzing such data. It also explains the data mining process, including steps such as data cleaning and normalization, and discusses the types of machine learning problems (supervised, unsupervised, and reinforcement learning) as well as pattern evaluation techniques such as the confusion matrix. Overall, the paper provides an informative overview of the use of language models and data mining techniques for analyzing text data in low-resource languages, using the example of the Divar dataset.
- Asia > Middle East > Iran (0.25)
- Asia > Afghanistan (0.05)
- North America > United States > Ohio (0.04)
- (4 more...)
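A small sketch, under stated assumptions, of the kind of Persian preprocessing and tokenization pipeline described above: Hazm for normalization and tokenization, and Hugging Face tokenizers for mBERT and ParsBERT. The ParsBERT checkpoint name and the example ad text are assumptions, not taken from the paper.

```python
from hazm import Normalizer, word_tokenize          # Persian text preprocessing
from transformers import AutoTokenizer

normalizer = Normalizer()

def preprocess(text: str):
    """Normalize Persian text with Hazm, then tokenize it into words."""
    return word_tokenize(normalizer.normalize(text))

# Checkpoint names are assumptions about the publicly released models.
mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
parsbert_tok = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

ad = "فروش پراید مدل ۱۳۹۵ در تهران"   # a made-up car-sale ad, for illustration only
print(preprocess(ad))
print(mbert_tok.tokenize(ad))
print(parsbert_tok.tokenize(ad))
```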
Stress Test for BERT and Deep Models: Predicting Words from Italian Poetry
Delmonte, Rodolfo, Busetto, Nicolò
In this paper we present a set of experiments carried out with BERT on a number of Italian sentences taken from the poetry domain. The experiments are organized around the hypothesis of a very high level of difficulty in predictability at the three levels of linguistic complexity that we intend to monitor: the lexical, syntactic, and semantic levels. To test this hypothesis we ran the Italian version of BERT on 80 sentences - for a total of 900 tokens - mostly extracted from Italian poetry of the first half of the last century. We then used sentences from the newswire domain containing similar syntactic structures. The results show that the DL model is highly sensitive to the presence of non-canonical structures. However, DL models are also very sensitive to word frequency and to local non-literal meaning composition effects. This is also apparent from the preference for predicting function words over content words, and collocates over infrequent word phrases. In the paper, we focus our attention on BERT's use of subword units for out-of-vocabulary words.
INTRODUCTION. In this paper we report the results of an extremely complex task for BERT: predicting the masked word in sentences extracted from Italian poetry of the beginning of the last century, using the output of the first projection layer of a Deep Learning model, the raw word embeddings. We decided to work on Italian to highlight its differences from English in an extended number of relevant linguistic properties. The underlying hypothesis aims at testing the ability of BERT [1] to predict masked words in increasingly complex contexts. To verify this hypothesis we selected sentences that exhibit two important features of Italian texts: non-canonicity and the presence of words with very low or rare frequency. To better evaluate the impact of these two factors on word predictability we created a word predictability measure based on a combination of scoring functions for context and for word frequency of (co-)occurrence. The experiment uses BERT on the assumption that DNNs can be regarded as capable of modeling the behaviour of the human brain in predicting the next word given a sentence and a text corpus - but see the following section. It is usually the case that paradigmatic and syntagmatic properties of words in a sentence are tested separately.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (4 more...)
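A simplified sketch of the masked-word prediction task described above, using a standard masked-LM head via the Hugging Face fill-mask pipeline rather than the authors' raw first-layer embeddings; the Italian checkpoint name and the example sentence are assumptions.

```python
from transformers import pipeline

# Checkpoint name is an assumption; any Italian BERT with a masked-LM head would do.
fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-cased")

# An invented Italian sentence with non-canonical word order; the model must
# recover the masked content word from its context.
sentence = "Nel silenzio della sera, lenta scendeva la [MASK] sui tetti."

for pred in fill(sentence, top_k=5):
    print(f"{pred['token_str']:>12s}  {pred['score']:.3f}")
```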
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition
Behre, Piyush, Tan, Sharman, Varadharajan, Padma, Chang, Shuangyu
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcription still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence-tagging models are effective at capturing long bidirectional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. Context within the segments produced by ASR decoders can be helpful, but it limits overall punctuation performance for a continuous speech session. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows, and we measure its impact on punctuation and segmentation accuracy across scenarios. Streaming punctuation achieves an average BLEU score improvement of 0.66 for the downstream task of Machine Translation (MT).
INTRODUCTION. Our hybrid Automatic Speech Recognition (ASR) system generates punctuation with two components working together. First, the decoder generates text segments and passes them to the Display Post Processor (DPP).
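A conceptual sketch of the dynamic-decoding-window idea, not the authors' production system: tokens are re-punctuated over overlapping windows, and only a stable prefix of each window is committed so that later context can still revise the tail. The window sizes and the `punctuate` tagger below are placeholders.

```python
def stream_punctuate(token_stream, punctuate, window=64, commit=48):
    """Re-punctuate a running ASR transcript over overlapping dynamic windows.

    `punctuate(tokens)` stands in for any sequence-tagging model that returns
    the tokens with punctuation attached. Only the first `commit` tokens of
    each window are emitted; the tail is carried over so future context can
    still change its punctuation.
    """
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= window:
            decided = punctuate(buffer)[:commit]   # punctuation fixed with right context
            yield from decided
            buffer = buffer[commit:]               # overlap: undecided tail carries over
    if buffer:
        yield from punctuate(buffer)               # flush at the end of the session

# Toy stand-in for a transformer tagger: end a "sentence" every 12 tokens.
def toy_punctuate(tokens):
    return [t + ("." if i % 12 == 11 else "") for i, t in enumerate(tokens)]

print(list(stream_punctuate(iter(["word"] * 100), toy_punctuate)))
```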
Integrating Extracted Information from BERT and Multiple Embedding Methods with the Deep Neural Network for Humour Detection
Humour detection from sentences has been an interesting and challenging task in the last few years. Most previous research on humour detection used traditional embedding approaches, e.g., Word2Vec or GloVe; more recently, BERT sentence embeddings have also been used for this task. In this paper, we propose a framework for humour detection in short texts taken from news headlines. Our proposed framework (IBEN) attempts to extract information from written text via different layers of BERT. After several trials, weights were assigned to the different layers of the BERT model. The extracted information was then fed to a Bi-GRU neural network as an embedding matrix. We also utilized the properties of some external embedding models, and employed a multi-kernel convolution in our neural network to extract higher-level sentence representations. This framework performed very well on the task of humour detection.
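A rough PyTorch sketch in the spirit of the framework described above: a learned weighting over BERT's layers, a Bi-GRU over the mixed representation, and multi-kernel convolutions for sentence-level features. The checkpoint name, hidden sizes, and kernel widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class WeightedBertBiGRU(nn.Module):
    """Weighted sum of BERT layers -> Bi-GRU -> multi-kernel convolutions."""

    def __init__(self, model_name="bert-base-uncased", hidden=128, kernels=(2, 3, 4)):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        n_layers = self.bert.config.num_hidden_layers + 1           # +1 for the embedding layer
        self.layer_weights = nn.Parameter(torch.ones(n_layers))     # learned per-layer weights
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList([nn.Conv1d(2 * hidden, hidden, k) for k in kernels])
        self.classifier = nn.Linear(hidden * len(kernels), 2)       # humorous / not humorous

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids, attention_mask=attention_mask).hidden_states
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, states))              # weighted layer sum
        seq, _ = self.gru(mixed)                                     # (batch, len, 2*hidden)
        feats = [torch.relu(c(seq.transpose(1, 2))).max(dim=2).values for c in self.convs]
        return self.classifier(torch.cat(feats, dim=1))
```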
Topic Extraction of Crawled Documents Collection using Correlated Topic Model in MapReduce Framework
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from scalability problems as the size of the document collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has comparable performance, in terms of topic coherence, to LDA implemented in the MapReduce framework.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Myanmar > Yangon Region > Yangon (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (4 more...)
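A minimal sketch of the MapReduce split for one variational EM iteration: mappers run a per-document E-step and emit expected topic-word counts, a reducer sums them, and the M-step renormalizes. The per-document E-step below is a uniform placeholder so the sketch runs; it is not the paper's CTM update.

```python
from collections import defaultdict
from functools import reduce

K = 4   # number of topics (illustrative)

def e_step(doc):
    """Placeholder for the per-document variational E-step of CTM.
    It assigns uniform topic responsibilities so the sketch is runnable."""
    return [[1.0 / K] * K for _ in doc]

def map_doc(doc):
    """Mapper: emit expected topic-word counts for one document."""
    stats = defaultdict(float)
    for (word_id, count), topic_probs in zip(doc, e_step(doc)):
        for k, p in enumerate(topic_probs):
            stats[(k, word_id)] += count * p
    return stats

def reduce_stats(a, b):
    """Reducer: sum sufficient statistics coming from different mappers."""
    for key, value in b.items():
        a[key] += value
    return a

def m_step(corpus):
    """One EM iteration: map over documents, reduce, then renormalize per topic.
    On a cluster the map/reduce calls are distributed; here they run locally."""
    totals = reduce(reduce_stats, map(map_doc, corpus), defaultdict(float))
    topic_norm = defaultdict(float)
    for (k, _), v in totals.items():
        topic_norm[k] += v
    return {(k, w): v / topic_norm[k] for (k, w), v in totals.items()}

# Tiny corpus of (word_id, count) pairs, purely illustrative.
corpus = [[(0, 2), (3, 1)], [(1, 1), (3, 2)], [(2, 4)]]
print(m_step(corpus))
```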
Annotated Guidelines and Building Reference Corpus for Myanmar-English Word Alignment
A reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For the Myanmar-English language pair, there is no reference corpus for evaluating word alignment tasks. Therefore, we created guidelines for Myanmar-English word alignment annotation, based on a contrastive study of the two languages, and built a Myanmar-English reference corpus consisting of verified alignments from the Myanmar portion of the Asian Language Treebank (ALT). This reference corpus contains confidence labels, sure (S) and possible (P), for word alignments, which are used for evaluating word alignment tasks. We discuss the most common linking ambiguities in order to define consistent and systematic instructions for manual word alignment. We evaluated inter-annotator agreement on our reference corpus in terms of alignment error rate (AER) for word alignment tasks, and we discuss word relationships in terms of BLEU scores. A bilingual corpus aligned at the level of sentences or words is a precious resource for developing machine translation systems. Word alignment is a fundamental step in extracting translation information from a bilingual corpus: it determines which words and phrases are translations of each other in the original and translated sentences. In most translation systems, translational correspondences are rather complex, especially for a language pair such as Myanmar and English, which have different word orders.
- Asia > Myanmar > Mandalay Region > Mandalay (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
- (7 more...)
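A small sketch of the alignment error rate mentioned above, following the standard definition over sure (S) and possible (P) links; the example links are invented for illustration.

```python
def alignment_error_rate(sure, possible, hypothesis):
    """Alignment Error Rate of a hypothesis alignment A against gold sure (S)
    and possible (P) links:  AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|).
    Links are (source_index, target_index) pairs; S is assumed to be a subset of P."""
    a, s, p = set(hypothesis), set(sure), set(possible) | set(sure)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Illustrative gold and hypothesis links for one Myanmar-English sentence pair.
S = {(0, 0), (1, 2), (2, 1)}          # sure links
P = S | {(3, 3)}                      # possible links include all sure links
A = {(0, 0), (1, 2), (3, 3), (4, 4)}  # system output

print(f"AER = {alignment_error_rate(S, P, A):.3f}")
```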
Syntactic Analysis Based on Morphological Characteristic Features of the Romanian Language
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural language processing. We suggest a real-time solution based on the idea of using words or groups of words that indicate a grammatical category, together with specific endings of certain parts of the sentence. Our idea builds on characteristics of the Romanian language, where certain prepositions, adverbs, or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can carry out a dialogue in natural language with assertive and interrogative sentences about a "story" (a set of sentences describing some events from real life).
- Europe > Romania > Nord-Est Development Region > Bacău County > Bacău (0.05)
- North America > United States > Maine (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (4 more...)
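A toy sketch of the cue-based idea described above: flag function words and sentence endings that signal clause structure and sentence type. The cue lists and the example sentence are hypothetical; the actual DIASEXP grammar is not reproduced here.

```python
# Hypothetical cue lists for illustration; not the system's actual word lists.
SUBORDINATORS = {"că", "dacă", "deoarece", "fiindcă"}   # "that", "if", "because", ...
RELATIVES = {"care", "ce", "cine"}                      # relative pronouns

def clause_markers(sentence: str):
    """Flag words that signal clause boundaries, and classify the sentence type
    from its ending, in the spirit of using lexical cues and endings."""
    tokens = sentence.lower().rstrip("?!.").split()
    marks = []
    for i, tok in enumerate(tokens):
        if tok in SUBORDINATORS:
            marks.append((i, tok, "subordinate-clause start"))
        elif tok in RELATIVES:
            marks.append((i, tok, "relative-clause start"))
    kind = "interrogative" if sentence.strip().endswith("?") else "assertive"
    return kind, marks

print(clause_markers("Ion spune că Maria, care citește, va veni dacă plouă?"))
```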
A Rule-Based Approach for Aligning Japanese-Spanish Sentences from Comparable Corpora
Ramírez, Jessica C., Matsumoto, Yuji
The performance of a Statistical Machine Translation (SMT) system is directly proportional to the quality and size of the parallel corpus it uses; however, for some language pairs such corpora are in considerably short supply. Our long-term goal is to construct a Japanese-Spanish parallel corpus to be used for SMT, yet useful Japanese-Spanish parallel corpora are lacking. To address this problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The main focus of this approach is the syntactic features of both languages. Human evaluation was performed on a sample and shows promising results in comparison with the baseline.
- North America > United States (0.14)
- North America > Dominican Republic (0.05)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- (2 more...)
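A very rough sketch of the kind of filtering such an extraction pipeline might start from, shown here with crude character-length and shared-token rules instead of the paper's POS-based rule set; the sentences and thresholds are illustrative assumptions.

```python
import re

def candidate_pairs(ja_sentences, es_sentences, max_len_ratio=2.5):
    """Keep (Japanese, Spanish) sentence pairs from linked Wikipedia articles whose
    character lengths are roughly compatible and that share numbers or Latin-script
    tokens. These rules are an illustration only, not the paper's rule set."""
    shared = lambda s: set(re.findall(r"\d+|[A-Z][a-zA-Z]+", s))
    pairs = []
    for ja in ja_sentences:
        for es in es_sentences:
            ratio = max(len(ja), len(es)) / max(1, min(len(ja), len(es)))
            if ratio <= max_len_ratio and shared(ja) & shared(es):
                pairs.append((ja, es))
    return pairs

# Invented example sentences from a hypothetical linked article pair.
ja = ["東京タワーは1958年に完成した。", "天気が良い。"]
es = ["La Torre de Tokio se terminó en 1958.", "Hace buen tiempo hoy en la ciudad."]
print(candidate_pairs(ja, es))
```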