Natural Language Computing
Testing Different Log Bases for Vector Model Weighting Technique
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are first indexed, and the words in them are assigned weights using the TF-IDF weighting technique, the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents: it is computed by dividing the total number of documents in the system by the number of documents containing the term and taking the logarithm of the quotient. By default, the logarithm is calculated in base 10. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for the vector model weighting technique highlights how the performance of the system changes with different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which were assembled explicitly for experiments on information retrieval systems.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
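As a minimal sketch of the IDF weighting with a configurable log base described in the abstract above, the following Python snippet computes TF-IDF over a toy corpus; the corpus contents and the sweep of bases are illustrative assumptions, not the paper's data or code.

```python
import math

def tf_idf(term, doc, corpus, log_base=10.0):
    """TF-IDF of `term` in `doc`, with the IDF logarithm taken in `log_base`."""
    tf = doc.count(term)                              # raw term frequency
    df = sum(1 for d in corpus if term in d)          # documents containing the term
    # For bases between 0 and 1 the log of a quotient > 1 is negative,
    # which effectively inverts the IDF contribution.
    idf = math.log(len(corpus) / df, log_base) if df else 0.0
    return tf * idf

# Toy corpus (illustrative only): each document is a list of tokens.
corpus = [["heart", "disease", "study"],
          ["lung", "disease", "trial"],
          ["heart", "rate", "monitoring"]]

for base in (0.1, 2.0, 10.0, 100.0):   # within the range of bases tested in the paper
    print(base, tf_idf("heart", corpus[0], corpus, log_base=base))
```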
A Study on the Appropriate Size of the Mongolian General Corpus
This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps function and the Type-Token Ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, the economy, society, culture, sports, and world news; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps function from this sample corpus. Next, using the estimated function, we observed changes in the number of types and in the TTR while increasing the number of tokens by one million at a time. We found that the TTR hardly changed once the number of tokens reached 39 to 42 million. We therefore conclude that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Mongolia > Ulaanbaatar (0.04)
- (4 more...)
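A minimal sketch of the procedure described above, assuming Heaps' law V(N) = K·N^β: fit K and β on a small sample, then project the number of types and the TTR while growing the token count by one million at a time. The sample counts below are made up for illustration, not the study's measurements.

```python
import numpy as np

# Heaps' law: V(N) = K * N**beta, where N is tokens and V is types.
def fit_heaps(tokens, types):
    """Estimate K and beta by a least-squares fit in log-log space."""
    beta, log_k = np.polyfit(np.log(tokens), np.log(types), 1)
    return np.exp(log_k), beta

# Illustrative measurements from growing samples of a corpus (not the study's data).
tokens = np.array([100_000, 300_000, 600_000, 906_064])
types  = np.array([18_000, 38_000, 60_000, 78_000])

K, beta = fit_heaps(tokens, types)

# Project types and TTR while increasing the corpus by one million tokens at a time,
# as the study does, and watch where the TTR curve flattens out.
for n in range(1_000_000, 45_000_000, 1_000_000):
    v = K * n ** beta
    print(f"{n:>11,d} tokens  ~{v:>10,.0f} types  TTR={v / n:.4f}")
```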
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
This paper discusses the impact of the Internet on modern trading and the importance of the data generated by these transactions for organizations seeking to improve their marketing efforts. It uses the example of Divar, an online marketplace for buying and selling products and services in Iran, and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The paper's primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors provide background on data mining, the Persian language, and the two language models; examine the dataset's composition and statistical features; and give details of their fine-tuning and training configurations for both approaches. They present the results of their analysis and highlight the strengths and weaknesses of the two language models when applied to Persian text data. The paper offers valuable insights into the challenges and opportunities of working with low-resource languages such as Persian, and into the potential of advanced language models like BERT for analyzing such data. It also explains the data mining process, including steps such as data cleaning and normalization, and discusses the types of machine learning problems (supervised, unsupervised, and reinforcement learning) as well as pattern evaluation techniques such as the confusion matrix. Overall, the paper provides an informative overview of the use of language models and data mining techniques for analyzing text data in low-resource languages, using the example of the Divar dataset.
- Asia > Middle East > Iran (0.25)
- Asia > Afghanistan (0.05)
- North America > United States > Ohio (0.04)
- (4 more...)
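A small sketch, under stated assumptions, of the kind of Persian preprocessing and tokenization pipeline described above: Hazm for normalization and tokenization, and Hugging Face tokenizers for mBERT and ParsBERT. The ParsBERT checkpoint name and the example ad text are assumptions, not taken from the paper.

```python
from hazm import Normalizer, word_tokenize          # Persian text preprocessing
from transformers import AutoTokenizer

normalizer = Normalizer()

def preprocess(text: str):
    """Normalize Persian text with Hazm, then tokenize it into words."""
    return word_tokenize(normalizer.normalize(text))

# Checkpoint names are assumptions about the publicly released models.
mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
parsbert_tok = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

ad = "فروش پراید مدل ۱۳۹۵ در تهران"   # a made-up car-sale ad, for illustration only
print(preprocess(ad))
print(mbert_tok.tokenize(ad))
print(parsbert_tok.tokenize(ad))
```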
Stress Test for BERT and Deep Models: Predicting Words from Italian Poetry
Delmonte, Rodolfo, Busetto, Nicolò
In this paper we present a set of experiments carried out with BERT on a number of Italian sentences taken from the poetry domain. The experiments are organized around the hypothesis of a very high level of difficulty in predictability at the three levels of linguistic complexity that we intend to monitor: the lexical, syntactic, and semantic levels. To test this hypothesis we ran the Italian version of BERT on 80 sentences - for a total of 900 tokens - mostly extracted from Italian poetry of the first half of the last century. We then used sentences from the newswire domain containing similar syntactic structures. The results show that the DL model is highly sensitive to the presence of non-canonical structures. However, DL models are also very sensitive to word frequency and to local non-literal meaning composition effects. This is also apparent from the preference for predicting function words over content words, and collocates over infrequent word phrases. In the paper, we focus our attention on BERT's use of subword units for out-of-vocabulary words.
INTRODUCTION. In this paper we report the results of an extremely complex task for BERT: predicting the masked word in sentences extracted from Italian poetry of the beginning of the last century, using the output of the first projection layer of a Deep Learning model, the raw word embeddings. We decided to work on Italian to highlight its differences from English in an extended number of relevant linguistic properties. The underlying hypothesis aims at testing the ability of BERT [1] to predict masked words in increasingly complex contexts. To verify this hypothesis we selected sentences that exhibit two important features of Italian texts: non-canonicity and the presence of words with very low or rare frequency. To better evaluate the impact of these two factors on word predictability we created a word predictability measure based on a combination of scoring functions for context and for word frequency of (co-)occurrence. The experiment uses BERT on the assumption that DNNs can be regarded as capable of modeling the behaviour of the human brain in predicting the next word given a sentence and a text corpus - but see the following section. It is usually the case that paradigmatic and syntagmatic properties of words in a sentence are tested separately.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (4 more...)
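A simplified sketch of the masked-word prediction task described above, using a standard masked-LM head via the Hugging Face fill-mask pipeline rather than the authors' raw first-layer embeddings; the Italian checkpoint name and the example sentence are assumptions.

```python
from transformers import pipeline

# Checkpoint name is an assumption; any Italian BERT with a masked-LM head would do.
fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-cased")

# An invented Italian sentence with non-canonical word order; the model must
# recover the masked content word from its context.
sentence = "Nel silenzio della sera, lenta scendeva la [MASK] sui tetti."

for pred in fill(sentence, top_k=5):
    print(f"{pred['token_str']:>12s}  {pred['score']:.3f}")
```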
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition
Behre, Piyush, Tan, Sharman, Varadharajan, Padma, Chang, Shuangyu
While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcription still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence-tagging models are effective at capturing long bidirectional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. Context within the segments produced by ASR decoders can be helpful, but it limits overall punctuation performance for a continuous speech session. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows, and we measure its impact on punctuation and segmentation accuracy across scenarios. Streaming punctuation achieves an average BLEU score improvement of 0.66 for the downstream task of Machine Translation (MT).
INTRODUCTION. Our hybrid Automatic Speech Recognition (ASR) system generates punctuation with two components working together. First, the decoder generates text segments and passes them to the Display Post Processor (DPP).
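A conceptual sketch of the dynamic-decoding-window idea, not the authors' production system: tokens are re-punctuated over overlapping windows, and only a stable prefix of each window is committed so that later context can still revise the tail. The window sizes and the `punctuate` tagger below are placeholders.

```python
def stream_punctuate(token_stream, punctuate, window=64, commit=48):
    """Re-punctuate a running ASR transcript over overlapping dynamic windows.

    `punctuate(tokens)` stands in for any sequence-tagging model that returns
    the tokens with punctuation attached. Only the first `commit` tokens of
    each window are emitted; the tail is carried over so future context can
    still change its punctuation.
    """
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= window:
            decided = punctuate(buffer)[:commit]   # punctuation fixed with right context
            yield from decided
            buffer = buffer[commit:]               # overlap: undecided tail carries over
    if buffer:
        yield from punctuate(buffer)               # flush at the end of the session

# Toy stand-in for a transformer tagger: end a "sentence" every 12 tokens.
def toy_punctuate(tokens):
    return [t + ("." if i % 12 == 11 else "") for i, t in enumerate(tokens)]

print(list(stream_punctuate(iter(["word"] * 100), toy_punctuate)))
```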
Integrating Extracted Information from BERT and Multiple Embedding Methods with the Deep Neural Network for Humour Detection
Humour detection from sentences has been an interesting and challenging task in the last few years. Most previous research on humour detection used traditional embedding approaches, e.g., Word2Vec or GloVe; more recently, BERT sentence embeddings have also been used for this task. In this paper, we propose a framework for humour detection in short texts taken from news headlines. Our proposed framework (IBEN) attempts to extract information from written text via different layers of BERT. After several trials, weights were assigned to the different layers of the BERT model. The extracted information was then fed to a Bi-GRU neural network as an embedding matrix. We also utilized the properties of some external embedding models, and employed a multi-kernel convolution in our neural network to extract higher-level sentence representations. This framework performed very well on the task of humour detection.
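A rough PyTorch sketch in the spirit of the framework described above: a learned weighting over BERT's layers, a Bi-GRU over the mixed representation, and multi-kernel convolutions for sentence-level features. The checkpoint name, hidden sizes, and kernel widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class WeightedBertBiGRU(nn.Module):
    """Weighted sum of BERT layers -> Bi-GRU -> multi-kernel convolutions."""

    def __init__(self, model_name="bert-base-uncased", hidden=128, kernels=(2, 3, 4)):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        n_layers = self.bert.config.num_hidden_layers + 1           # +1 for the embedding layer
        self.layer_weights = nn.Parameter(torch.ones(n_layers))     # learned per-layer weights
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList([nn.Conv1d(2 * hidden, hidden, k) for k in kernels])
        self.classifier = nn.Linear(hidden * len(kernels), 2)       # humorous / not humorous

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids, attention_mask=attention_mask).hidden_states
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, states))              # weighted layer sum
        seq, _ = self.gru(mixed)                                     # (batch, len, 2*hidden)
        feats = [torch.relu(c(seq.transpose(1, 2))).max(dim=2).values for c in self.convs]
        return self.classifier(torch.cat(feats, dim=1))
```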
Topic Extraction of Crawled Documents Collection using Correlated Topic Model in MapReduce Framework
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from scalability problems as the size of the document collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has comparable performance, in terms of topic coherence, to LDA implemented in the MapReduce framework.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Myanmar > Yangon Region > Yangon (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (4 more...)
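A minimal sketch of the MapReduce split for one variational EM iteration: mappers run a per-document E-step and emit expected topic-word counts, a reducer sums them, and the M-step renormalizes. The per-document E-step below is a uniform placeholder so the sketch runs; it is not the paper's CTM update.

```python
from collections import defaultdict
from functools import reduce

K = 4   # number of topics (illustrative)

def e_step(doc):
    """Placeholder for the per-document variational E-step of CTM.
    It assigns uniform topic responsibilities so the sketch is runnable."""
    return [[1.0 / K] * K for _ in doc]

def map_doc(doc):
    """Mapper: emit expected topic-word counts for one document."""
    stats = defaultdict(float)
    for (word_id, count), topic_probs in zip(doc, e_step(doc)):
        for k, p in enumerate(topic_probs):
            stats[(k, word_id)] += count * p
    return stats

def reduce_stats(a, b):
    """Reducer: sum sufficient statistics coming from different mappers."""
    for key, value in b.items():
        a[key] += value
    return a

def m_step(corpus):
    """One EM iteration: map over documents, reduce, then renormalize per topic.
    On a cluster the map/reduce calls are distributed; here they run locally."""
    totals = reduce(reduce_stats, map(map_doc, corpus), defaultdict(float))
    topic_norm = defaultdict(float)
    for (k, _), v in totals.items():
        topic_norm[k] += v
    return {(k, w): v / topic_norm[k] for (k, w), v in totals.items()}

# Tiny corpus of (word_id, count) pairs, purely illustrative.
corpus = [[(0, 2), (3, 1)], [(1, 1), (3, 2)], [(2, 4)]]
print(m_step(corpus))
```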
Annotated Guidelines and Building Reference Corpus for Myanmar-English Word Alignment
A reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For the Myanmar-English language pair, there is no reference corpus for evaluating word alignment tasks. Therefore, we created guidelines for Myanmar-English word alignment annotation, based on a contrastive study of the two languages, and built a Myanmar-English reference corpus consisting of verified alignments from the Myanmar portion of the Asian Language Treebank (ALT). This reference corpus contains confidence labels, sure (S) and possible (P), for word alignments, which are used for evaluating word alignment tasks. We discuss the most common linking ambiguities in order to define consistent and systematic instructions for manual word alignment. We evaluated inter-annotator agreement on our reference corpus in terms of alignment error rate (AER) for word alignment tasks, and we discuss word relationships in terms of BLEU scores. A bilingual corpus aligned at the level of sentences or words is a precious resource for developing machine translation systems. Word alignment is a fundamental step in extracting translation information from a bilingual corpus: it determines which words and phrases are translations of each other in the original and translated sentences. In most translation systems, translational correspondences are rather complex, especially for a language pair such as Myanmar and English, which have different word orders.
- Asia > Myanmar > Mandalay Region > Mandalay (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
- (7 more...)
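A small sketch of the alignment error rate mentioned above, following the standard definition over sure (S) and possible (P) links; the example links are invented for illustration.

```python
def alignment_error_rate(sure, possible, hypothesis):
    """Alignment Error Rate of a hypothesis alignment A against gold sure (S)
    and possible (P) links:  AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|).
    Links are (source_index, target_index) pairs; S is assumed to be a subset of P."""
    a, s, p = set(hypothesis), set(sure), set(possible) | set(sure)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Illustrative gold and hypothesis links for one Myanmar-English sentence pair.
S = {(0, 0), (1, 2), (2, 1)}          # sure links
P = S | {(3, 3)}                      # possible links include all sure links
A = {(0, 0), (1, 2), (3, 3), (4, 4)}  # system output

print(f"AER = {alignment_error_rate(S, P, A):.3f}")
```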
Syntactic Analysis Based on Morphological Characteristic Features of the Romanian Language
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural language processing. We suggest a real-time solution based on the idea of using words or groups of words that indicate a grammatical category, together with specific endings of certain parts of the sentence. Our idea builds on characteristics of the Romanian language, where certain prepositions, adverbs, or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can carry out a dialogue in natural language with assertive and interrogative sentences about a "story" (a set of sentences describing some events from real life).
- Europe > Romania > Nord-Est Development Region > Bacău County > Bacău (0.05)
- North America > United States > Maine (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (4 more...)
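A toy sketch of the cue-based idea described above: flag function words and sentence endings that signal clause structure and sentence type. The cue lists and the example sentence are hypothetical; the actual DIASEXP grammar is not reproduced here.

```python
# Hypothetical cue lists for illustration; not the system's actual word lists.
SUBORDINATORS = {"că", "dacă", "deoarece", "fiindcă"}   # "that", "if", "because", ...
RELATIVES = {"care", "ce", "cine"}                      # relative pronouns

def clause_markers(sentence: str):
    """Flag words that signal clause boundaries, and classify the sentence type
    from its ending, in the spirit of using lexical cues and endings."""
    tokens = sentence.lower().rstrip("?!.").split()
    marks = []
    for i, tok in enumerate(tokens):
        if tok in SUBORDINATORS:
            marks.append((i, tok, "subordinate-clause start"))
        elif tok in RELATIVES:
            marks.append((i, tok, "relative-clause start"))
    kind = "interrogative" if sentence.strip().endswith("?") else "assertive"
    return kind, marks

print(clause_markers("Ion spune că Maria, care citește, va veni dacă plouă?"))
```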
A Rule-Based Approach for Aligning Japanese-Spanish Sentences from Comparable Corpora
Ramírez, Jessica C., Matsumoto, Yuji
The performance of a Statistical Machine Translation (SMT) system is directly proportional to the quality and size of the parallel corpus it uses; however, for some language pairs such corpora are in considerably short supply. Our long-term goal is to construct a Japanese-Spanish parallel corpus to be used for SMT, yet useful Japanese-Spanish parallel corpora are lacking. To address this problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The main focus of this approach is the syntactic features of both languages. Human evaluation was performed on a sample and shows promising results in comparison with the baseline.
- North America > United States (0.14)
- North America > Dominican Republic (0.05)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- (2 more...)
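A very rough sketch of the kind of filtering such an extraction pipeline might start from, shown here with crude character-length and shared-token rules instead of the paper's POS-based rule set; the sentences and thresholds are illustrative assumptions.

```python
import re

def candidate_pairs(ja_sentences, es_sentences, max_len_ratio=2.5):
    """Keep (Japanese, Spanish) sentence pairs from linked Wikipedia articles whose
    character lengths are roughly compatible and that share numbers or Latin-script
    tokens. These rules are an illustration only, not the paper's rule set."""
    shared = lambda s: set(re.findall(r"\d+|[A-Z][a-zA-Z]+", s))
    pairs = []
    for ja in ja_sentences:
        for es in es_sentences:
            ratio = max(len(ja), len(es)) / max(1, min(len(ja), len(es)))
            if ratio <= max_len_ratio and shared(ja) & shared(es):
                pairs.append((ja, es))
    return pairs

# Invented example sentences from a hypothetical linked article pair.
ja = ["東京タワーは1958年に完成した。", "天気が良い。"]
es = ["La Torre de Tokio se terminó en 1958.", "Hace buen tiempo hoy en la ciudad."]
print(candidate_pairs(ja, es))
```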