 inverse document frequency


Feature Engineering in Learning-to-Rank for Community Question Answering Task

arXiv.org Artificial Intelligence

Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users try to provide solutions. Many CQA forums, such as Quora, Stack Overflow, Yahoo! Answers, and Stack Exchange, exist with a lot of user-generated data. These data are leveraged in automated CQA ranking systems where similar questions (and answers) are presented in response to the query of the user. In this work, we empirically investigate a few aspects of this domain. Firstly, in addition to traditional features like TF-IDF, BM25, etc., we introduce a BERT-based feature that captures the semantic similarity between the question and answer. Secondly, most existing research has focused on features extracted only from the question part; features extracted from answers have not been explored extensively. We combine both types of features in a linear fashion. Thirdly, using our proposed concepts, we conduct an empirical investigation with different rank-learning algorithms, some of which have not been used so far in the CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze the importance of the features we use in our investigation. This work is expected to guide practitioners in selecting a better set of features for the CQA retrieval task.


Utilization of Multinomial Naive Bayes Algorithm and Term Frequency Inverse Document Frequency (TF-IDF Vectorizer) in Checking the Credibility of News Tweet in the Philippines

arXiv.org Artificial Intelligence

The digitalization of news media has become both a good indicator of progress and a signal of more threats. Media disinformation, or fake news, is one of these threats, and it is necessary to take action in fighting disinformation. This paper utilizes ground truth-based annotations and TF-IDF as feature extraction for the news articles, which are then used as a training data set for Multinomial Naive Bayes. The model has an accuracy of 99.46% in training and 88.98% in predicting unseen data. Tagging fake news as real news is a concern in the predictions, as indicated by the F1 score of 89.68%. This could lead to a negative impact. To prevent this from happening, it is suggested to further improve the corpus collection and to use ensemble machine learning to reinforce the prediction.


Understanding TF-IDF in NLP: A Comprehensive Guide

#artificialintelligence

Natural Language Processing (NLP) is an area of computer science that focuses on the interaction between human language and computers. One of the fundamental tasks of NLP is to extract relevant information from large volumes of unstructured data. In this article, we will explore one of the most popular techniques used in NLP called TF-IDF. TF-IDF is a numerical statistic that reflects the importance of a word in a document. It is commonly used in NLP to represent the relevance of a term to a document or a corpus of documents.
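As a rough illustration of the statistic described above, the following Python sketch computes TF-IDF weights for a tiny corpus. The weighting variant (term count divided by document length, multiplied by the natural log of N over document frequency), the whitespace tokenizer, the function name `tf_idf`, and the sample sentences are all assumptions made for this example; the article itself does not specify them, and libraries typically use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (one of several common variants):
    tf = count / document length, idf = ln(N / document frequency).
    Documents are assumed to be non-empty strings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew away",
]
scores = tf_idf(docs)
```

Under these assumptions, a word that occurs in every document (here "the") receives a weight of zero, while a word confined to one document (such as "cat") receives the highest weight in that document.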


Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

arXiv.org Artificial Intelligence

The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to numeric vectors is a complex task where algorithms such as tokenization, stopword filtering, stemming, and weighting of terms are used. The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents. To improve the weighting of terms, a large number of TF-IDF extensions are made. In this paper, another extension of the TF-IDF method is proposed where synonyms are taken into account. The effectiveness of the method is confirmed by experiments on functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language.
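For readers unfamiliar with the three similarity functions named above, here is a minimal Python sketch of their standard textbook definitions. It does not reproduce the paper's synonym-aware extension, whose details are not given in the abstract; the function names, the choice to apply cosine to weighted vectors and Dice and Jaccard to term sets, and the sample weights are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dice(x, y):
    """Dice coefficient between two term sets: 2|X & Y| / (|X| + |Y|)."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """Jaccard index between two term sets: |X & Y| / |X | Y|."""
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)


# Hypothetical term weights for two documents.
doc_a = {"text": 1.0, "similarity": 2.0, "kazakh": 1.0}
doc_b = {"text": 1.0, "similarity": 1.0, "synonyms": 1.0}

sim_cos = cosine(doc_a, doc_b)
sim_dice = dice(set(doc_a), set(doc_b))
sim_jac = jaccard(set(doc_a), set(doc_b))
```

All three scores lie between 0 and 1 for non-negative weights, with higher values indicating more shared vocabulary.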


Accuracy of the Uzbek stop words detection: a case study on "School corpus"

arXiv.org Artificial Intelligence

Stop words are very important for information retrieval and text analysis tasks in natural language processing. The current work presents a method to evaluate the quality of stop word lists produced by automatic creation techniques. Although the method proposed in this paper was tested on an automatically generated list of stop words for the Uzbek language, it can be applied, with some modifications, to similar languages, either from the same family or ones that have an agglutinative nature. Since the Uzbek language belongs to the family of agglutinative languages, the automatic detection of stop words in the language is a more complex process than in inflected languages. Moreover, we integrated our previous work on stop word detection, using the "School corpus" as an example, by investigating how to automatically analyse the detection of stop words in Uzbek texts. This work is devoted to answering whether there is a good way of evaluating available stop words for Uzbek texts, or whether it is possible to determine what part of the Uzbek sentence contains the majority of the stop words by studying the numerical characteristics of the probability of unique words. The results show acceptable accuracy of the stop word lists.


Theory Behind the Basics of NLP - Analytics Vidhya

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Natural Language Processing (NLP) can help you understand the sentiment of any text. This helps people understand the emotions and the type of text they are looking at. Negative and positive comments can be easily differentiated. NLP aims to make machines understand a text or comment the same way humans can.


Three Unique Architectures For Deep Learning Based Recommendation Systems

#artificialintelligence

Deep learning based recommendation system architectures make use of multiple simpler approaches in order to remediate the shortcomings of any single approach to extracting, transforming and vectorizing a large corpus of data into a useful recommendation for an end user. High-level extraction architectures are useful for categorization, but lack accuracy. Low-level extraction approaches will produce committed decisions about what to recommend, but, since they lack context, their recommendations may be banal, repetitive or even recursive, creating unintelligent 'content bubbles' for the user. High-level architectures cannot 'zoom in' meaningfully, and low-level architectures cannot 'step back' to understand the bigger picture that the data is presenting. In this article we'll take a look at three unique approaches that reconcile these two needs into effective and unified frameworks suitable for recommender systems.


Combining NLP and Machine Learning for Document Classification

#artificialintelligence

Text mining is a popular topic for exploring what text you have in documents. Text mining and NLP can help you discover different patterns in the text, from uncovering certain words or phrases that are commonly used to identifying patterns and linkages between different texts or documents. Building on this text mining work, you can use word clouds, time-series analysis, etc. to discover other aspects and patterns in the text. Check out my previous blog posts (post 1, post 2) on performing text mining on documents (manifestos from some of the political parties from the last two national government elections in Ireland). These two posts give you a simple indication of what is possible.


Text preprocessing techniques- Twitter Data

#artificialintelligence

Text files contain enormous amounts of information. Language data analysis is one of the most difficult tasks for a computer to perform, since a computer cannot understand the semantics of text. To make analysis possible, we convert text data into a machine-readable format. Text processing converts data in text format to numerical values (or vectors), so that these vectors can be given to the machine as input and analysed with algebraic principles. However, some information may be lost in this conversion.
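One way to picture the conversion from text to vectors, and the kind of loss it can involve, is a simple bag-of-words count vector. The sketch below is an assumption about what such a conversion might look like, not the article's own method; the helper names `build_vocabulary` and `to_count_vector` and the sample texts are hypothetical.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_count_vector(text, vocab):
    """Return a list of token counts aligned with the vocabulary columns."""
    vector = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:  # Tokens outside the vocabulary are ignored.
            vector[vocab[tok]] += 1
    return vector


# Hypothetical short texts standing in for tweets.
tweets = ["dog bites man", "man bites dog", "man walks dog"]
vocab = build_vocabulary(tweets)
vectors = [to_count_vector(t, vocab) for t in tweets]
```

In this representation the first two texts map to the same vector even though they mean different things, because word order is discarded. That is one concrete form the loss mentioned above can take.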


Have you thought-How Computer Interacts with the Humans?

#artificialintelligence

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. NLP gives a computer the ability to understand, analyze, manipulate, and potentially generate human language. Just 21% of the available data is present in organized form in the 21st century. Millions of tweets, emails and web searches are generated daily, resulting in a huge amount of data that increases by the minute. Most of this data is unstructured text. Natural Language Processing plays an important role in structuring data. Sentiment analysis is the interpretation and classification of emotions as positive, negative or neutral within text data using text analysis techniques.