 Discourse & Dialogue


Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

arXiv.org Artificial Intelligence

With the ongoing growth in the number of digital articles published in an ever wider set of languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the number of scenarios in which the technique can be applied and makes it difficult to scale up to situations where a huge collection of multi-lingual documents is required during the training phase. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multi-lingual concepts from independently-trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpora reveal promising results on classifying and sorting documents by similar content.


Discovering Airline-Specific Business Intelligence from Online Passenger Reviews: An Unsupervised Text Analytics Approach

arXiv.org Artificial Intelligence

To understand the important dimensions of service quality from the passenger's perspective and tailor service offerings for competitive advantage, airlines can capitalize on the abundantly available online customer reviews (OCR). The objective of this paper is to discover company- and competitor-specific intelligence from OCR using an unsupervised text analytics approach. First, the key aspects (or topics) discussed in the OCR are extracted using three topic models - probabilistic latent semantic analysis (pLSA) and two variants of Latent Dirichlet allocation (LDA-VI and LDA-GS). Subsequently, we propose an ensemble-assisted topic model (EA-TM), which integrates the individual topic models, to classify each review sentence to the most representative aspect. Likewise, to determine the sentiment corresponding to a review sentence, an ensemble sentiment analyzer (E-SA), which combines the predictions of three opinion mining methods (AFINN, SentiStrength, and VADER), is developed. An aspect-based opinion summary (AOS), which provides a snapshot of passenger-perceived strengths and weaknesses of an airline, is established by consolidating the sentiments associated with each aspect. Furthermore, a bi-gram analysis of the labeled OCR is employed to perform root cause analysis within each identified aspect. A case study involving 99,147 airline reviews of a US-based target carrier and four of its competitors is used to validate the proposed approach. The results indicate that a cost- and time-effective performance summary of an airline and its competitors can be obtained from OCR. Finally, besides providing theoretical and managerial implications based on our results, we also provide implications for post-pandemic preparedness in the airline industry considering the unprecedented impact of coronavirus disease 2019 (COVID-19) and predictions on similar pandemics in the future.
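The abstract does not say how the E-SA combines the three analyzers' predictions. As a minimal sketch only, assuming a simple majority vote over polarity scores that have already been rescaled to the range [-1, 1], the idea could look like this. The function name `ensemble_label` and the `threshold` value are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ensemble_label(scores, threshold=0.05):
    """Combine polarity scores from several analyzers by majority vote.

    Assumption: each score is already rescaled to [-1, 1]; the paper's
    actual combination rule is not described in the abstract.
    """
    def to_label(score):
        # Map a numeric score to a coarse polarity class.
        if score > threshold:
            return "positive"
        if score < -threshold:
            return "negative"
        return "neutral"

    votes = Counter(to_label(s) for s in scores)
    top_label, top_count = votes.most_common(1)[0]
    # Without a strict majority, fall back to the neutral class.
    if top_count <= len(scores) / 2:
        return "neutral"
    return top_label


# Hypothetical scores from three analyzers for one review sentence.
print(ensemble_label([0.6, 0.2, -0.4]))
```

A majority vote is only one possible design; weighted averaging of the scores would be an equally plausible reading of "combines the predictions".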


"Thought I'd Share First": An Analysis of COVID-19 Conspiracy Theories and Misinformation Spread on Twitter

arXiv.org Machine Learning

Background: Misinformation spread through social media is a growing problem, and the emergence of COVID-19 has caused an explosion in new activity and renewed focus on the resulting threat to public health. Given this increased visibility, in-depth analysis of COVID-19 misinformation spread is critical to understanding the evolution of ideas with potential negative public health impact. Methods: Using a curated data set of COVID-19 tweets (N ~120 million tweets) spanning late January to early May 2020, we applied methods including regular expression filtering, supervised machine learning, sentiment analysis, geospatial analysis, and dynamic topic modeling to trace the spread of misinformation and to characterize novel features of COVID-19 conspiracy theories. Results: Random forest models for four major misinformation topics provided mixed results, with narrowly-defined conspiracy theories achieving F1 scores of 0.804 and 0.857, while more broad theories performed measurably worse, with scores of 0.654 and 0.347. Despite this, analysis using model-labeled data was beneficial for increasing the proportion of data matching misinformation indicators. We were able to identify distinct increases in negative sentiment, theory-specific trends in geospatial spread, and the evolution of conspiracy theory topics and subtopics over time. Conclusions: COVID-19 related conspiracy theories show that history frequently repeats itself, with the same conspiracy theories being recycled for new situations. We use a combination of supervised learning, unsupervised learning, and natural language processing techniques to look at the evolution of theories over the first four months of the COVID-19 outbreak, how these theories intertwine, and to hypothesize on more effective public health messaging to combat misinformation in online spaces.


Sentiment Analysis (Opinion Mining) with Python -- NLP Tutorial

#artificialintelligence

A "sentiment" is generally a binary opposition in opinions that expresses feelings in the form of emotions, attitudes, opinions, and so on. By using machine learning methods and natural language processing, we can extract the subjective information of a document and attempt to classify it according to its polarity, such as positive, neutral, or negative. This makes sentiment analysis instrumental in determining the overall opinion about a defined objective, for instance, an item for sale, or in predicting stock market movements for a given company. Sentiment analysis is challenging and far from being solved, since most languages are highly complex (objectivity, subjectivity, negation, vocabulary, grammar, and others). However, that is what makes it exciting to work on [1].


Aspect Based Sentiment Analysis

#artificialintelligence

We live in a world which is more opinionated than ever. Any service that we consume leaves us either satisfied or unsatisfied. And with the advent of social media, we make our views public in no time. Vast sources of data are available in the form of reviews, customer satisfaction surveys, customer complaints, etc. Businesses can use this data to understand what customers are talking about, and make data-driven decisions to improve their services. Let's talk in terms of Machine Learning now! Sentiment Analysis is the process of understanding how satisfied customers are with a service.


The Top 5 Data Science Libraries

#artificialintelligence

There are several articles detailing beneficial Data Science libraries, as well as packages, platforms, and modules, so I am going to do my best to choose not only the top libraries, but also ones that are unique, in order to reduce redundancy. As a professional Data Scientist, I have not only heard that the data part of the process consumes a lot of your time in everyday work, but I have also experienced it. Some of the libraries I will discuss were chosen with that in mind, like pandas_profiling. Additionally, I have worked not just with numeric data, but also with text data, which requires a lot of preprocessing and can be helped by libraries like nltk, textblob, and pyldavis. Lastly, some of these libraries, like networkx, also work well as visualization tools.


A Sentiment Analysis Approach to the Prediction of Market Volatility

arXiv.org Artificial Intelligence

Prediction and quantification of future volatility and returns play an important role in financial modelling, both in portfolio optimization and risk management. Natural language processing today makes it possible to process news and social media comments to detect signals of investors' confidence. We have explored the relationship between sentiment extracted from financial news and tweets and FTSE100 movements. We investigated the strength of the correlation between sentiment measures on a given day and market volatility and returns observed the next day. The findings suggest that there is evidence of correlation between sentiment and stock market movements: the sentiment captured from news headlines could be used as a signal to predict market returns; the same does not apply for volatility. Also, in a surprising finding, for the sentiment found in Twitter comments we obtained a correlation coefficient of -0.7 and a p-value below 0.05, which indicates a strong negative correlation between positive sentiment captured from the tweets on a given day and the volatility observed the next day. We developed an accurate classifier for the prediction of market volatility in response to the arrival of new information by deploying topic modelling, based on Latent Dirichlet Allocation, to extract feature vectors from a collection of tweets and financial news. The obtained features were used as additional input to the classifier. Thanks to the combination of sentiment and topic modelling, our classifier achieved a directional prediction accuracy for volatility of 63%.
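To illustrate what correlating sentiment on a given day with volatility observed the next day means in practice, here is a minimal sketch that shifts the two series by one day before computing a Pearson coefficient. The function names and the toy numbers are hypothetical; they are not the authors' code or data, and the abstract does not state which correlation measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def next_day_correlation(sentiment, volatility):
    """Correlate sentiment on day t with volatility on day t + 1."""
    # Drop the last sentiment value and the first volatility value so
    # that each pair is (sentiment[t], volatility[t + 1]).
    return pearson(sentiment[:-1], volatility[1:])


# Toy daily series, invented for illustration only.
daily_sentiment = [0.1, 0.4, 0.2, 0.8, 0.5]
daily_volatility = [0.0, 0.9, 0.6, 0.8, 0.2]
r = next_day_correlation(daily_sentiment, daily_volatility)
```

In this toy data the next-day volatility falls exactly as sentiment rises, so `r` comes out at the negative extreme; real series would give a weaker value and would also need a significance test.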


NLP with LDA (Latent Dirichlet Allocation) and Text Clustering to improve classification

#artificialintelligence

This section serves as a short reminder of what we are trying to do. CareerVillage, in essence, is like Stack Overflow or Quora but for career questions. Users can post questions about any career, such as computer science, pharmacology, or aerospace engineering, and volunteer professionals try their best to answer them. When a new question comes in, CareerVillage recommends that question to a specific professional who is suitable to answer it. In order to maximize the chance that a user's questions get answered, CareerVillage needs to send the right question to the most apt professional.


JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning

#artificialintelligence

This project's aim is to explore the world of Natural Language Processing (NLP) by building what is known as a Sentiment Analysis Model. A sentiment analysis model is a model that analyses a given piece of text and predicts whether this piece of text expresses positive or negative sentiment. To this end, we will be using the sentiment140 dataset containing data collected from Twitter. An impressive feature of this dataset is that it is perfectly balanced (i.e., the number of examples in each class is equal). Our approach was unique because our training data was automatically created, as opposed to having humans manually annotate tweets.


Sentiment Analysis for Stock Price Prediction in Python

#artificialintelligence

Now that we have our API set up, we can begin pulling tweet data. We will focus on Tesla for this article. We will be using the requests library to interact with the Twitter API. We can search for the most recent tweets matching a query through the /tweets/search/recent endpoint. We need two more parts before sending our request: (1) authorization and (2) a search query.
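The two parts can be sketched as follows. This is a hedged illustration, not the article's own code: the base URL (version 2 of the API), the bearer-token style of authorization, the `tweet.fields` values, and the helper names `build_search_request` and `fetch_recent_tweets` are all assumptions made for this example.

```python
# Assumed base URL; the excerpt only names the endpoint path.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_request(bearer_token, query, max_results=10):
    """Assemble the URL, headers, and query parameters for a search."""
    # (1) Authorization: assumed here to be a bearer token header.
    headers = {"Authorization": f"Bearer {bearer_token}"}
    # (2) Search query, plus a couple of optional parameters.
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,lang",
    }
    return SEARCH_URL, headers, params


def fetch_recent_tweets(bearer_token, query, max_results=10):
    """Send the request and return the decoded JSON response."""
    # Imported here so the builder above works without the dependency.
    import requests

    url, headers, params = build_search_request(bearer_token, query, max_results)
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


# Example query for English-language tweets mentioning Tesla.
url, headers, params = build_search_request("YOUR_BEARER_TOKEN", "tesla lang:en")
```

Keeping the request assembly separate from the network call makes it easy to inspect the headers and parameters before anything is sent.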