Goto

Collaborating Authors

 Discourse & Dialogue


A Neural Autoregressive Topic Model

Neural Information Processing Systems

We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax.


Monte Carlo Methods for Maximum Margin Supervised Topic Models

Neural Information Processing Systems

An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihood-based supervised topic models, of which posterior inference can be carried out using the Bayes' rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.


Symmetric Correspondence Topic Models for Multilingual Text Analysis

Neural Information Processing Systems

Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA.


Lexical and Hierarchical Topic Regression

Neural Information Processing Systems

Inspired by a two-level theory that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (SHLDA) which jointly captures documents' multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant process to discover a tree-structured topic hierarchy and uses both per-topic hierarchical and per-word lexical regression parameters to model the response variables. Experiments in a political domain and on sentiment analysis tasks show that SHLDA improves predictive accuracy while adding a new dimension of insight into how topics under discussion are framed.


How to Create a Sentiment Analysis Model From Scratch

#artificialintelligence

Sentiment analysis is a natural language processing (NLP) technique that identifies the attitude behind a text. It is also known as opinion mining. The goal of sentiment analysis is to identify whether a certain text has positive, negative, or neutral sentiment. It is widely used by businesses to automatically classify the sentiment in customer reviews. Analyzing large volumes of reviews helps gain valuable insights into the customers' preferences.


Polarity based Sarcasm Detection using Semigraph

arXiv.org Artificial Intelligence

Sarcasm is an advanced linguistic expression often found on various online platforms. Sarcasm detection is challenging in natural language processing tasks that affect sentiment analysis. This article presents the inventive method of the semigraph, including semigraph construction and sarcasm detection processes. A variation of the semigraph is suggested in the pattern-relatedness of the text document. The proposed method is to obtain the sarcastic and non-sarcastic polarity scores of a document using a semigraph. The sarcastic polarity score represents the possibility that a document will become sarcastic. Sarcasm is detected based on the polarity scoring model. The performance of the proposed model enhances the existing prior art approach to sarcasm detection. In the Amazon product review, the model achieved the accuracy, recall, and f-measure of 0.87, 0.79, and 0.83, respectively.


Negativity Spreads Faster: A Large-Scale Multilingual Twitter Analysis on the Role of Sentiment in Political Communication

arXiv.org Artificial Intelligence

Social media has become extremely influential when it comes to policy making in modern societies, especially in the western world, where platforms such as Twitter allow users to follow politicians, thus making citizens more involved in political discussion. In the same vein, politicians use Twitter to express their opinions, debate among others on current topics and promote their political agendas aiming to influence voter behaviour. In this paper, we attempt to analyse tweets of politicians from three European countries and explore the virality of their tweets. Previous studies have shown that tweets conveying negative sentiment are likely to be retweeted more frequently. By utilising state-of-the-art pre-trained language models, we performed sentiment analysis on hundreds of thousands of tweets collected from members of parliament in Greece, Spain and the United Kingdom, including devolved administrations. We achieved this by systematically exploring and analysing the differences between influential and less popular tweets. Our analysis indicates that politicians' negatively charged tweets spread more widely, especially in more recent times, and highlights interesting differences between political parties as well as between politicians and the general population.


Words that Wound: The Impact of Biased Language on News Sentiment and Stock Market Index

arXiv.org Artificial Intelligence

This study investigates the impact of biased language, specifically 'Words that Wound,' on sentiment analysis in a dataset of 45,379 South Korean daily economic news articles. Using Word2Vec, cosine similarity, and an expanded lexicon, we analyzed the influence of these words on news titles' sentiment scores. Our findings reveal that incorporating biased language significantly amplifies sentiment scores' intensity, particularly negativity. The research examines the effect of heightened negativity in news titles on the KOSPI200 index using linear regression and sentiment analysis. Results indicate that the augmented sentiment lexicon (Sent1000), which includes the top 1,000 negative words with high cosine similarity to 'Crisis,' more effectively captures the impact of news sentiment on the stock market index than the original KNU sentiment lexicon (Sent0). The ARDL model and Impulse Response Function (IRF) analyses disclose that Sent1000 has a stronger and more persistent impact on KOSPI200 compared to Sent0. These findings emphasize the importance of understanding language's role in shaping market dynamics and investor sentiment, particularly the impact of negatively biased language on stock market indices. The study highlights the need for considering context and linguistic nuances when analyzing news content and its potential effects on public opinion and market dynamics.


Classifying COVID-19 Related Tweets for Fake News Detection and Sentiment Analysis with BERT-based Models

arXiv.org Artificial Intelligence

The present paper is about the participation of our team "techno" on CERIST'22 shared tasks. We used an available dataset "task1.c" related to covid-19 pandemic. It comprises 4128 tweets for sentiment analysis task and 8661 tweets for fake news detection task. We used natural language processing tools with the combination of the most renowned pre-trained language models BERT (Bidirectional Encoder Representations from Transformers). The results shows the efficacy of pre-trained language models as we attained an accuracy of 0.93 for the sentiment analysis task and 0.90 for the fake news detection task.


When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus

arXiv.org Artificial Intelligence

Building a natural language dataset requires caution since word semantics is vulnerable to subtle text change or the definition of the annotated concept. Such a tendency can be seen in generative tasks like question-answering and dialogue generation and also in tasks that create a categorization-based corpus, like topic classification or sentiment analysis. Open-domain conversations involve two or more crowdworkers freely conversing about any topic, and collecting such data is particularly difficult for two reasons: 1) the dataset should be ``crafted" rather than ``obtained" due to privacy concerns, and 2) paid creation of such dialogues may differ from how crowdworkers behave in real-world settings. In this study, we tackle these issues when creating a large-scale open-domain persona dialogue corpus, where persona implies that the conversation is performed by several actors with a fixed persona and user-side workers from an unspecified crowd.