This thesis develops a conceptual framework considering social data as representing the surface layer of a hierarchy of human social behaviours, needs and cognition which is employed to transform social data into representations that preserve social behaviours and their causalities. Based on this framework two platforms were built to capture insights from fast-paced and slow-paced social data. For fast-paced, a self-structuring and incremental learning technique was developed to automatically capture salient topics and corresponding dynamics over time. An event detection technique was developed to automatically monitor those identified topic pathways for significant fluctuations in social behaviours using multiple indicators such as volume and sentiment. This platform is demonstrated using two large datasets with over 1 million tweets. The separated topic pathways were representative of the key topics of each entity and coherent against topic coherence measures. Identified events were validated against contemporary events reported in news. Secondly for the slow-paced social data, a suite of new machine learning and natural language processing techniques were developed to automatically capture self-disclosed information of the individuals such as demographics, emotions and timeline of personal events. This platform was trialled on a large text corpus of over 4 million posts collected from online support groups. This was further extended to transform prostate cancer related online support group discussions into a multidimensional representation and investigated the self-disclosed quality of life of patients (and partners) against time, demographics and clinical factors. The capabilities of this extended platform have been demonstrated using a text corpus collected from 10 prostate cancer online support groups comprising of 609,960 prostate cancer discussions and 22,233 patients.
Google app market captures the school of thought of users via ratings and text reviews. The critique's viewpoint regarding an app is proportional to their satisfaction level. Consequently, this helps other users to gain insights before downloading or purchasing the apps. The potential information from the reviews can't be extracted manually, due to its exponential growth. Sentiment analysis, by machine learning algorithms employing NLP, is used to explicitly uncover and interpret the emotions. This study aims to perform the sentiment classification of the app reviews and identify the university students' behavior towards the app market. We applied machine learning algorithms using the TF-IDF text representation scheme and the performance was evaluated on the ensemble learning method. Our model was trained on Google reviews and tested on students' reviews. SVM recorded the maximum accuracy(93.37\%), F-score(0.88) on tri-gram + TF-IDF scheme. Bagging enhanced the performance of LR and NB with accuracy of 87.80\% and 85.5\% respectively.
To build Sounding Board, we develop a system architecture that is capable of accommodating dialog strategies that we designed for socialbot conversations. The architecture consists of a multi-dimensional language understanding module for analyzing user utterances, a hierarchical dialog management framework for dialog context tracking and complex dialog control, and a language generation process that realizes the response plan and makes adjustments for speech synthesis. Additionally, we construct a new knowledge base to power the socialbot by collecting social chat content from a variety of sources. An important contribution of the system is the synergy between the knowledge base and the dialog management, i.e., the use of a graph structure to organize the knowledge base that makes dialog control very efficient in bringing related content to the discussion. Using the data collected from Sounding Board during the competition, we carry out in-depth analyses of socialbot conversations and user ratings which provide valuable insights in evaluation methods for socialbots. We additionally investigate a new approach for system evaluation and diagnosis that allows scoring individual dialog segments in the conversation. Finally, observing that socialbots suffer from the issue of shallow conversations about topics associated with unstructured data, we study the problem of enabling extended socialbot conversations grounded on a document. To bring together machine reading and dialog control techniques, a graph-based document representation is proposed, together with methods for automatically constructing the graph. Using the graph-based representation, dialog control can be carried out by retrieving nodes or moving along edges in the graph. To illustrate the usage, a mixed-initiative dialog strategy is designed for socialbot conversations on news articles.
Messages in human conversations inherently convey emotions. The task of detecting emotions in textual conversations leads to a wide range of applications such as opinion mining in social networks. However, enabling machines to analyze emotions in conversations is challenging, partly because humans often rely on the context and commonsense knowledge to express emotions. In this paper, we address these challenges by proposing a Knowledge-Enriched Transformer (KET), where contextual utterances are interpreted using hierarchical self-attention and external commonsense knowledge is dynamically leveraged using a context-aware affective graph attention mechanism. Experiments on multiple textual conversation datasets demonstrate that both context and commonsense knowledge are consistently beneficial to the emotion detection performance. In addition, the experimental results show that our KET model outperforms the state-of-the-art models on most of the tested datasets in F1 score.
Sentiment analysis (SA) is a task related to understanding people's feelings in written text; the starting point would be to identify the polarity level (positive, neutral or negative) of a given text, moving on to identify emotions or whether a text is humorous or not. This task has been the subject of several research competitions in a number of languages, e.g., English, Spanish, and Arabic, among others. In this contribution, we propose an SA system, namely EvoMSA, that unifies our participating systems in various SA competitions, making it domain independent and multilingual by processing text using only language-independent techniques. EvoMSA is a classifier, based on Genetic Programming, that works by combining the output of different text classifiers and text models to produce the final prediction. We analyze EvoMSA, with its parameters fixed, on different SA competitions to provide a global overview of its performance, and as the results show, EvoMSA is competitive obtaining top rankings in several SA competitions. Furthermore, we performed an analysis of EvoMSA's components to measure their contribution to the performance; the idea is to facilitate a practitioner or newcomer to implement a competitive SA classifier. Finally, it is worth to mention that EvoMSA is available as open source software.
Word embeddings have been widely used in sentiment classification because of their efficacy for semantic representations of words. Given reviews from different domains, some existing methods for word embeddings exploit sentiment information, but they cannot produce domain-sensitive embeddings. On the other hand, some other existing methods can generate domain-sensitive word embeddings, but they cannot distinguish words with similar contexts but opposite sentiment polarity. We propose a new method for learning domain-sensitive and sentiment-aware embeddings that simultaneously capture the information of sentiment semantics and domain sensitivity of individual words. Our method can automatically determine and produce domain-common embeddings and domain-specific embeddings. The differentiation of domain-common and domain-specific words enables the advantage of data augmentation of common semantics from multiple domains and capture the varied semantics of specific words from different domains at the same time. Experimental results show that our model provides an effective way to learn domain-sensitive and sentiment-aware word embeddings which benefit sentiment classification at both sentence level and lexicon term level.
Any-gram kernels are a flexible and efficient way to employ bag-of-n-gram features when learning from textual data. They are also compatible with the use of word embeddings so that word similarities can be accounted for. While the original any-gram kernels are implemented on top of tree kernels, we propose a new approach which is independent of tree kernels and is more efficient. We also propose a more effective way to make use of word embeddings than the original any-gram formulation. When applied to the task of sentiment classification, our new formulation achieves significantly better performance.
Automated sentiment analysis and opinion mining is a complex process concerning the extraction of useful subjective information from text. The explosion of user generated content on the Web, especially the fact that millions of users, on a daily basis, express their opinions on products and services to blogs, wikis, social networks, message boards, etc., render the reliable, automated export of sentiments and opinions from unstructured text crucial for several commercial applications. In this paper, we present a novel hybrid vectorization approach for textual resources that combines a weighted variant of the popular Word2Vec representation (based on Term Frequency-Inverse Document Frequency) representation and with a Bag- of-Words representation and a vector of lexicon-based sentiment values. The proposed text representation approach is assessed through the application of several machine learning classification algorithms on a dataset that is used extensively in literature for sentiment detection. The classification accuracy derived through the proposed hybrid vectorization approach is higher than when its individual components are used for text represenation, and comparable with state-of-the-art sentiment detection methodologies.
Recursive neural models have achieved promising results in many natural language processing tasks. The main difference among these models lies in the composition function, i.e., how to obtain the vector representation for a phrase or sentence using the representations of words it contains. This paper introduces a novel Adaptive Multi-Compositionality (AdaMC) layer to recursive neural models. The basic idea is to use more than one composition functions and adaptively select them depending on the input vectors. We present a general framework to model each semantic composition as a distribution over these composition functions. The composition functions and parameters used for adaptive selection are learned jointly from data. We integrate AdaMC into existing recursive neural models and conduct extensive experiments on the Stanford Sentiment Treebank. The results illustrate that AdaMC significantly outperforms state-of-the-art sentiment classification methods. It helps push the best accuracy of sentence-level negative/positive classification from 85.4% up to 88.5%.