Goto

Collaborating Authors

Results


Graph Convolutional Topic Model for Data Streams

arXiv.org Machine Learning

Learning hidden topics in data streams has been paid a great deal of attention by researchers with a lot of proposed methods, but exploiting prior knowledge in general and a knowledge graph in particular has not been taken into adequate consideration in these methods. Prior knowledge that is derived from human knowledge (e.g. Wordnet) or a pre-trained model (e.g.Word2vec) is very valuable and useful to help topic models work better, especially on short texts. However, previous work often ignores this resource, or it can only utilize prior knowledge of a vector form in a simple way. In this paper, we propose a novel graph convolutional topic model (GCTM) which integrates graph convolutional networks (GCN) into a topic model and a learning method which learns the networks and the topic model simultaneously for data streams. In each minibatch, our method not only can exploit an external knowledge graph but also can balance between the external and old knowledge to perform well on new data. We conduct extensive experiments to evaluate our method with both human graph knowledge(Wordnet) and a graph built from pre-trained word embeddings (Word2vec). The experimental results show that our method achieves significantly better performances than the state-of-the-art baselines in terms of probabilistic predictive measure and topic coherence. In particular, our method can work well when dealing with short texts as well as concept drift. The implementation of GCTM is available at https://github.com/bachtranxuan/GCTM.git.


TraLFM: Latent Factor Modeling of Traffic Trajectory Data

arXiv.org Machine Learning

The widespread use of positioning devices (e.g., GPS) has given rise to a vast body of human movement data, often in the form of trajectories. Understanding human mobility patterns could benefit many location-based applications. In this paper, we propose a novel generative model called TraLFM via latent factor modeling to mine human mobility patterns underlying traffic trajectories. TraLFM is based on three key observations: (1) human mobility patterns are reflected by the sequences of locations in the trajectories; (2) human mobility patterns vary with people; and (3) human mobility patterns tend to be cyclical and change over time. Thus, TraLFM models the joint action of sequential, personal and temporal factors in a unified way, and brings a new perspective to many applications such as latent factor analysis and next location prediction. We perform thorough empirical studies on two real datasets, and the experimental results confirm that TraLFM outperforms the state-of-the-art methods significantly in these applications.


An Approach for Time-aware Domain-based Social Influence Prediction

arXiv.org Artificial Intelligence

Online Social Networks(OSNs) have established virtual platforms enabling people to express their opinions, interests and thoughts in a variety of contexts and domains, allowing legitimate users as well as spammers and other untrustworthy users to publish and spread their content. Hence, the concept of social trust has attracted the attention of information processors/data scientists and information consumers/business firms. One of the main reasons for acquiring the value of Social Big Data (SBD) is to provide frameworks and methodologies using which the credibility of OSNs users can be evaluated. These approaches should be scalable to accommodate large-scale social data. Hence, there is a need for well comprehending of social trust to improve and expand the analysis process and inferring the credibility of SBD. Given the exposed environment's settings and fewer limitations related to OSNs, the medium allows legitimate and genuine users as well as spammers and other low trustworthy users to publish and spread their content. Hence, this paper presents an approach incorporates semantic analysis and machine learning modules to measure and predict users' trustworthiness in numerous domains in different time periods. The evaluation of the conducted experiment validates the applicability of the incorporated machine learning techniques to predict highly trustworthy domain-based users.


A Correspondence Analysis Framework for Author-Conference Recommendations

arXiv.org Machine Learning

For many years, achievements and discoveries made by scientists are made aware through research papers published in appropriate journals or conferences. Often, established scientists and especially newbies are caught up in the dilemma of choosing an appropriate conference to get their work through. Every scientific conference and journal is inclined towards a particular field of research and there is a vast multitude of them for any particular field. Choosing an appropriate venue is vital as it helps in reaching out to the right audience and also to further one's chance of getting their paper published. In this work, we address the problem of recommending appropriate conferences to the authors to increase their chances of acceptance. We present three different approaches for the same involving the use of social network of the authors and the content of the paper in the settings of dimensionality reduction and topic modeling. In all these approaches, we apply Correspondence Analysis (CA) to derive appropriate relationships between the entities in question, such as conferences and papers. Our models show promising results when compared with existing methods such as content-based filtering, collaborative filtering and hybrid filtering.


Machine Learning Testing: Survey, Landscapes and Horizons

arXiv.org Artificial Intelligence

This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 128 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.


Representation Learning for Words and Entities

arXiv.org Artificial Intelligence

This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of english words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data and focus on structured knowledge graphs. I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.


Automatic Language Identification in Texts: A Survey

Journal of Artificial Intelligence Research

Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelfLI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.


Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

arXiv.org Artificial Intelligence

The quality of training data is one of the crucial problems when a learning-centered approach is employed. This paper proposes a new method to investigate the quality of a large corpus designed for the recognizing textual entailment (RTE) task. The proposed method, which is inspired by a statistical hypothesis test, consists of two phases: the first phase is to introduce the predictability of textual entailment labels as a null hypothesis which is extremely unacceptable if a target corpus has no hidden bias, and the second phase is to test the null hypothesis using a Naive Bayes model. The experimental result of the Stanford Natural Language Inference (SNLI) corpus does not reject the null hypothesis. Therefore, it indicates that the SNLI corpus has a hidden bias which allows prediction of textual entailment labels from hypothesis sentences even if no context information is given by a premise sentence. This paper also presents the performance impact of NN models for RTE caused by this hidden bias.


ClassiNet -- Predicting Missing Features for Short-Text Classification

arXiv.org Artificial Intelligence

The fundamental problem in short-text classification is \emph{feature sparseness} -- the lack of feature overlap between a trained model and a test instance to be classified. We propose \emph{ClassiNet} -- a network of classifiers trained for predicting missing features in a given instance, to overcome the feature sparseness problem. Using a set of unlabeled training instances, we first learn binary classifiers as feature predictors for predicting whether a particular feature occurs in a given instance. Next, each feature predictor is represented as a vertex $v_i$ in the ClassiNet where a one-to-one correspondence exists between feature predictors and vertices. The weight of the directed edge $e_{ij}$ connecting a vertex $v_i$ to a vertex $v_j$ represents the conditional probability that given $v_i$ exists in an instance, $v_j$ also exists in the same instance. We show that ClassiNets generalize word co-occurrence graphs by considering implicit co-occurrences between features. We extract numerous features from the trained ClassiNet to overcome feature sparseness. In particular, for a given instance $\vec{x}$, we find similar features from ClassiNet that did not appear in $\vec{x}$, and append those features in the representation of $\vec{x}$. Moreover, we propose a method based on graph propagation to find features that are indirectly related to a given short-text. We evaluate ClassiNets on several benchmark datasets for short-text classification. Our experimental results show that by using ClassiNet, we can statistically significantly improve the accuracy in short-text classification tasks, without having to use any external resources such as thesauri for finding related features.


Continuous Semantic Topic Embedding Model Using Variational Autoencoder

arXiv.org Machine Learning

This paper proposes the continuous semantic topic embedding model (CSTEM) which finds latent topic variables in documents using continuous semantic distance function between the topics and the words by means of the variational autoencoder(VAE). The semantic distance could be represented by any symmetric bell-shaped geometric distance function on the Euclidean space, for which the Mahalanobis distance is used in this paper. In order for the semantic distance to perform more properly, we newly introduce an additional model parameter for each word to take out the global factor from this distance indicating how likely it occurs regardless of its topic. It certainly improves the problem that the Gaussian distribution which is used in previous topic model with continuous word embedding could not explain the semantic relation correctly and helps to obtain the higher topic coherence. Through the experiments with the dataset of 20 Newsgroup, NIPS papers and CNN/Dailymail corpus, the performance of the recent state-of-the-art models is accomplished by our model as well as generating topic embedding vectors which makes possible to observe where the topic vectors are embedded with the word vectors in the real Euclidean space and how the topics are related each other semantically.