As the volume of electronically stored text in daily life grows, extracting novel and relevant information becomes a central challenge in text mining. Statistical methods based on term frequency, matrix algebra, or topic modeling have therefore been widely adopted. Popular topic models, however, are built around a single text collection, which makes them ill-suited to comparative text analysis. We consider a setting in which the corpus can be partitioned into subcollections: all subcollections share a common set of topics, but topic proportions vary across collections. To exploit prior knowledge about the corpus (e.g., its organizational structure), we propose the compound latent Dirichlet allocation (cLDA) model, which improves on previous work, generalizes more readily, and depends less on user-supplied parameters. To identify the parameters of interest in cLDA, we study Markov chain Monte Carlo (MCMC) and variational inference approaches extensively and propose an efficient MCMC method. We evaluate cLDA qualitatively and quantitatively on both synthetic and real-world corpora. A usability study on real-world corpora illustrates the ability of cLDA not only to uncover the underlying topics automatically but also to model their connections and variation across multiple collections.
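A minimal sketch of the hierarchical generative structure this abstract describes may help fix ideas: shared topics, a collection-level prior over topic proportions, and document-level proportions drawn around that prior. All variable names, sizes, and the concentration value are illustrative assumptions, not details taken from the paper.

```python
# Illustrative compound-LDA-style generative process (a sketch, not the paper's model):
# each sub-collection has its own topic-proportion prior; its documents inherit it.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size = 5, 1000
n_collections, docs_per_collection, doc_len = 3, 10, 50

# Shared topics: one word distribution per topic.
topics = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)

corpus = []
for c in range(n_collections):
    # Collection-level topic proportions capture variation across collections.
    pi_c = rng.dirichlet(np.full(n_topics, 1.0))
    for _ in range(docs_per_collection):
        # Document proportions are drawn around the collection-level prior.
        theta_d = rng.dirichlet(50.0 * pi_c)
        z = rng.choice(n_topics, size=doc_len, p=theta_d)          # topic per token
        words = [rng.choice(vocab_size, p=topics[t]) for t in z]   # word per token
        corpus.append((c, words))
```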
This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning word representations called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of English words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data to structured knowledge graphs and present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.
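To make the multi-view idea concrete, the sketch below fuses several co-occurrence views of one vocabulary into a single embedding by reducing each view separately and then combining the projections with one more SVD. This is a generic multi-view construction under assumed shapes and parameters; MVLSA itself additionally re-weights counts, handles missing rows, and uses a generalized CCA formulation.

```python
# Rough multi-view embedding sketch in the spirit of MVLSA (assumptions noted above).
import numpy as np

def multiview_embed(views, k=50):
    """views: list of (n_words x n_contexts_i) count matrices over one shared vocabulary."""
    projections = []
    for X in views:
        # Per-view dimensionality reduction via plain truncated SVD.
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        projections.append(U[:, :k] * S[:k])
    # Fuse the views: SVD of the concatenated per-view embeddings.
    Z = np.concatenate(projections, axis=1)
    G, _, _ = np.linalg.svd(Z, full_matrices=False)
    return G[:, :k]          # one k-dimensional embedding per word

rng = np.random.default_rng(0)
views = [rng.poisson(0.3, size=(200, m)) for m in (300, 150, 400)]
embeddings = multiview_embed(views, k=20)
```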
We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate semantic similarity as a model comparison task in which we contrast a generative model that jointly models two sentences against one that models them independently. We illustrate how this framework can be used for Semantic Textual Similarity tasks under clear assumptions about how word embeddings are generated. We apply model comparison that utilises information criteria to address some of the shortcomings of Bayesian model comparison, whilst still penalising model complexity. Applying the proposed framework with an appropriate choice of likelihood, we achieve competitive results on the STS datasets.
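The following sketch shows one way such an information-criterion comparison could look: fit a single Gaussian to the pooled word embeddings of both sentences ("joint" model), fit separate Gaussians to each sentence ("independent" model), and score similarity by the AIC gap. The Gaussian likelihood and AIC are stand-in choices for illustration, not the paper's exact formulation.

```python
# Illustrative model-comparison similarity score (assumed diagonal-Gaussian likelihood + AIC).
import numpy as np

def gaussian_loglik(X, mu, var):
    # i.i.d. diagonal-Gaussian log-likelihood of rows of X
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def similarity_score(E1, E2, eps=1e-6):
    """E1, E2: (n_words x dim) arrays of word embeddings for the two sentences."""
    d = E1.shape[1]
    fit = lambda X: (X.mean(0), X.var(0) + eps)
    # Joint model: one parameter set generates both sentences.
    mu, var = fit(np.vstack([E1, E2]))
    aic_joint = aic(gaussian_loglik(E1, mu, var) + gaussian_loglik(E2, mu, var), 2 * d)
    # Independent model: each sentence gets its own parameters.
    aic_sep = sum(aic(gaussian_loglik(E, *fit(E)), 2 * d) for E in (E1, E2))
    # Larger gap in favour of the joint model => more semantically similar.
    return aic_sep - aic_joint
```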
Asking questions is a pervasive human activity, but little is understood about what makes them difficult to answer. An analysis of a pair of large databases, of New York Times crosswords and questions from the quiz show Jeopardy!, establishes two orthogonal dimensions of question difficulty: obscurity (the rarity of the answer) and opacity (the indirectness of question cues, operationalized with word2vec). The importance of opacity, and the role of synergistic information in resolving it, suggests that accounts of difficulty in terms of prior expectations capture only a part of the question-asking process. A further regression analysis shows the presence of additional dimensions to question-asking: question complexity, the answer's local network density, cue intersection, and the presence of signal words. Our work shows how question-askers can help their interlocutors by using contextual cues, or, conversely, how a particular kind of unfamiliarity with the domain in question can make it harder for individuals to learn from others. Taken together, these results suggest how Bayesian models of question difficulty can be supplemented by process models and accounts of the heuristics individuals use to navigate conceptual spaces.
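One plausible reading of "opacity operationalized with word2vec" is that a question is opaque when its cue words lie far from the answer in embedding space. The snippet below sketches that reading; the pretrained-model path, the example words, and the averaging scheme are all assumptions for illustration rather than the paper's exact measure.

```python
# Sketch of an opacity measure: indirect cues -> low word2vec similarity -> high opacity.
from gensim.models import KeyedVectors
import numpy as np

# Assumed pretrained embeddings; any word2vec-format file would do.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)

def opacity(cue_words, answer):
    sims = [vectors.similarity(w, answer) for w in cue_words if w in vectors]
    # Average cue-answer similarity, flipped so that larger values mean more opaque.
    return 1.0 - float(np.mean(sims)) if sims else None

print(opacity(["river", "Egypt", "longest"], "Nile"))
```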
As a promising area in artificial intelligence, a new learning paradigm called Small Sample Learning (SSL) has attracted prominent research attention in recent years. In this paper, we aim to present a survey that comprehensively introduces the current techniques proposed on this topic. Specifically, current SSL techniques can be divided into two main categories. The first category of SSL approaches can be called "concept learning", which emphasizes learning new concepts from only a few related observations. Its purpose is mainly to simulate human learning behaviors such as recognition, generation, imagination, synthesis, and analysis. The second category is called "experience learning", which usually co-exists with the large-sample learning regime of conventional machine learning. This category mainly focuses on learning with insufficient samples and is also referred to as small data learning in parts of the literature. More extensive surveys of both categories of SSL techniques are presented, and neuroscience evidence is provided to clarify the rationality of the entire SSL regime and its relationship with the human learning process. We also discuss the main challenges and possible future research directions along this line.
Within the realm of service robotics, researchers have devoted a great deal of effort to learning motions and manipulations for task execution by robots. The problem of robot learning is very broad, as it involves many subtasks such as object detection, action recognition, motion planning, localization, knowledge representation and retrieval, and the intertwining of computer vision and machine learning techniques. In this paper, we focus on how knowledge can be gathered, represented, and reproduced to solve problems, as researchers have done over the past decades. We discuss the problems that have arisen in robot learning and the solutions, technologies, or developments (if any) that have contributed to solving them. Specifically, we look at three broad categories involved in task representation and retrieval for robotics: 1) activity recognition from demonstrations, 2) scene understanding and interpretation, and 3) task representation in robotics, including datasets and networks. Within each section, we discuss major breakthroughs and how their methods address present issues in robot learning and manipulation.
Visual analytics systems combine machine learning or other analytic techniques with interactive data visualization to promote sensemaking and analytical reasoning. It is through such techniques that people can make sense of large, complex data. While progress has been made, the tactful combination of machine learning and data visualization is still under-explored. This state-of-the-art report presents a summary of the progress that has been made by highlighting and synthesizing select research advances. Further, it presents opportunities and challenges to enhance the synergy between machine learning and visual analytics for impactful future research directions.
This paper proposes the continuous semantic topic embedding model (CSTEM), which discovers latent topic variables in documents using a continuous semantic distance function between topics and words, learned by means of a variational autoencoder (VAE). The semantic distance can be any symmetric, bell-shaped geometric distance function on Euclidean space; in this paper we use the Mahalanobis distance. To make the semantic distance behave properly, we introduce an additional model parameter for each word that factors out the global component of this distance, indicating how likely the word is to occur regardless of topic. This addresses the problem that the Gaussian distribution used in previous topic models with continuous word embeddings cannot capture semantic relations correctly, and it yields higher topic coherence. In experiments on the 20 Newsgroups, NIPS papers, and CNN/DailyMail corpora, our model matches the performance of recent state-of-the-art models while also producing topic embedding vectors, making it possible to observe where topic vectors lie among word vectors in Euclidean space and how topics relate to each other semantically.
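The core ingredients of the word likelihood described here can be sketched directly: a topic embedding, word embeddings, a Mahalanobis metric, and a per-word global bias that absorbs topic-independent frequency. The function and parameter names below are illustrative assumptions; the paper learns these quantities inside a VAE rather than fixing them as done here.

```python
# Sketch of a CSTEM-style topic-to-word likelihood (illustrative, not the paper's code).
import numpy as np

def topic_word_probs(word_vecs, topic_vec, M_inv, word_bias):
    """word_vecs: (V x d), topic_vec: (d,), M_inv: (d x d) inverse covariance of the metric,
    word_bias: (V,) topic-independent log-frequency term."""
    diff = word_vecs - topic_vec                         # (V x d)
    maha = np.einsum("vd,de,ve->v", diff, M_inv, diff)   # squared Mahalanobis distance
    logits = -maha + word_bias                           # nearby words and frequent words win
    logits -= logits.max()                               # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                   # p(word | topic)

rng = np.random.default_rng(0)
V, d = 1000, 16
probs = topic_word_probs(rng.normal(size=(V, d)), rng.normal(size=d),
                         np.eye(d), rng.normal(size=V))
```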
Machine learning has become pervasive across domains, impacting a wide variety of applications such as knowledge discovery and data mining, natural language processing, information retrieval, computer vision, social and health informatics, and ubiquitous computing. Two essential problems of machine learning are how to generate features and how to acquire labels for machines to learn from. In particular, labeling large amounts of data for each domain-specific problem can be very time consuming and costly; this has become a key obstacle to making learning protocols practical in real applications. In this paper, we discuss how to use existing general-purpose world knowledge to enhance machine learning processes, by enriching the features or reducing the labeling work. We start by comparing world knowledge with domain-specific knowledge, and then introduce three key problems in using world knowledge in learning processes: explicit and implicit feature representation, inference for knowledge linking and disambiguation, and learning with direct or indirect supervision. Finally, we discuss future directions for this research topic.
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss the appropriate choice of evaluation metrics for the learning objective.