Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application (Machine Learning)

In this paper we address the problem of discovering a small set of frequent serial episodes from sequential data so as to adequately characterize or summarize the data. We discuss an algorithm based on the Minimum Description Length (MDL) principle; it is a slight modification of an earlier method called CSC-2. We present a novel generative model for sequence data containing prominent pairs of serial episodes and use it to provide some statistical justification for the algorithm. We believe this is the first instance of such a statistical justification for an MDL-based algorithm for summarizing event sequence data. We then present a novel application of this data mining algorithm to text classification. By treating text documents as temporal sequences of words, the algorithm can find a set of characteristic episodes for the training data as a whole. The words that appear in these characteristic episodes can then be treated as the only relevant words for the dictionary, which considerably reduces the dimension of the feature vector. We show, through simulation experiments on benchmark data sets, that the discovered frequent episodes can be used to achieve a more than four-fold reduction in dictionary size without losing any classification accuracy.
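To make the notions of a serial episode and of dictionary reduction concrete, here is a minimal Python sketch. It is not the MDL-based algorithm from the paper: it assumes a simple greedy count of non-overlapped occurrences (gaps between events allowed), and the function names `count_non_overlapped` and `reduced_dictionary` are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, Sequence, Set


def count_non_overlapped(sequence: Sequence[str], episode: Sequence[str]) -> int:
    """Greedily count non-overlapped occurrences of a serial episode.

    A serial episode is an ordered tuple of event types. An occurrence is a
    subsequence of `sequence` matching the episode in order, with arbitrary
    gaps. Assumption: one occurrence must finish before the next one starts.
    """
    if not episode:
        return 0
    count = 0
    pos = 0  # index of the episode symbol we are currently waiting for
    for event in sequence:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):  # full occurrence seen, start over
                count += 1
                pos = 0
    return count


def reduced_dictionary(episodes: Iterable[Sequence[str]]) -> Set[str]:
    """Keep only the words that appear in at least one selected episode."""
    return {word for episode in episodes for word in episode}


if __name__ == "__main__":
    words = "a b c a c b a b".split()
    print(count_non_overlapped(words, ("a", "b")))
    print(sorted(reduced_dictionary([("a", "b"), ("b", "c")])))
```

In a text classification pipeline, the set returned by `reduced_dictionary` would replace the full vocabulary when building feature vectors; how the episodes themselves are selected is left to the mining algorithm.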

Exploring Partially Observed Networks with Nonparametric Bandits (Machine Learning)

Real-world networks such as social and communication networks are too large to be observed entirely. Such networks are often only partially observed, so that the size, topology, and nodes of the original network are unknown. In this paper we formalize the Adaptive Graph Exploring problem. We assume that we are given an incomplete snapshot of a large network and that additional nodes can be discovered by querying nodes in the currently observed network. The goal is to maximize the number of observed nodes within a given query budget: which set of nodes should be queried to maximize the size of the observed network? We formulate this as an exploration-exploitation problem and propose a novel nonparametric multi-armed bandit (MAB) algorithm for identifying which nodes to query. Our contributions are: (1) $i$KNN-UCB, a novel nonparametric MAB algorithm that applies $k$-nearest neighbor UCB to the setting where the arms are presented in a vector space; (2) a theoretical guarantee that $i$KNN-UCB has sublinear regret; and (3) experiments on synthetic networks and real-world networks from different domains, showing that our method discovers up to 40% more nodes than existing baselines.
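The general idea of a $k$-nearest neighbor UCB rule can be sketched in a few lines of Python. This is not the $i$KNN-UCB algorithm itself: the exploration bonus used here (distance to the $k$-th neighbor plus $1/\sqrt{k}$, scaled by `alpha`) is an assumption made for illustration, and the names `knn_ucb_scores` and `pick_node` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def knn_ucb_scores(
    candidates: List[Vector],
    history: List[Tuple[Vector, float]],
    k: int = 3,
    alpha: float = 1.0,
) -> List[float]:
    """Score each candidate arm as (kNN mean reward) + (exploration bonus).

    `history` holds (feature_vector, observed_reward) pairs for nodes that
    were already queried, e.g. reward = number of newly discovered nodes.
    """
    scores = []
    for x in candidates:
        if not history:
            scores.append(float("inf"))  # nothing known yet: explore
            continue
        # The k closest previously queried arms, as (distance, reward) pairs.
        neighbours = sorted((math.dist(x, h), r) for h, r in history)[:k]
        mean_reward = sum(r for _, r in neighbours) / len(neighbours)
        radius = neighbours[-1][0]  # distance to the farthest of the k
        # Assumed bonus: large when neighbours are far away or few.
        bonus = alpha * (radius + 1.0 / math.sqrt(len(neighbours)))
        scores.append(mean_reward + bonus)
    return scores


def pick_node(candidates, history, k=3, alpha=1.0) -> int:
    """Return the index of the candidate with the highest UCB score."""
    scores = knn_ucb_scores(candidates, history, k, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    past = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0)]
    print(pick_node([(0.0, 0.1), (1.0, 0.1)], past, k=1))
```

With `alpha=0` the rule is purely exploitative (it follows the nearest-neighbor reward estimate); larger values favor candidates that lie far from anything queried so far.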

Global and Local Feature Learning for Ego-Network Analysis (Machine Learning)

In an ego-network, an individual (ego) organizes its friends (alters) into different groups (social circles). Such a network can be analyzed efficiently after learning representations of the ego and its alters in a low-dimensional, real vector space. These representations are then easily exploited by statistical models for tasks such as social circle detection and prediction. Recent advances in language modeling via deep learning have inspired new methods for learning network representations, and these methods can capture the global structure of networks. In this paper, we extend these techniques to also encode the local structure of neighborhoods. Our local representations therefore capture network features that are hidden in the global representation of large networks. We show that the task of social circle prediction benefits from a combination of global and local features generated by our technique.

Embedding Geographic Locations for Modelling the Natural Environment using Flickr Tags and Structured Data (Machine Learning)

Meta-data from photo-sharing websites such as Flickr can be used to obtain rich bag-of-words descriptions of geographic locations, which have proven valuable for, among other things, modelling and predicting ecological features. One important insight from previous work is that the descriptions obtained from Flickr tend to be complementary to the structured information that is available from traditional scientific resources. To better integrate these two diverse sources of information, in this paper we consider a method for learning vector space embeddings of geographic locations. We show experimentally that this method improves on existing approaches, especially in cases where structured information is available.

Querying Complex Networks in Vector Space (Machine Learning)

Learning vector embeddings of complex networks is a powerful approach used to predict missing or unobserved edges in network data. However, an open challenge in this area is developing techniques that can reason about $\textit{subgraphs}$ in network data, which can involve the logical conjunction of several edge relationships. Here we introduce a framework to make predictions about conjunctive logical queries---i.e., subgraph relationships---on heterogeneous network data. In our approach, we embed network nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. We prove that a small set of geometric operations is sufficient to represent conjunctive logical queries on a network, and we introduce a series of increasingly strong implementations of these operators. We demonstrate the utility of this framework in two application studies on networks with millions of edges: predicting unobserved subgraphs in a network of drug-gene-disease interactions and in a network of social interactions derived from a popular web forum. These experiments demonstrate how our framework can efficiently make logical predictions such as "what drugs are likely to target proteins involved with both diseases X and Y?" Together, our results highlight how imposing logical structure can make network embeddings more useful for large-scale knowledge discovery.
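The example query above can be illustrated with a toy Python sketch in which each edge type is a translation vector and conjunction is an element-wise mean. Both choices are simplifying assumptions for illustration (the abstract describes learned operators), the embeddings are hand-picked rather than trained, and all names (`project`, `intersect`, `rank`, `involves`, `targeted_by`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def project(query: Vector, relation: Vector) -> List[float]:
    """Follow one edge type: modelled here as a translation in embedding space."""
    return [q + r for q, r in zip(query, relation)]


def intersect(*queries: Vector) -> List[float]:
    """Logical AND of several sub-queries: modelled here as the element-wise mean."""
    return [sum(values) / len(queries) for values in zip(*queries)]


def rank(query: Vector, nodes: Dict[str, Vector]) -> List[str]:
    """Rank candidate nodes by Euclidean distance to the query embedding."""
    return sorted(nodes, key=lambda name: math.dist(query, nodes[name]))


if __name__ == "__main__":
    # Toy, hand-picked 2-D embeddings (not learned from data).
    disease_x, disease_y = [0.0, 0.0], [2.0, 0.0]
    involves = [1.0, 1.0]      # disease -> protein
    targeted_by = [0.0, 1.0]   # protein -> drug
    drugs = {"d1": [2.0, 2.0], "d2": [5.0, 5.0]}

    # "Which drugs target proteins involved with both X and Y?"
    protein_query = intersect(project(disease_x, involves),
                              project(disease_y, involves))
    drug_query = project(protein_query, targeted_by)
    print(rank(drug_query, drugs))
```

The point of the sketch is that the whole conjunctive query is evaluated by a few vector operations followed by a nearest-neighbor lookup, without enumerating intermediate protein nodes.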