Media
Matrix Completion from Power-Law Distributed Samples
Meka, Raghu, Jain, Prateek, Dhillon, Inderjit S.
The low-rank matrix completion problem is a fundamental problem with many important applications. Recently, Candes & Recht, Keshavan et al. and Candes & Tao obtained the first non-trivial theoretical results for the problem assuming that the observed entries are sampled uniformly at random. Unfortunately, most real-world datasets do not satisfy this assumption, but instead exhibit power-law distributed samples. In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. Our method is easier to analyze than previous methods with the analysis reducing to computing the threshold for complete cascades in random graphs, a problem of independent interest. By analyzing the graph theoretic problem, we show that our method achieves exact recovery when the observed entries are sampled from the Chung-Lu-Vu model, which can generate power-law distributed graphs. We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. Furthermore, our method is easier to implement and is substantially faster than existing methods. We demonstrate the effectiveness of our method on examples when the low-rank matrix is sampled according to the prevalent random graph models for complex networks and also on the Netflix challenge dataset.
Learning transport operators for image manifolds
Culpepper, Benjamin, Olshausen, Bruno A.
We describe a method for learning a group of continuous transformation operators to traverse smooth nonlinear manifolds. The method is applied to model how natural images change over time and scale. The group of continuous transform operators is represented by a basis that is adapted to the statistics of the data so that the in๏ฌnitesimal generator for a measurement orbit can be produced by a linear combination of a few basis elements. We illustrate how the method can be used to ef๏ฌciently code time-varying images by describing changes across time and scale in terms of the learned operators.
MedLDA: A General Framework of Maximum Margin Supervised Topic Models
Zhu, Jun, Ahmed, Amr, Xing, Eric P.
Supervised topic models utilize document's side information for discovering predictive low dimensional representations of documents. Existing models apply the likelihood-based estimation. In this paper, we present a general framework of max-margin supervised topic models for both continuous and categorical response variables. Our approach, the maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and estimate predictive topic representations that are arguably more suitable for prediction tasks. The general principle of MedLDA can be applied to perform joint max-margin learning and maximum likelihood estimation for arbitrary topic models, directed or undirected, and supervised or unsupervised, when the supervised side information is available. We develop efficient variational methods for posterior inference and parameter estimation, and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.
A survey of statistical network models
Goldenberg, Anna, Zheng, Alice X, Fienberg, Stephen E, Airoldi, Edoardo M
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
Why so? or Why no? Functional Causality for Explaining Query Answers
Meliou, Alexandra, Gatterbauer, Wolfgang, Moore, Katherine F., Suciu, Dan
In this paper, we propose causality as a unified framework to explain query answers and non-answers, thus generalizing and extending several previously proposed approaches of provenance and missing query result explanations. We develop our framework starting from the well-studied definition of actual causes by Halpern and Pearl [13]. After identifying some undesirable characteristics of the original definition, we propose functional causes as a refined definition of causality with several desirable properties. These properties allow us to apply our notion of causality in a database context and apply it uniformly to define the causes of query results and their individual contributions in several ways: (i) we can model both provenance as well as nonanswers, (ii) we can define explanations as either data in the input relations or relational operations in a query plan, and (iii) we can give graded degrees of responsibility to individual causes, thus allowing us to rank causes. In particular, our approach allows us to explain contributions to relational aggregate functions and to rank causes according to their respective responsibilities. We give complexity results and describe polynomial algorithms for evaluating causality in tractable cases. Throughout the paper, we illustrate the applicability of our framework with several examples. Overall, we develop in this paper the theoretical foundations of causality theory in a database context.
Discovering general partial orders in event streams
Achar, Avinash, Laxman, Srivatsan, Viswanathan, Raajay, Sastry, P. S.
Frequent episode discovery is a popular framework for pattern discovery in event streams. An episode is a partially ordered set of nodes with each node associated with an event type. Efficient (and separate) algorithms exist for episode discovery when the associated partial order is total (serial episode) and trivial (parallel episode). In this paper, we propose efficient algorithms for discovering frequent episodes with general partial orders. These algorithms can be easily specialized to discover serial or parallel episodes. Also, the algorithms are flexible enough to be specialized for mining in the space of certain interesting subclasses of partial orders. We point out that there is an inherent combinatorial explosion in frequent partial order mining and most importantly, frequency alone is not a sufficient measure of interestingness. We propose a new interestingness measure for general partial order episodes and a discovery method based on this measure, for filtering out uninteresting partial orders. Simulations demonstrate the effectiveness of our algorithms.
Social Networks and Social Information Filtering on Digg
The new social media sites -- blogs, wikis, Flickr and Digg, among others -- underscore the transformation of the Web to a participatory medium in which users are actively creating, evaluating and distributing information. Digg is a social news aggregator which allows users to submit links to, vote on and discuss news stories. Each day Digg selects a handful of stories to feature on its front page. Rather than rely on the opinion of a few editors, Digg aggregates opinions of thousands of its users to decide which stories to promote to the front page. Digg users can designate other users as ``friends'' and easily track friends' activities: what new stories they submitted, commented on or read. The friends interface acts as a \emph{social filtering} system, recommending to user stories his or her friends liked or found interesting. By tracking the votes received by newly submitted stories over time, we showed that social filtering is an effective information filtering approach. Specifically, we showed that (a) users tend to like stories submitted by friends and (b) users tend to like stories their friends read and liked. As a byproduct of social filtering, social networks also play a role in promoting stories to Digg's front page, potentially leading to ``tyranny of the minority'' situation where a disproportionate number of front page stories comes from the same small group of interconnected users. Despite this, social filtering is a promising new technology that can be used to personalize and tailor information to individual users: for example, through personal front pages.
Building and displaying name relations using automatic unsupervised analysis of newspaper articles
Pouliquen, Bruno, Steinberger, Ralf, Ignat, Camelia, Oellinger, Tamara
We present a tool that, from automatically recognised names, tries to infer inter-person relations in order to present associated people on maps. Based on an in-house Named Entity Recognition tool, applied on clusters of an average of 15,000 news articles per day, in 15 different languages, we build a knowledge base that allows extracting statistical co-occurrences of persons and visualising them on a per-person page or in various graphs.
A Knowledge-Based Approach for Selecting Information Sources
Eiter, Thomas, Fink, Michael, Tompits, Hans
Through the Internet and the World-Wide Web, a vast number of information sources has become available, which offer information on various subjects by different providers, often in heterogeneous formats. This calls for tools and methods for building an advanced information-processing infrastructure. One issue in this area is the selection of suitable information sources in query answering. In this paper, we present a knowledge-based approach to this problem, in the setting where one among a set of information sources (prototypically, data repositories) should be selected for evaluating a user query. We use extended logic programs (ELPs) to represent rich descriptions of the information sources, an underlying domain theory, and user queries in a formal query language (here, XML-QL, but other languages can be handled as well). Moreover, we use ELPs for declarative query analysis and generation of a query description. Central to our approach are declarative source-selection programs, for which we define syntax and semantics. Due to the structured nature of the considered data items, the semantics of such programs must carefully respect implicit context information in source-selection rules, and furthermore combine it with possible user preferences. A prototype implementation of our approach has been realized exploiting the DLV KR system and its plp front-end for prioritized ELPs. We describe a representative example involving specific movie databases, and report about experimental results.
Self-Regulated Artificial Ant Colonies on Digital Image Habitats
Fernandes, Carlos, Ramos, Vitorino, Rosa, Agostinho C.
Artificial life models, swarm intelligent and evolutionary computation algorithms are usually built on fixed size populations. Some studies indicate however that varying the population size can increase the adaptability of these systems and their capability to react to changing environments. In this paper we present an extended model of an artificial ant colony system designed to evolve on digital image habitats. We will show that the present swarm can adapt the size of the population according to the type of image on which it is evolving and reacting faster to changing images, thus converging more rapidly to the new desired regions, regulating the number of his image foraging agents. Finally, we will show evidences that the model can be associated with the Mathematical Morphology Watershed algorithm to improve the segmentation of digital grey-scale images. KEYWORDS: Swarm Intelligence, Perception and Image Processing, Pattern Recognition, Mathematical Morphology, Social Cognitive Maps, Social Foraging, Self-Organization, Distributed Search.