Industry
Stochastic blockmodels with growing number of classes
Choi, David S., Wolfe, Patrick J., Airoldi, Edoardo M.
We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising a collection of Facebook profiles, resulting in block estimates that reveal residual structure.
Notes on a New Philosophy of Empirical Science
This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation. The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
Online Learning: Stochastic and Constrained Adversaries
Rakhlin, Alexander, Sridharan, Karthik, Tewari, Ambuj
Learning theory has largely focused on two main learning scenarios. The first is the classical statistical setting where instances are drawn i.i.d. from a fixed distribution and the second scenario is the online learning, completely adversarial scenario where adversary at every time step picks the worst instance to provide the learner with. It can be argued that in the real world neither of these assumptions are reasonable. It is therefore important to study problems with a range of assumptions on data. Unfortunately, theoretical results in this area are scarce, possibly due to absence of general tools for analysis. Focusing on the regret formulation, we define the minimax value of a game where the adversary is restricted in his moves. The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch learnability. In the supervised setting, we consider various hybrid assumptions on the way that x and y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are online learnable in the smoothed model. In fact, exponentially small noise added to adversary's decisions turns this problem with infinite Littlestone's dimension into a learnable problem.
Synthesizing Robust Plans under Incomplete Domain Models
Nguyen, Tuan, Kambhampati, Subbarao, Do, Minh
Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to generate plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness, and formalize the notion of plan robustness with respect to an incomplete domain model. We then propose an approach to compiling the problem of finding robust plans to the conformant probabilistic planning problem.
From formulas to cirquents in computability logic
Computability logic (CoL) (see http://www.cis.upenn.edu/~giorgi/cl.html) is a recently introduced semantical platform and ambitious program for redeveloping logic as a formal theory of computability, as opposed to the formal theory of truth that logic has more traditionally been. Its expressions represent interactive computational tasks seen as games played by a machine against the environment, and "truth" is understood as existence of an algorithmic winning strategy. With logical operators standing for operations on games, the formalism of CoL is open-ended, and has already undergone series of extensions. This article extends the expressive power of CoL in a qualitatively new way, generalizing formulas (to which the earlier languages of CoL were limited) to circuit-style structures termed cirquents. The latter, unlike formulas, are able to account for subgame/subtask sharing between different parts of the overall game/task. Among the many advantages offered by this ability is that it allows us to capture, refine and generalize the well known independence-friendly logic which, after the present leap forward, naturally becomes a conservative fragment of CoL, just as classical logic had been known to be a conservative fragment of the formula-based version of CoL. Technically, this paper is self-contained, and can be read without any prior familiarity with CoL.
Combining Ontology Development Methodologies and Semantic Web Platforms for E-government Domain Ontology Development
Dombeu, Jean Vincent Fonou, Huisman, Magda
One of the key challenges in electronic government (e-government) is the development of systems that can be easily integrated and interoperated to provide seamless services delivery to citizens. In recent years, Semantic Web technologies based on ontology have emerged as promising solutions to the above engineering problems. However, current research practicing semantic development in e-government does not focus on the application of available methodologies and platforms for developing government domain ontologies. Furthermore, only a few of these researches provide detailed guidelines for developing semantic ontology models from a government service domain. This research presents a case study combining an ontology building methodology and two state-of-the-art Semantic Web platforms namely Protege and Java Jena ontology API for semantic ontology development in e-government. Firstly, a framework adopted from the Uschold and King ontology building methodology is employed to build a domain ontology describing the semantic content of a government service domain. Thereafter, UML is used to semi-formally represent the domain ontology. Finally, Protege and Jena API are employed to create the Web Ontology Language (OWL) and Resource Description Framework (RDF) representations of the domain ontology respectively to enable its computer processing. The study aims at: (1) providing e-government developers, particularly those from the developing world with detailed guidelines for practicing semantic content development in their e-government projects and (2), strengthening the adoption of semantic technologies in e-government. The study would also be of interest to novice Semantic Web developers who might used it as a starting point for further investigations.
Compressive Network Analysis
Jiang, Xiaoye, Yao, Yuan, Liu, Han, Guibas, Leonidas
Modern data acquisition routinely produces massive amounts of network data. Though many methods and models have been proposed to analyze such data, the research of network data is largely disconnected with the classical theory of statistical learning and signal processing. In this paper, we present a new framework for modeling network data, which connects two seemingly different areas: network data analysis and compressed sensing. From a nonparametric perspective, we model an observed network using a large dictionary. In particular, we consider the network clique detection problem and show connections between our formulation with a new algebraic tool, namely Randon basis pursuit in homogeneous spaces. Such a connection allows us to identify rigorous recovery conditions for clique detection problems. Though this paper is mainly conceptual, we also develop practical approximation algorithms for solving empirical problems and demonstrate their usefulness on real-world datasets.
Robust Clustering Using Outlier-Sparsity Regularization
Forero, Pedro A., Kekatos, Vassilis, Giannakis, Georgios B.
Notwithstanding the popularity of conventional clustering algorithms such as K-means and probabilistic clustering, their clustering results are sensitive to the presence of outliers in the data. Even a few outliers can compromise the ability of these algorithms to identify meaningful hidden structures rendering their outcome unreliable. This paper develops robust clustering algorithms that not only aim to cluster the data, but also to identify the outliers. The novel approaches rely on the infrequent presence of outliers in the data which translates to sparsity in a judiciously chosen domain. Capitalizing on the sparsity in the outlier domain, outlier-aware robust K-means and probabilistic clustering approaches are proposed. Their novelty lies on identifying outliers while effecting sparsity in the outlier domain through carefully chosen regularization. A block coordinate descent approach is developed to obtain iterative algorithms with convergence guarantees and small excess computational complexity with respect to their non-robust counterparts. Kernelized versions of the robust clustering algorithms are also developed to efficiently handle high-dimensional data, identify nonlinearly separable clusters, or even cluster objects that are not represented by vectors. Numerical tests on both synthetic and real datasets validate the performance and applicability of the novel algorithms.
Exploiting Structure in Weighted Model Counting Approaches to Probabilistic Inference
Li, W., Poupart, P., van Beek, P.
Previous studies have demonstrated that encoding a Bayesian network into a SAT formula and then performing weighted model counting using a backtracking search algorithm can be an effective method for exact inference. In this paper, we present techniques for improving this approach for Bayesian networks with noisy-OR and noisy-MAX relations-- two relations that are widely used in practice as they can dramatically reduce the number of probabilities one needs to specify. In particular, we present two SAT encodings for noisy-OR and two encodings for noisy-MAX that exploit the structure or semantics of the relations to improve both time and space efficiency, and we prove the correctness of the encodings. We experimentally evaluated our techniques on large-scale real and randomly generated Bayesian networks. On these benchmarks, our techniques gave speedups of up to two orders of magnitude over the best previous approaches for networks with noisy-OR/MAX relations and scaled up to larger networks. As well, our techniques extend the weighted model counting approach for exact inference to networks that were previously intractable for the approach.
Simultaneous model-based clustering and visualization in the Fisher discriminative subspace
Bouveyron, Charles, Brunet, Camille
Clustering in high-dimensional spaces is nowadays a recurrent problem in many scientific domains but remains a difficult task from both the clustering accuracy and the result understanding points of view. This paper presents a discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimension of the original space. By constraining model parameters within and between groups, a family of 12 parsimonious DLM models is exhibited which allows to fit onto various situations. An estimation algorithm, called the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters and the discriminative subspace. Experiments on simulated and real datasets show that the proposed approach performs better than existing clustering methods while providing a useful representation of the clustered data. The method is as well applied to the clustering of mass spectrometry data.