Oracle inequalities for computationally adaptive model selection
Agarwal, Alekh, Bartlett, Peter L., Duchi, John C.
We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the optimal function class.
Fast Planar Correlation Clustering for Image Segmentation
Yarkony, Julian, Ihler, Alexander T., Fowlkes, Charless C.
We describe a new optimization scheme for finding high-quality correlation clusterings in planar graphs that uses weighted perfect matching as a subroutine. Our method provides lower-bounds on the energy of the optimal correlation clustering that are typically fast to compute and tight in practice. We demonstrate our algorithm on the problem of image segmentation where this approach outperforms existing global optimization techniques in minimizing the objective and is competitive with the state of the art in producing high-quality segmentations.
The Distributed Ontology Language (DOL): Use Cases, Syntax, and Extensibility
Lange, Christoph, Mossakowski, Till, Kutz, Oliver, Galinski, Christian, Grüninger, Michael, Vale, Daniel Couto
The Distributed Ontology Language (DOL) is currently being standardized within the OntoIOp (Ontology Integration and Interoperability) activity of ISO/TC 37/SC 3. It aims at providing a unified framework for (1) ontologies formalized in heterogeneous logics, (2) modular ontologies, (3) links between ontologies, and (4) annotation of ontologies. This paper presents the current state of DOL's standardization. It focuses on use cases where distributed ontologies enable interoperability and reusability. We demonstrate relevant features of the DOL syntax and semantics and explain how these integrate into existing knowledge engineering environments.
Learning a peptide-protein binding affinity predictor with kernel ridge regression
Giguère, Sébastien, Marchand, Mario, Laviolette, François, Drouin, Alexandre, Corbeil, Jacques
We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, such as the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm for its approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of accurately predicting the binding affinity of any peptide to any protein. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets. On all benchmarks, our method significantly (p-value < 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. The method should be of value to a large segment of the research community with the potential to accelerate peptide-based drug and vaccine development.
Decision Making for Symbolic Probability
Giang, Phan H., Sandilya, Sathyakama
This paper proposes a decision theory for a symbolic generalization of probability theory (SP). Darwiche and Ginsberg [2,3] proposed SP to relax the requirement of using numbers for uncertainty while preserving desirable patterns of Bayesian reasoning. SP represents uncertainty by symbolic supports that are ordered partially rather than completely as in the case of standard probability. We show that a preference relation on acts that satisfies a number of intuitive postulates is represented by a utility function whose domain is a set of pairs of supports. We argue that a subjective interpretation is as useful and appropriate for SP as it is for numerical probability. It is useful because the subjective interpretation provides a basis for uncertainty elicitation. It is appropriate because we can provide a decision theory that explains how preference on acts is based on support comparison.
PAC-Bayesian Inequalities for Martingales
Seldin, Yevgeny, Laviolette, François, Cesa-Bianchi, Nicolò, Shawe-Taylor, John, Auer, Peter
We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales, opening the way for its application to importance weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0,1] interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma's inequality.
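For reference, the classical Hoeffding-Azuma inequality mentioned in this abstract can be stated as follows. This is the standard textbook form, not the tighter analog derived in the paper, and the symbols $M_i$, $c_i$, $n$ and $t$ are introduced here only for illustration. Assume $(M_i)_{i \ge 0}$ is a martingale whose increments satisfy $|M_i - M_{i-1}| \le c_i$ almost surely. Then for every $t > 0$:

```latex
\Pr\bigl( |M_n - M_0| \ge t \bigr)
  \;\le\; 2 \exp\!\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
```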
Universally Consistent Latent Position Estimation and Vertex Classification for Random Dot Product Graphs
Sussman, Daniel L., Tang, Minh, Priebe, Carey E.
In this work we show that, using the eigen-decomposition of the adjacency matrix, we can consistently estimate latent positions for random dot product graphs provided the latent positions are i.i.d. from some distribution. If class labels are observed for a number of vertices tending to infinity, then we show that the remaining vertices can be classified with error converging to Bayes optimal using the $k$-nearest-neighbors classification rule. We evaluate the proposed methods on simulated data and a graph derived from Wikipedia.
High Dimensional Semiparametric Gaussian Copula Graphical Models
Liu, Han, Han, Fang, Yuan, Ming, Lafferty, John, Wasserman, Larry
In this paper, we propose a semiparametric approach, named nonparanormal skeptic, for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider Gaussian Copula graphical models (or the nonparanormal) as proposed by Liu et al. (2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This notable result suggests that the Gaussian copula graphical models can be used as a safe replacement of the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare different estimators for their graph recovery performance under both ideal and noisy settings. The proposed methods are then applied to a large-scale genomic dataset to illustrate their empirical usefulness. The R language software package huge implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.
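The rank-based idea in this abstract can be illustrated with a minimal Python sketch. It assumes the standard Gaussian copula relationship between Kendall's tau and the latent correlation, namely sin(pi/2 * tau), and that ties are absent. The function names are hypothetical, and this is neither the API of the huge package nor necessarily the exact estimator analyzed in the paper. A graph estimator would take the resulting matrix as input in place of the sample correlation matrix.

```python
import math

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                score += 1   # concordant pair
            elif prod < 0:
                score -= 1   # discordant pair
    return 2.0 * score / (n * (n - 1))

def rank_based_correlation(x, y):
    """Map Kendall's tau to a correlation estimate (assumed Gaussian copula link)."""
    return math.sin(math.pi / 2.0 * kendall_tau(x, y))

def rank_based_correlation_matrix(columns):
    """Pairwise rank-based correlations for a list of equal-length columns."""
    d = len(columns)
    return [
        [1.0 if i == j else rank_based_correlation(columns[i], columns[j])
         for j in range(d)]
        for i in range(d)
    ]

# Illustrative data: a strictly increasing transform leaves the ranks,
# and therefore the estimate, unchanged.
x = [0.3, 1.2, 2.5, 3.1, 4.8]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
y_transformed = [math.exp(v) for v in y]
same = rank_based_correlation(x, y) == rank_based_correlation(x, y_transformed)
```

Because the estimate depends only on the ordering of the observations, it does not change under strictly increasing marginal transformations, which is the kind of robustness the abstract attributes to rank-based estimators.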
Diversity in Ranking using Negative Reinforcement
Badrinath, Rama, Madhavan, C. E. Veni
In this paper, we consider the problem of diversity in ranking of the nodes in a graph. The task is to pick the top-k nodes in the graph which are both 'central' and 'diverse'. Many graph-based models in NLP, such as text summarization and opinion summarization, involve the concept of diversity in generating the summaries. We develop a novel method which works in an iterative fashion based on random walks to achieve diversity. Specifically, we use negative reinforcement as a main tool to introduce diversity in the Personalized PageRank framework. Experiments on two benchmark datasets show that our algorithm is competitive with the existing methods.
Identifying Users From Their Rating Patterns
Bento, José, Fawaz, Nadia, Montanari, Andrea, Ioannidis, Stratis
This paper reports on our analysis of the 2011 CAMRa Challenge dataset (Track 2) for context-aware movie recommendation systems. The train dataset comprises 4,536,891 ratings provided by 171,670 users on 23,974 movies, as well as the household groupings of a subset of the users. The test dataset comprises 5,450 ratings for which the user label is missing, but the household label is provided. The challenge was to identify the user labels for the ratings in the test set. Our main finding is that temporal information (time labels of the ratings) is significantly more useful for achieving this objective than the user preferences (the actual ratings). Using a model that leverages this fact, we are able to identify users within a known household with an accuracy of approximately 96% (i.e. misclassification rate around 4%).