Supervised Learning
On the ERM Principle with Networked Data
Wang, Yuanhong, Wang, Yuyi, Liu, Xingwu, Pu, Juhua
Networked data, in which every training example involves two objects and may share some common objects with others, is used in many machine learning tasks such as learning to rank and link prediction. A challenge of learning from networked examples is that target values are not known for some pairs of objects. In this case, neither the classical i.i.d. assumption nor techniques based on complete U-statistics can be used. Most existing theoretical results on this problem only deal with the classical empirical risk minimization (ERM) principle, which weights every example equally, but this strategy leads to unsatisfactory bounds. We consider general weighted ERM and show new universal risk bounds for this problem. These new bounds naturally define an optimization problem whose solution yields appropriate weights for networked examples. Though this optimization problem is not convex in general, we devise a new fully polynomial-time approximation scheme (FPTAS) to solve it.
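As a hedged illustration of the weighted ERM objective discussed above (not the paper's FPTAS, and not its dependency-derived weights), the following minimal Python sketch contrasts uniform weighting with a hypothetical non-uniform weighting; the squared loss and the weight values are illustrative assumptions.

```python
import numpy as np

def weighted_erm_loss(predictions, targets, weights):
    """Weighted empirical risk: sum_i w_i * loss_i with normalized weights.

    Classical ERM is the special case w_i = 1/n; for networked examples the
    paper instead derives weights from the dependency structure of the data.
    """
    losses = (predictions - targets) ** 2   # squared loss, purely for illustration
    weights = weights / weights.sum()       # normalize to a distribution
    return float(np.dot(weights, losses))

preds = np.array([0.9, 0.2, 0.6])
targets = np.array([1.0, 0.0, 1.0])
uniform = np.ones(3) / 3                    # classical ERM weighting
networked = np.array([0.5, 0.3, 0.2])       # hypothetical dependency-aware weights
print(weighted_erm_loss(preds, targets, uniform))
print(weighted_erm_loss(preds, targets, networked))
```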
Training large margin host-pathogen protein-protein interaction predictors
Basit, Abdul Hannan, Abbasi, Wajid Arshad, Asif, Amina, Minhas, Fayyaz Ul Amir Afsar
Detection of protein-protein interactions (PPIs) plays a vital role in molecular biology. In particular, infections are caused by interactions between host and pathogen proteins. It is important to identify host-pathogen interactions (HPIs) to discover new drugs to counter infectious diseases. Conventional wet-lab PPI detection techniques have limitations in terms of large-scale application and cost, so computational approaches have been developed to predict PPIs. This study aims to develop large margin machine learning models to predict interspecies PPIs, with a special interest in host-pathogen protein interactions (HPIs). In particular, we focus on three questions that arise while developing an HPI predictor: (1) How should we select negative examples? (2) How large should the negative set be relative to the positive set? (3) What type of margin-violation penalty should be used to train the predictor? We compare two available methods for negative sampling. Moreover, we propose a new method of assigning a weight to each training example in a weighted SVM, based on the distance of negative examples from positive examples. We have also developed a web server for our HPI predictor, called HoPItor (Host Pathogen Interaction predicTOR), that can predict interactions between human and viral proteins. The web server can be accessed at http://faculty.pieas.edu.pk/fayyaz/software.html#HoPItor.
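A minimal sketch of the distance-based weighting idea using scikit-learn's per-sample weights; the features, the data, and the exact form of the weighting function are stand-in assumptions, and the paper's scheme may differ.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.svm import SVC

def distance_based_weights(X_neg, X_pos):
    """Weight each negative example by its distance to the nearest positive,
    so negatives far from the positive class incur larger margin-violation
    penalties. The paper's exact weighting function may differ."""
    d = pairwise_distances(X_neg, X_pos).min(axis=1)
    return d / d.max()                      # scale weights into (0, 1]

rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 1.0, size=(50, 10))    # stand-in protein-pair features
X_neg = rng.normal(2.0, 1.0, size=(100, 10))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 100)
w = np.concatenate([np.ones(50), distance_based_weights(X_neg, X_pos)])

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=w)              # per-example penalties in the weighted SVM
```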
Traversing Knowledge Graph in Vector Space without Symbolic Space Guidance
Shen, Yelong, Huang, Po-Sen, Chang, Ming-Wei, Gao, Jianfeng
Recent studies on knowledge base completion, the task of recovering missing facts based on observed facts, demonstrate the importance of learning embeddings from multi-step relations. Due to the size of knowledge bases, previous work manually designs relation paths over observed triplets in symbolic space (e.g., via random walks) to learn multi-step relations during training. However, these approaches suffer from limitations: most paths are not informative, and it is prohibitively expensive to consider all possible paths. To address these limitations, we propose learning to traverse in vector space directly, without the need for symbolic-space guidance. To remember the connections between related observed triplets and to adaptively change relation paths in vector space, we propose Implicit ReasoNets (IRNs), which are composed of a global memory and a controller module, and which learn multi-step relation paths in vector space and infer missing facts jointly without any human-designed procedure. Without using any auxiliary information, our proposed model achieves state-of-the-art results on popular knowledge base completion benchmarks.
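A heavily simplified sketch of the traversal loop the abstract describes, assuming a fixed memory and an unlearned state update; the actual IRN learns all of these components end to end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def traverse(query, memory, steps=3):
    """Hypothetical multi-step inference in vector space: at each step, attend
    over a shared memory and fold the read vector back into the state. The
    actual IRN uses learned parameters and a termination controller."""
    state = query
    for _ in range(steps):
        attention = softmax(memory @ state)     # similarity-based addressing
        read = attention @ memory               # weighted read from memory
        state = 0.5 * state + 0.5 * read        # toy state update
    return state

rng = np.random.default_rng(0)
memory = rng.normal(size=(128, 32))             # global memory slots
query = rng.normal(size=32)                     # e.g. a head-entity/relation encoding
print(traverse(query, memory).shape)            # (32,)
```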
Entity Embeddings with Conceptual Subspaces as a Basis for Plausible Reasoning
Jameel, Shoaib, Schockaert, Steven
Conceptual spaces are geometric representations of conceptual knowledge, in which entities correspond to points, natural properties correspond to convex regions, and the dimensions of the space correspond to salient features. While conceptual spaces enable elegant models of various cognitive phenomena, the lack of automated methods for constructing such representations has so far limited their application in artificial intelligence. To address this issue, we propose a method which learns a vector-space embedding of entities from Wikipedia and constrains this embedding such that entities of the same semantic type are located in some lower-dimensional subspace. We experimentally demonstrate the usefulness of these subspaces as (approximate) conceptual space representations by showing, among other things, that important features can be modelled as directions and that natural properties tend to correspond to convex regions.
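A small sketch of the "features as directions" idea, under synthetic stand-in data: a linear classifier fit on entity embeddings yields a direction along which entities can be ranked by the property. All sizes and names here are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: entity embeddings and a binary property label
# (e.g. "is a thriller" for movie entities).
rng = np.random.default_rng(0)
true_direction = rng.normal(size=300)
E = rng.normal(size=(1000, 300))                 # entity embeddings
labels = (E @ true_direction > 0).astype(int)    # synthetic property labels

# Fit a linear classifier; its weight vector acts as the direction modelling
# the feature, and ranking entities along it orders them by the property.
clf = LogisticRegression(max_iter=1000).fit(E, labels)
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
scores = E @ direction
print(scores[:5])
```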
Classification on Large Networks: A Quantitative Bound via Motifs and Graphons
Haupt, Andreas, Khatami, Mohammad, Schultz, Thomas, Tran, Ngoc Mai
When each data point is a large graph, graph statistics such as the densities of certain subgraphs (motifs) can be used as feature vectors for machine learning. While intuitive, motif counts are expensive to compute and difficult to work with theoretically. Via graphon theory, we give an explicit quantitative bound on the ability of motif homomorphisms to distinguish large networks under both generative and sampling noise. Furthermore, we give similar bounds for the graph spectrum and connect it to homomorphism densities of cycles. This results in an easily computable classifier on graph data with a theoretical performance guarantee. Our method yields competitive results on classification tasks for the autoimmune disease Lupus Erythematosus.
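Two of the simplest motif homomorphism densities are directly computable from the adjacency matrix; a minimal sketch (edge and triangle densities, with a toy Erdos-Renyi sanity check):

```python
import numpy as np

def edge_density(A):
    """t(K2, G): homomorphism density of an edge in a graph with adjacency A."""
    n = A.shape[0]
    return A.sum() / n**2

def triangle_density(A):
    """t(C3, G): trace(A^3) counts closed 3-walks, i.e. C3 homomorphisms."""
    n = A.shape[0]
    return np.trace(A @ A @ A) / n**3

# Toy check: an Erdos-Renyi graph with edge probability p has t(K2) ~ p
# and t(C3) ~ p^3.
rng = np.random.default_rng(0)
p = 0.3
A = (rng.random((200, 200)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # symmetric, no self-loops
print(edge_density(A), triangle_density(A))
```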
Elliptical modeling and pattern analysis for perturbation models and classification
Suthaharan, Shan, Shen, Weining
The characteristics (or numerical patterns) of a feature vector in the transform domain of a perturbation model differ significantly from those of its corresponding feature vector in the input domain. These differences, caused by the perturbation techniques used to transform the feature patterns, degrade the performance of machine learning techniques in the transform domain. In this paper, we propose a nonlinear parametric perturbation model that transforms input feature patterns into a set of elliptical patterns, and study the performance degradation issues associated with the random forest classification technique using both input-domain and transform-domain features. Compared with linear transformations such as Principal Component Analysis (PCA), the proposed method requires fewer statistical assumptions and is highly suitable for applications such as data privacy and security, owing to the difficulty of inverting the elliptical patterns from the transform domain back to the input domain. In addition, we adopt a flexible block-wise dimensionality reduction step in the proposed method to accommodate the high-dimensional data common in modern applications. We evaluate the empirical performance of the proposed method on a network intrusion data set and a biological data set, and compare the results with PCA in terms of classification performance and data privacy protection (measured by the blind source separation attack and the signal-to-interference ratio). Both sets of results confirm the superior performance of the proposed elliptical transformation.
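A hedged sketch of the block-wise dimensionality reduction step, assuming plain per-block PCA as the reducer; the paper's actual reduction operates within its elliptical perturbation model, and the block size and component counts here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def blockwise_reduce(X, block_size=10, n_components=2):
    """Split the feature vector into contiguous blocks and reduce each block
    separately. The paper applies a block-wise reduction inside its elliptical
    perturbation model; plain PCA per block is used here only for illustration."""
    blocks = []
    for start in range(0, X.shape[1], block_size):
        block = X[:, start:start + block_size]
        k = min(n_components, block.shape[1])
        blocks.append(PCA(n_components=k).fit_transform(block))
    return np.hstack(blocks)

X = np.random.default_rng(0).normal(size=(500, 100))   # stand-in high-dimensional data
print(blockwise_reduce(X).shape)                       # (500, 20)
```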
Deep Feature Learning for Graphs
Rossi, Ryan A., Zhou, Rong, Ahmed, Nesreen K.
This paper presents a general graph representation learning framework called DeepGL for learning deep node and edge representations from large (attributed) graphs. In particular, DeepGL begins by deriving a set of base features (e.g., graphlet features) and automatically learns a multi-layered hierarchical graph representation in which each successive layer leverages the output of the previous layer to learn higher-order features. Contrary to previous work, DeepGL learns relational functions (each representing a feature) that generalize across networks and are therefore useful for graph-based transfer learning tasks. Moreover, DeepGL naturally supports attributed graphs, learns interpretable features, and is space-efficient (by learning sparse feature vectors). In addition, DeepGL is expressive, flexible with many interchangeable components, efficient with a time complexity of $\mathcal{O}(|E|)$, and scalable to large networks via an efficient parallel implementation. Compared with the state-of-the-art method, DeepGL is (1) effective for across-network transfer learning tasks and attributed graph representation learning, (2) space-efficient, requiring up to 6x less memory, (3) fast, with up to 182x speedup in runtime performance, and (4) accurate, with an average improvement of 20% or more on many learning tasks.
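A toy sketch of the layered construction described above, assuming node degree as the base feature and mean/max neighbor aggregation as the relational operators; DeepGL's learned, pruned, and sparsified feature sets are much richer than this.

```python
import numpy as np

def layered_graph_features(A, n_layers=3):
    """DeepGL-style sketch: start from a base feature (node degree) and let
    each layer apply simple relational operators (mean and max over
    neighbors) to the previous layer's features."""
    deg = A.sum(axis=1, keepdims=True)
    inv_deg = 1.0 / np.maximum(deg, 1.0)
    F = deg                                      # layer 0: base feature
    layers = [F]
    for _ in range(n_layers):
        mean_agg = inv_deg * (A @ F)             # mean over neighbors
        max_agg = np.stack([
            F[A[i] > 0].max(axis=0) if A[i].any() else np.zeros(F.shape[1])
            for i in range(A.shape[0])
        ])
        F = np.hstack([mean_agg, max_agg])       # higher-order features
        layers.append(F)
    return np.hstack(layers)

rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T
print(layered_graph_features(A).shape)           # (50, 15)
```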
An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists
Chazal, Frédéric, Michel, Bertrand
Topological Data Analysis (tda) is a recent and fast-growing field providing a set of new topological and geometric tools to infer relevant features from possibly complex data. This paper is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of tda for non-experts.

1 Introduction and motivation

Topological Data Analysis (tda) is a recent field that emerged from various works in applied (algebraic) topology and computational geometry during the first decade of the century. Although one can trace geometric approaches to data analysis quite far back in the past, tda really started as a field with the pioneering works of Edelsbrunner et al. (2002) and Zomorodian and Carlsson (2005) in persistent homology, and was popularized in a landmark paper in 2009, Carlsson (2009). tda is mainly motivated by the idea that topology and geometry provide a powerful approach to infer robust qualitative, and sometimes quantitative, information about the structure of data; see, e.g., Chazal (2017). tda aims at providing well-founded mathematical, statistical and algorithmic methods to infer, analyze and exploit the complex topological and geometric structures underlying data that are often represented as point clouds in Euclidean or more general metric spaces. During the last few years, a considerable effort has been made to provide robust and efficient data structures and algorithms for tda that are now implemented, available and easy to use through standard libraries such as the Gudhi library (C++ and Python) Maria et al. (2014) and its R software interface Fasy et al. (2014a). Although it is still rapidly evolving, tda now provides a set of mature and efficient tools that can be used in combination with, or complementary to, other data science tools.

The tda pipeline. tda has recently seen developments in various directions and application fields. There now exists a large variety of methods inspired by topological and geometric approaches. Providing a complete overview of all these existing approaches is beyond the scope of this introductory survey. However, most of them rely on the following basic and standard pipeline, which will serve as the backbone of this paper:

1. The input is assumed to be a finite set of points coming with a notion of distance, or similarity, between them. This distance can be induced by the metric in the ambient space (e.g. the Euclidean metric when the data are embedded in R^d) or come as an intrinsic metric defined by a pairwise distance matrix. The definition of the metric on the data is usually given as an input or guided by the application. It is, however, important to notice that the choice of the metric may be critical to reveal interesting topological and geometric features of the data.
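The first steps of this pipeline can be made concrete with the Gudhi library cited above; a minimal sketch, where the Vietoris-Rips filtration and all parameter values are one standard, illustrative choice.

```python
import numpy as np
import gudhi

# Point cloud in, persistence diagram out, using the Gudhi library mentioned
# above.
points = np.random.default_rng(0).normal(size=(100, 2))

rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)   # simplices up to dimension 2
diagram = st.persistence()                       # [(dim, (birth, death)), ...]

for dim, (birth, death) in diagram[:5]:
    print(f"H{dim}: born {birth:.3f}, dies {death:.3f}")
```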
Supervised Learning with Indefinite Topological Kernels
Padellini, Tullia, Brutti, Pierpaolo
Topological Data Analysis (TDA) is a recent and growing branch of statistics devoted to the study of the shape of data. In this work we investigate the predictive power of TDA in the context of supervised learning. Since topological summaries, most notably the persistence diagram, are typically defined in complex spaces, we adopt a kernel approach to translate them into more familiar vector spaces. We define a topological exponential kernel, characterize it, and show that, despite not being positive semi-definite, it can be successfully used in regression and classification tasks.
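A minimal sketch of an exponential kernel on persistence diagrams, assuming the bottleneck distance (computed via Gudhi) as the underlying metric; the kernel the paper defines and characterizes may use a different distance.

```python
import numpy as np
import gudhi

def exponential_diagram_kernel(D1, D2, sigma=1.0):
    """Exponential kernel on persistence diagrams in the spirit of the paper:
    k(D1, D2) = exp(-d(D1, D2)^2 / (2 * sigma^2)), here with the bottleneck
    distance as d. As the abstract notes of its kernel, such constructions
    are not positive semi-definite in general but can still work in practice."""
    d = gudhi.bottleneck_distance(D1, D2)
    return np.exp(-d**2 / (2 * sigma**2))

# Diagrams as lists of (birth, death) pairs.
D1 = [(0.0, 1.0), (0.2, 0.6)]
D2 = [(0.0, 0.9), (0.3, 0.5)]
print(exponential_diagram_kernel(D1, D2))
```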