Inductive Learning
Learning to Rank based on Analogical Reasoning
Fahandar, Mohsen Ahmadi, Hüllermeier, Eyke
Object ranking or "learning to rank" is an important problem in the realm of preference learning. On the basis of training data in the form of a set of rankings of objects represented as feature vectors, the goal is to learn a ranking function that predicts a linear order of any new set of objects. In this paper, we propose a new approach to object ranking based on principles of analogical reasoning. More specifically, our inference pattern is formalized in terms of so-called analogical proportions and can be summarized as follows: Given objects $A,B,C,D$, if object $A$ is known to be preferred to $B$, and $C$ relates to $D$ as $A$ relates to $B$, then $C$ is (supposedly) preferred to $D$. Our method applies this pattern as a main building block and combines it with ideas and techniques from instance-based learning and rank aggregation. Based on first experimental results for data sets from various domains (sports, education, tourism, etc.), we conclude that our approach is highly competitive. It appears to be specifically interesting in situations in which the objects are coming from different subdomains, and which hence require a kind of knowledge transfer.
Relief-Based Feature Selection: Introduction and Review
Urbanowicz, Ryan J., Meeker, Melissa, LaCava, William, Olson, Randal S., Moore, Jason H.
Feature selection plays a critical role in data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that strike an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
Hypergraph $p$-Laplacian: A Differential Geometry View
Saito, Shota, Mandic, Danilo P, Suzuki, Hideyuki
The graph Laplacian plays key roles in information processing of relational data, and has analogies with the Laplacian in differential geometry. In this paper, we generalize the analogy between graph Laplacian and differential geometry to the hypergraph setting, and propose a novel hypergraph $p$-Laplacian. Unlike the existing two-node graph Laplacians, this generalization makes it possible to analyze hypergraphs, where the edges are allowed to connect any number of nodes. Moreover, we propose a semi-supervised learning method based on the proposed hypergraph $p$-Laplacian, and formalize them as the analogue to the Dirichlet problem, which often appears in physics. We further explore theoretical connections to normalized hypergraph cut on a hypergraph, and propose normalized cut corresponding to hypergraph $p$-Laplacian. The proposed $p$-Laplacian is shown to outperform standard hypergraph Laplacians in the experiment on a hypergraph semi-supervised learning and normalized cut setting.
On the ERM Principle with Networked Data
Wang, Yuanhong, Wang, Yuyi, Liu, Xingwu, Pu, Juhua
Networked data, in which every training example involves two objects and may share some common objects with others, is used in many machine learning tasks such as learning to rank and link prediction. A challenge of learning from networked examples is that target values are not known for some pairs of objects. In this case, neither the classical i.i.d.\ assumption nor techniques based on complete U-statistics can be used. Most existing theoretical results of this problem only deal with the classical empirical risk minimization (ERM) principle that always weights every example equally, but this strategy leads to unsatisfactory bounds. We consider general weighted ERM and show new universal risk bounds for this problem. These new bounds naturally define an optimization problem which leads to appropriate weights for networked examples. Though this optimization problem is not convex in general, we devise a new fully polynomial-time approximation scheme (FPTAS) to solve it.
Training large margin host-pathogen protein-protein interaction predictors
Basit, Abdul Hannan, Abbasi, Wajid Arshad, Asif, Amina, Minhas, Fayyaz Ul Amir Afsar
Detection of protein-protein interactions (PPIs) plays a vital role in molecular biology. Particularly, infections are caused by the interactions of host and pathogen proteins. It is important to identify host-pathogen interactions (HPIs) to discover new drugs to counter infectious diseases. Conventional wet lab PPI prediction techniques have limitations in terms of large scale application and budget. Hence, computational approaches are developed to predict PPIs. This study aims to develop large margin machine learning models to predict interspecies PPIs with a special interest in host-pathogen protein interactions (HPIs). Especially, we focus on seeking answers to three queries that arise while developing an HPI predictor. 1) How should we select negative samples? 2) What should be the size of negative samples as compared to the positive samples? 3) What type of margin violation penalty should be used to train the predictor? We compare two available methods for negative sampling. Moreover, we propose a new method of assigning weights to each training example in weighted SVM depending on the distance of the negative examples from the positive examples. We have also developed a web server for our HPI predictor called HoPItor (Host Pathogen Interaction predicTOR) that can predict interactions between human and viral proteins. This webserver can be accessed at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#HoPItor.
FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for Classification
Kowsari, Kamran, Bari, Nima, Vichr, Roman, Goodarzi, Farhad A.
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary Meta-Feature (FSL-BM) for big data classification task. The study of real-time algorithms addresses several major concerns, which are namely: accuracy, memory consumption, and ability to stretch assumptions and time complexity. Attaining a fast computational model providing fuzzy logic and supervised learning is one of the main challenges in the machine learning. In this research paper, we present FSL-BM algorithm as an efficient solution of supervised learning with fuzzy logic processing using binary meta-feature representation using Hamming Distance and Hash function to relax assumptions. While many studies focused on reducing time complexity and increasing accuracy during the last decade, the novel contribution of this proposed solution comes through integration of Hamming Distance, Hash function, binary meta-features, binary classification to provide real time supervised method. Hash Tables (HT) component gives a fast access to existing indices; and therefore, the generation of new indices in a constant time complexity, which supersedes existing fuzzy supervised algorithms with better or comparable results. To summarize, the main contribution of this technique for real-time Fuzzy Supervised Learning is to represent hypothesis through binary input as meta-feature space and creating the Fuzzy Supervised Hash table to train and validate model.
Generative Adversarial Active Learning
We propose a new active learning by query synthesis approach using Generative Adversarial Networks (GAN). Different from regular active learning, the resulting algorithm adaptively synthesizes training instances for querying to increase learning speed. We generate queries according to the uncertainty principle, but our idea can work with other active learning principles. We report results from various numerical experiments to demonstrate the effectiveness the proposed approach. In some settings, the proposed algorithm outperforms traditional pool-based approaches. To the best our knowledge, this is the first active learning work using GAN.
North Dakota Museum Property Rights Case Set to Trial
The case was considered in district court in 2014. The next year, the North Dakota Legislature rejected a bill that would have sided with the historical society and allowed the museum to stay on the fairgrounds. The case returned to district court in 2015, but the original judge recused himself at the end of last year.
4 Steps to Machine Learning with Pentaho
The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. However, perhaps less well known is how PDI as a platform, with all its data munging[1] power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM[2] life-cycle for the data science practitioner: generic data preparation/feature engineering, predictive modeling, and model deployment. By "generic data preparation" we are referring to the process of connecting to (potentially) multiple heterogeneous data sources and then joining, blending, cleaning, filtering, deriving and denormalizing data so that it ready for consumption by machine learning (ML) algorithms. Further ML-specific data transformations, such as supervised discretization, one-hot encoding etc. can then be applied as needed in an ML tool. For the data scientist, PDI can be used to remove the repetitive drudgery involved with manually performing similar data preparation processes repetitively, from one dataset to the next.
pyLEMMINGS: Large Margin Multiple Instance Classification and Ranking for Bioinformatics Applications
Asif, Amina, Abbasi, Wajid Arshad, Munir, Farzeen, Ben-Hur, Asa, Minhas, Fayyaz ul Amir Afsar
Motivation: A major challenge in the development of machine learning based methods in computational biology is that data may not be accurately labeled due to the time and resources required for experimentally annotating properties of proteins and DNA sequences. Standard supervised learning algorithms assume accurate instance-level labeling of training data. Multiple instance learning is a paradigm for handling such labeling ambiguities. However, the widely used large-margin classification methods for multiple instance learning are heuristic in nature with high computational requirements. In this paper, we present stochastic sub-gradient optimization large margin algorithms for multiple instance classification and ranking, and provide them in a software suite called pyLEMMINGS. Results: We have tested pyLEMMINGS on a number of bioinformatics problems as well as benchmark datasets. pyLEMMINGS has successfully been able to identify functionally important segments of proteins: binding sites in Calmodulin binding proteins, prion forming regions, and amyloid cores. pyLEMMINGS achieves state-of-the-art performance in all these tasks, demonstrating the value of multiple instance learning. Furthermore, our method has shown more than 100-fold improvement in terms of running time as compared to heuristic solutions with improved accuracy over benchmark datasets. Availability and Implementation: pyLEMMINGS python package is available for download at: http://faculty.pieas.edu.pk/fayyaz/software.html#pylemmings.