Accuracy
KTBoost: Combined Kernel and Tree Boosting
Boosting algorithms [Freund et al., 1996, Friedman et al., 2000, Mason et al., 2000, Friedman, 2001,Bรผhlmann and Hothorn, 2007] enjoy large popularity in both applied data analysis and machine learning research due to their high predictive accuracy observed on a wide range of data sets [Chen and Guestrin, 2016]. Boosting additively combines base learners by sequentially minimizing a risk functional. To the best of our knowledge, except for one reference [Hothorn et al., 2010], the large majority of boosting algorithms use only one type of functions as base learners. In this article, we relax this assumption by combining trees[Breiman et al., 1984] and reproducing kernel Hilbert space (RKHS) regression functions [Schรถlkopf and Smola, 2001, Berlinet and Thomas-Agnan, 2011] as base learners, and we empirically show that this combination of different base learners results in increased predictive accuracy compared to both only tree and kernel boosting. To date, regression trees are the most common choice of base learners for boosting in both applied data analysis and machine learning research. In particular, a lot of effort has been made in recent years to develop tree-based boosting methods that scale to large data [Chen and Guestrin, 2016, Ke et al., 2017, Ponomareva et al., 2017, Prokhorenkova et al., 2018]. On the other hand, kernel machines show state-of-the-art predictive accuracy for many data sets as they can, for instance, achieve near-optimal test error [Belkin et al., 2018b,a], and kernel classifiers parallel the behaviors of deep networks as noted in Zhang
Reconstructing dynamical networks via feature ranking
Leguia, Marc G., Levnajic, Zoran, Todorovski, Ljupco, Zenko, Bernard
Empirical data on real complex systems are becoming increasingly available. Parallel to this is the need for new methods of reconstructing (inferring) the topology of networks from time-resolved observations of their node-dynamics. The methods based on physical insights often rely on strong assumptions about the properties and dynamics of the scrutinized network. Here, we use the insights from machine learning to design a new method of network reconstruction that essentially makes no such assumptions. Specifically, we interpret the available trajectories (data) as features, and use two independent feature ranking approaches -- Random forest and RReliefF -- to rank the importance of each node for predicting the value of each other node, which yields the reconstructed adjacency matrix. We show that our method is fairly robust to coupling strength, system size, trajectory length and noise. We also find that the reconstruction quality strongly depends on the dynamical regime.
Crime Linkage Detection by Spatio-Temporal-Textual Point Processes
Crimes emerge out of complex interactions of behaviors and situations; thus there are complex linkages between crime incidents. Solving the puzzle of crime linkage is a highly challenging task because we often only have limited information from indirect observations such as records, text descriptions, and associated time and locations. We propose a new modeling and learning framework for detecting linkage between crime events using \textit{spatio-temporal-textual} data, which are highly prevalent in the form of police reports. We capture the notion of \textit{modus operandi} (M.O.), by introducing a multivariate marked point process and handling the complex text jointly with the time and location. The model is able to discover the latent space that links the crime series. The model fitting is achieved by a computationally efficient Expectation-Maximization (EM) algorithm. In addition, we explicitly reduce the bias in the text documents in our algorithm. Our numerical results using real data from the Atlanta Police show that our method has competitive performance relative to the state-of-the-art. Our results, including variable selection, are highly interpretable and may bring insights into M.O. extraction.
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
Schubert, Erich, Zimek, Arthur
This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We will first outline the motivation for this release, the plans for the future, and then give a brief overview over the new functionality in this version. We also include an appendix presenting an overview on the overall implemented functionality.
Identifying Fake News from Twitter Sharing Data: A Large-Scale Study
Agrawal, Rakshit, de Alfaro, Luca, Ballarin, Gabriele, Moret, Stefano, Di Pierro, Massimo, Tacchini, Eugenio, Della Vedova, Marco L.
Social networks offer a ready channel for fake and misleading news to spread and exert influence. This paper examines the performance of different reputation algorithms when applied to a large and statistically significant portion of the news that are spread via Twitter. Our main result is that simple crowdsourcing-based algorithms are able to identify a large portion of fake or misleading news, while incurring only very low false positive rates for mainstream websites. We believe that these algorithms can be used as the basis of practical, large-scale systems for indicating to consumers which news sites deserve careful scrutiny and skepticism.
Learning From Noisy Labels By Regularized Estimation Of Annotator Confusion
Tanno, Ryutaro, Saeedi, Ardavan, Sankaranarayanan, Swami, Alexander, Daniel C., Silberman, Nathan
The predictive performance of supervised learning algorithms depends on the quality of labels. In a typical label collection process, multiple annotators provide subjective noisy estimates of the "truth" under the influence of their varying skill-levels and biases. Blindly treating these noisy labels as the ground truth limits the accuracy of learning algorithms in the presence of strong disagreement. This problem is critical for applications in domains such as medical imaging where both the annotation cost and inter-observer variability are high. In this work, we present a method for simultaneously learning the individual annotator model and the underlying true label distribution, using only noisy observations. Each annotator is modeled by a confusion matrix that is jointly estimated along with the classifier predictions. We propose to add a regularization term to the loss function that encourages convergence to the true annotator confusion matrix. We provide a theoretical argument as to how the regularization is essential to our approach both for the case of single annotator and multiple annotators. Despite the simplicity of the idea, experiments on image classification tasks with both simulated and real labels show that our method either outperforms or performs on par with the state-of-the-art methods and is capable of estimating the skills of annotators even with a single label available per image.
Introduction to "Advances in Financial Machine Learning" by Lopez de Prado
Machine learning is a buzzword often thrown about when discussing the future of finance and the world. You may have heard of neural networks solving problems in facial recognition, language processing, and even financial markets, yet without much explanation. It is easy to view this field as a black box, a magic machine that somehow produces solutions, but nobody knows why it works. It is true that machine learning techniques (neural networks in particular) pick up on obscure and hard to explain features, however there is more room for research, customization, and analysis than may first appear. Today we'll be discussing at a high level the various factors to be considered when researching investing through the lens of machine learning. The contents of this notebook and further discussions on this topic are heavily inspired by Marcos Lopez de Prado's book Advances in Financial Machine Learning.
Distance metric learning based on structural neighborhoods for dimensionality reduction and classification performance improvement
Ghods, Mostafa Razavi, Moattar, Mohammad Hossein, Forghani, Yahya
Distance metric learning can be viewed as one of the fundamental interests in pattern recognition and machine learning, which plays a pivotal role in the performance of many learning methods. One of the effective methods in learning such a metric is to learn it from a set of labeled training samples. The issue of data imbalance is the most important challenge of recent methods. This research tries not only to preserve the local structures but also covers the issue of imbalanced datasets. To do this, the proposed method first tries to extract a low dimensional manifold from the input data. Then, it learns the local neighborhood structures and the relationship of the data points in the ambient space based on the adjacencies of the same data points on the embedded low dimensional manifold. Using the local neighborhood relationships extracted from the manifold space, the proposed method learns the distance 1 metric in a way which minimizes the distance between similar data and maximizes their distance from the dissimilar data points. The evaluations of the proposed method on numerous datasets from the UCI repository of machine learning, and also the KDDCup98 dataset as the most imbalance dataset, justify the supremacy of the proposed approach in comparison with other approaches especially when the imbalance factor is high.
Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition
Yang, Xiao-Hui, Tian, Li, Chen, Yun-Mei, Yang, Li-Jun, Xu, Shuang, Wu, Wen-Ming
Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffer from requiring enough training samples, insufficient use of test samples and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is firstly proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, the functional analysis of candidate's pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.
Community detection of survey responses based on Pearson correlation coefficient with Neo4j
Just a few days ago a new version of Neo4j graph algorithms plugin was released. With the new release come new algorithms and Pearson correlation algorithm is one of them. To demonstrate how to use Pearson correlation algorithm in Neo4j we will use the data from "Young People Survey" Kaggle dataset made available by Miroslav Sabo. It contains results of 1010 filled out surveys with questions ranging from music preferences, hobbies & interests to phobias. The nice thing about using Pearson correlation in scoring scenarios is that it takes into account when voters are generally more inclined to give higher or lower scores as it compares each score to the average score of the user.