Performance Analysis
Learning From Noisy Labels By Regularized Estimation Of Annotator Confusion
Tanno, Ryutaro, Saeedi, Ardavan, Sankaranarayanan, Swami, Alexander, Daniel C., Silberman, Nathan
The predictive performance of supervised learning algorithms depends on the quality of labels. In a typical label collection process, multiple annotators provide subjective noisy estimates of the "truth" under the influence of their varying skill-levels and biases. Blindly treating these noisy labels as the ground truth limits the accuracy of learning algorithms in the presence of strong disagreement. This problem is critical for applications in domains such as medical imaging where both the annotation cost and inter-observer variability are high. In this work, we present a method for simultaneously learning the individual annotator model and the underlying true label distribution, using only noisy observations. Each annotator is modeled by a confusion matrix that is jointly estimated along with the classifier predictions. We propose to add a regularization term to the loss function that encourages convergence to the true annotator confusion matrix. We provide a theoretical argument as to how the regularization is essential to our approach both for the case of single annotator and multiple annotators. Despite the simplicity of the idea, experiments on image classification tasks with both simulated and real labels show that our method either outperforms or performs on par with the state-of-the-art methods and is capable of estimating the skills of annotators even with a single label available per image.
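The joint objective described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which regularizer is used, so the trace of each confusion matrix is assumed here, and the function name, argument layout, and `lam` default are hypothetical.

```python
import math


def annotator_loss(probs, confusions, noisy_labels, lam=0.01):
    """Cross-entropy on annotator-specific predictions plus a trace penalty.

    probs[n][j]         : classifier probability that sample n has true class j
    confusions[r][j][i] : P(annotator r reports class i | true class is j)
    noisy_labels[n][r]  : class reported by annotator r for sample n, or None
    lam                 : weight of the penalty (hypothetical default)
    """
    total, count = 0.0, 0
    for p, labels in zip(probs, noisy_labels):
        for r, label in enumerate(labels):
            if label is None:  # annotator r did not label this sample
                continue
            # Probability that annotator r reports `label`,
            # marginalizing over the unknown true class.
            q = sum(p[j] * confusions[r][j][label] for j in range(len(p)))
            total -= math.log(max(q, 1e-12))
            count += 1
    # ASSUMPTION: the regularizer is the summed trace of the confusion matrices.
    trace_penalty = sum(A[j][j] for A in confusions for j in range(len(A)))
    return total / max(count, 1) + lam * trace_penalty
```

In practice `probs` and `confusions` would both be trainable and updated by gradient descent on this loss; the sketch only evaluates the loss for fixed values, and it tolerates samples labeled by a single annotator.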
Introduction to "Advances in Financial Machine Learning" by Lopez de Prado
Machine learning is a buzzword often thrown about when discussing the future of finance and the world. You may have heard of neural networks solving problems in facial recognition, language processing, and even financial markets, yet without much explanation. It is easy to view this field as a black box, a magic machine that somehow produces solutions, but nobody knows why it works. It is true that machine learning techniques (neural networks in particular) pick up on obscure and hard-to-explain features; however, there is more room for research, customization, and analysis than may first appear. Today we'll be discussing at a high level the various factors to be considered when researching investing through the lens of machine learning. The contents of this notebook and further discussions on this topic are heavily inspired by Marcos Lopez de Prado's book Advances in Financial Machine Learning.
Distance metric learning based on structural neighborhoods for dimensionality reduction and classification performance improvement
Ghods, Mostafa Razavi, Moattar, Mohammad Hossein, Forghani, Yahya
Distance metric learning can be viewed as one of the fundamental interests in pattern recognition and machine learning, which plays a pivotal role in the performance of many learning methods. One of the effective methods in learning such a metric is to learn it from a set of labeled training samples. The issue of data imbalance is the most important challenge of recent methods. This research tries not only to preserve the local structures but also to address the issue of imbalanced datasets. To do this, the proposed method first tries to extract a low dimensional manifold from the input data. Then, it learns the local neighborhood structures and the relationship of the data points in the ambient space based on the adjacencies of the same data points on the embedded low dimensional manifold. Using the local neighborhood relationships extracted from the manifold space, the proposed method learns the distance metric in a way which minimizes the distance between similar data and maximizes their distance from the dissimilar data points. The evaluations of the proposed method on numerous datasets from the UCI repository of machine learning, and also the KDDCup98 dataset as the most imbalanced dataset, justify the supremacy of the proposed approach in comparison with other approaches especially when the imbalance factor is high.
Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition
Yang, Xiao-Hui, Tian, Li, Chen, Yun-Mei, Yang, Li-Jun, Xu, Shuang, Wu, Wen-Ming
Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffers from requiring enough training samples, insufficient use of test samples and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is first proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, the functional analysis of candidate's pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.
Community detection of survey responses based on Pearson correlation coefficient with Neo4j
Just a few days ago a new version of the Neo4j graph algorithms plugin was released. The new release brings new algorithms, and the Pearson correlation algorithm is one of them. To demonstrate how to use the Pearson correlation algorithm in Neo4j, we will use the data from the "Young People Survey" Kaggle dataset made available by Miroslav Sabo. It contains the results of 1010 filled-out surveys with questions ranging from music preferences and hobbies & interests to phobias. The nice thing about using Pearson correlation in scoring scenarios is that it accounts for voters who are generally inclined to give higher or lower scores, because it compares each score to that user's average score.
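To see why comparing each score to the user's own average matters, here is a minimal plain-Python sketch of the Pearson correlation coefficient. It is not the Neo4j procedure itself; the function name and the example score vectors are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Each score is centered on its own user's average before comparison.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user who gives identical scores has no defined correlation
    return cov / (norm_a * norm_b)


# Hypothetical respondents: same preferences, but one scores two points higher.
strict = [1, 2, 3, 2]
generous = [3, 4, 5, 4]
print(pearson(strict, generous))
```

Because both lists deviate from their own means in the same way, the coefficient comes out at (numerically) 1 even though the raw scores never match, which is the behavior the paragraph above describes.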
Twitter Still Can't Keep Up With Its Flood of Junk Accounts, Study Finds
Since the world learned of state-sponsored campaigns to spread disinformation on social media and sway the 2016 election, Twitter has scrambled to rein in the bots and trolls polluting its platform. But when it comes to the larger problem of automated accounts on Twitter designed to spread spam and scams, inflate follower counts, and game trending topics, one study argues that the company still isn't keeping up with the deluge of garbage and abuse. In fact, the paper's two researchers write that with a machine learning approach they developed themselves, they could identify abusive accounts in far greater volumes and faster than Twitter does--often flagging the accounts months before Twitter spotted and banned them. In a 16-month study of 1.5 billion tweets, Zubair Shafiq, a computer science professor at the University of Iowa, and his graduate student Shehroze Farooqi, identified more than 167,000 apps using Twitter's API to automate bot accounts that spread tens of millions of tweets pushing spam, links to malware, and astroturfing campaigns. They write that more than 60 percent of the time, Twitter waited for those apps to send more than 100 tweets before identifying them as abusive; the researchers' own detection method had flagged the vast majority of the malicious apps after just a handful of tweets.
'I nearly aborted my baby because of an unreliable test'
When Claire Bell became pregnant she paid for a test that would indicate whether the baby had Down's Syndrome - and agreed to be screened for some other rare conditions at the same time. Not long afterwards, writes the BBC's Charlotte Hayward, she received what appeared to be terrible news. For five years, Claire Bell's husband was treated for two types of cancer. When it finally came to an end the couple decided to try having a baby through IVF, using some sperm her husband had had frozen and stored before he had chemotherapy. On the first round, at the age of 41, she became pregnant - and felt incredibly lucky. "It was this miraculous pregnancy," she says.
Link Prediction via Higher-Order Motif Features
Abuoda, Ghadeer, Morales, Gianmarco De Francisci, Aboulnaga, Ashraf
Link prediction requires predicting which new links are likely to appear in a graph. Being able to predict unseen links with good accuracy has important applications in several domains such as social media, security, transportation, and recommendation systems. A common approach is to use features based on the common neighbors of an unconnected pair of nodes to predict whether the pair will form a link in the future. In this paper, we present an approach for link prediction that relies on higher-order analysis of the graph topology, well beyond common neighbors. We treat the link prediction problem as a supervised classification problem, and we propose a set of features that depend on the patterns or motifs that a pair of nodes occurs in. By using motifs of sizes 3, 4, and 5, our approach captures a high level of detail about the graph topology within the neighborhood of the pair of nodes, which leads to a higher classification accuracy. In addition to proposing the use of motif-based features, we also propose two optimizations related to constructing the classification dataset from the graph. First, to ensure that positive and negative examples are treated equally when extracting features, we propose adding the negative examples to the graph as an alternative to the common approach of removing the positive ones. Second, we show that it is important to control for the shortest-path distance when sampling pairs of nodes to form negative examples, since the difficulty of prediction varies with the shortest-path distance. We experimentally demonstrate that using off-the-shelf classifiers with a well constructed classification dataset results in up to 10 percentage points increase in accuracy over prior topology-based and feature learning methods.
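The common-neighbors baseline and the idea of controlling for shortest-path distance when sampling negative examples can be sketched as below. This is a toy illustration, not the paper's motif-based feature extraction; all function names and the example graph are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """Baseline feature: number of neighbors shared by u and v."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def shortest_path_distance(adj, source, target):
    """Breadth-first search distance in hops, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def negative_pairs_at_distance(adj, distance):
    """Unconnected node pairs whose shortest-path distance equals `distance`."""
    nodes = sorted(adj)
    return [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adj[u] and shortest_path_distance(adj, u, v) == distance
    ]
```

Drawing negative examples per distance bucket, for instance with `negative_pairs_at_distance(adj, 2)`, keeps easy far-apart pairs from dominating the evaluation. The all-pairs scan is quadratic and only suitable for small graphs.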
Machine learning and chord based feature engineering for genre prediction in popular Brazilian music
Wundervald, Bruna D., Zeviani, Walmes M.
Music genre can be hard to describe: many factors are involved, such as style, music technique, and historical context. Some genres even have overlapping characteristics. Looking for a better understanding of how music genres are related to musical harmonic structures, we gathered data about the music chords for thousands of popular Brazilian songs. Here, 'popular' does not only refer to the genre named MPB (Brazilian Popular Music) but to nine different genres that were considered particular to the Brazilian case. The main goals of the present work are to extract and engineer harmonically related features from chords data and to use it to classify popular Brazilian music genres towards establishing a connection between harmonic relationships and Brazilian genres. We also emphasize the generalisation of the method for obtaining the data, allowing for the replication and direct extension of this work. Our final model is a combination of multiple classification trees, also known as the random forest model. We found that features extracted from harmonic elements can satisfactorily predict music genre for the Brazilian case, as well as features obtained from the Spotify API. The variables considered in this work also give an intuition about how they relate to the genres.
An analytic formulation for positive-unlabeled learning via weighted integral probability metric
Kwon, Yongchan, Kim, Wonyoung, Sugiyama, Masashi, Paik, Myunghee Cho
We consider the problem of learning a binary classifier from only positive and unlabeled observations (PU learning). Although recent research in PU learning has succeeded in showing theoretical and empirical performance, most existing algorithms need to solve either a convex or a non-convex optimization problem and thus are not suitable for large-scale datasets. In this paper, we propose a simple yet theoretically grounded PU learning algorithm by extending the previous work proposed for supervised binary classification (Sriperumbudur et al., 2012). The proposed PU learning algorithm produces a closed-form classifier when the hypothesis space is a closed ball in reproducing kernel Hilbert space. In addition, we establish upper bounds of the estimation error and the excess risk. The obtained estimation error bound is sharper than existing results and the excess risk bound does not rely on an approximation error term. To the best of our knowledge, we are the first to explicitly derive the excess risk bound in the field of PU learning. Finally, we conduct extensive numerical experiments using both synthetic and real datasets, demonstrating improved accuracy, scalability, and robustness of the proposed algorithm.