 Nearest Neighbor Methods


K-NN active learning under local smoothness assumption

arXiv.org Machine Learning

There is a large body of work on convergence rates in both passive and active learning. Here we first outline some of the main results that have been obtained, more specifically in a nonparametric setting under assumptions about the smoothness of the regression function (or the boundary between classes) and the margin noise. We discuss the relative merits of these underlying assumptions by putting active learning in perspective with recent work on passive learning. We design an active learning algorithm with a rate of convergence better than in passive learning, using a particular smoothness assumption customized for k-nearest neighbors. Unlike previous active learning algorithms, we use a smoothness assumption that provides a dependence on the marginal distribution of the instance space. Additionally, our algorithm avoids the strong density assumption that supposes the existence of the density function of the marginal distribution of the instance space and is therefore more generally applicable.


Machine Learning & Tensorflow - Google Cloud Approach

#artificialintelligence

Students who have at least high school knowledge in math and who want to start learning Machine Learning. Intermediate-level learners who know the basics of machine learning, including classical algorithms like linear regression or logistic regression, but who want to learn more about it and explore the different fields of Machine Learning. People who are not that comfortable with coding but who are interested in Machine Learning and want to apply it easily to datasets. Anyone willing to learn machine learning on Google Cloud Platform. College students who want to start a career in Data Science. Data analysts who want to level up in Machine Learning.


Learning similarity measures from data

arXiv.org Machine Learning

Progress in Artificial Intelligence manuscript. Abstract: Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, data sets are typically gathered as part of constructing a CBR or machine learning system. These data sets are assumed to contain the features that correctly identify the solution from the problem features; thus they may also contain the knowledge to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning. Additionally, we would like to do this while keeping training time as low as possible. Working towards this, our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning, or one-shot learning tasks. Recent work has advanced towards this goal, but relies on either very long training times or manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state-of-the-art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state-of-the-art methods while keeping training time low.
Keywords: Similarity Measure, Data Science, Neural Networks, Data Analytics, Case-Based Reasoning, Similarity Function, Siamese Networks, Similarity metrics, Distance Metrics
This work was supported by the Research Council of Norway through the EXPOSED project (grant number 302002390) and the Norwegian Open AI Lab.
1 Introduction
Many artificial intelligence and machine learning (ML) methods, such as k-nearest neighbors (k-NN), rely on a similarity (or distance) measure [21] between data points. In case-based reasoning (CBR), a simple k-NN or a more complex similarity function is used to retrieve the stored cases that are most similar to the current query case.


Develop k-Nearest Neighbors in Python From Scratch

#artificialintelligence

In this tutorial you are going to learn about the k-Nearest Neighbors algorithm including how it works and how to implement it from scratch in Python (without libraries). A simple but powerful approach for making predictions is to use the most similar historical examples to the new data. This is the principle behind the k-Nearest Neighbors algorithm. Discover how to code ML algorithms from scratch including kNN, decision trees, neural nets, ensembles and much more in my new book, with full Python code and no fancy libraries.
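To make the principle concrete, here is a minimal sketch of k-Nearest Neighbors classification using only the Python standard library. It is not the tutorial's own code: the function names (`euclidean_distance`, `knn_predict`), the choice of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training rows."""
    # Rank every training row by its distance to the query point.
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    # Keep the labels of the k closest rows and return the most common one.
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters.
rows = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(rows, labels, [1.1, 1.0], k=3))  # prints "a"
print(knn_predict(rows, labels, [5.0, 5.0], k=3))  # prints "b"
```

Sorting the whole training set on every query is the simplest possible approach and costs O(n log n) per prediction; it is adequate for small datasets but would normally be replaced by a spatial index for larger ones.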


Model-Agnostic Approaches to Multi-Objective Simultaneous Hyperparameter Tuning and Feature Selection

arXiv.org Machine Learning

Highly non-linear machine learning algorithms have the capacity to handle large, complex datasets. However, the predictive performance of a model usually critically depends on the choice of multiple hyperparameters. Optimizing these often constitutes an expensive black-box problem. Model-based optimization is one state-of-the-art method to address this problem. Furthermore, resulting models often lack interpretability, as models usually contain many active features with non-linear effects and higher-order interactions. One model-agnostic way to enhance interpretability is to enforce sparse solutions through feature selection. In many applications it is desirable to accept a small drop in performance for a substantial gain in sparseness, leading to a natural treatment of feature selection as a multi-objective optimization task. Despite the practical relevance of both hyperparameter optimization and feature selection, they are often carried out separately from each other, which is neither efficient, nor does it take possible interactions between hyperparameters and selected features into account. We present, discuss and compare two algorithmically different approaches for joint and multi-objective hyperparameter optimization and feature selection: The first uses multi-objective model-based optimization to tune a feature filter ensemble. The second is an evolutionary NSGA-II-based wrapper approach to feature selection which incorporates specialized sampling, mutation and recombination operators for the joint decision space of included features and hyperparameter settings. We compare and discuss the approaches on a variety of benchmark tasks. While model-based optimization needs fewer objective evaluations to achieve good performance, it incurs significant overhead compared to the NSGA-II-based approach. The preferred choice depends on the cost of training the ML model on the given data.
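The trade-off between predictive performance and sparseness can be illustrated with a small Pareto-front computation. The sketch below is not either of the two approaches from the abstract; it only shows what "multi-objective" means here, assuming each candidate model is summarized by a hypothetical pair (error, number of selected features) with both objectives minimized.

```python
def dominates(p, q):
    """True if p is no worse than q on every objective and strictly better on one.

    Both objectives are assumed to be minimized.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical candidates as (validation error, number of selected features).
candidates = [(0.10, 40), (0.11, 12), (0.15, 5), (0.16, 12), (0.10, 45), (0.30, 5)]

print(pareto_front(candidates))  # prints [(0.1, 40), (0.11, 12), (0.15, 5)]
```

In this made-up example, moving from 40 features to 12 costs only 0.01 in error, which is the kind of exchange the abstract describes as desirable; the front leaves the final choice to the user rather than collapsing both objectives into one score.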


If Data is the New Oil, How to Determine Its Value?

#artificialintelligence

My iPhone screen time is over four hours every day. Over the last month I've booked restaurant reservations and doctor's appointments, received motorcycle maintenance records, loaded new applications and ordered clothes. All of these actions involved the sort of data exchanges that today's information-based tech companies crave. Applying machine learning tools to personal data can uncover valuable knowledge and generate tremendous business value. With data increasingly seen as "the new oil," many economists, politicians, and others are suggesting people should be paid for the data they produce.


Prediction of Physical Load Level by Machine Learning Analysis of Heart Activity after Exercises

arXiv.org Machine Learning

The assessment of energy expenditure in real life is of great importance for monitoring the current physical state of people, especially in work, sport, elderly care, health care, and even everyday life. This work reports on the application of several machine learning methods (linear regression, linear discriminant analysis, k-nearest neighbors, decision tree, random forest, Gaussian naive Bayes, support-vector machine) for monitoring energy expenditure in athletes. The classification problem was to predict the known level of the in-exercise loads (in three categories by calories) from the heart rate activity features measured during a short period of time (1 minute only) after training, i.e., from features of the post-exercise load. The results obtained show that the post-exercise heart activity features preserve the information about the in-exercise training loads and allow us to predict their actual in-exercise levels. The best performance was obtained by the random forest classifier with all 8 heart rate features (micro-averaged area under curve value AUCmicro = 0.87 and macro-averaged one AUCmacro = 0.88) and the k-nearest neighbors classifier with the 4 most important heart rate features (AUCmicro = 0.91 and AUCmacro = 0.89). The limitations and perspectives of the ML methods used are outlined, and some practical advice is proposed as to their improvement and implementation for better prediction of in-exercise energy expenditure.
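The difference between micro- and macro-averaged AUC for a three-class problem can be sketched as follows. The abstract does not state how the averages were computed, so this assumes the common one-vs-rest convention; the function names and the toy scores are hypothetical and the numbers are unrelated to the paper's results.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, score_rows, n_classes):
    """Average of the one-vs-rest AUC computed separately for each class."""
    per_class = []
    for c in range(n_classes):
        one_vs_rest = [1 if y == c else 0 for y in y_true]
        per_class.append(binary_auc(one_vs_rest, [row[c] for row in score_rows]))
    return sum(per_class) / n_classes


def micro_auc(y_true, score_rows, n_classes):
    """One AUC over all (sample, class) pairs pooled into a single binary problem."""
    flat_true = [1 if y == c else 0 for y in y_true for c in range(n_classes)]
    flat_scores = [row[c] for row in score_rows for c in range(n_classes)]
    return binary_auc(flat_true, flat_scores)


# Toy example: true load level (0, 1 or 2) and made-up per-class scores.
y = [0, 1, 2, 0, 1, 2]
scores = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
    [0.4, 0.5, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]

print("macro:", macro_auc(y, scores, 3))
print("micro:", micro_auc(y, scores, 3))
```

Macro averaging weights every class equally, while micro averaging pools all decisions, so the two can differ when classes are imbalanced or when one class is much harder to separate than the others.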


Extreme Learning Tree

arXiv.org Machine Learning

Anton Akusok 1, Emil Eirola 1, Kaj-Mikael Björk 2, Amaury Lendasse 3, 4
1 Arcada University of Applied Sciences, Helsinki, Finland
2 Risklab at Arcada UAS, Helsinki, Finland
3 Department of Mechanical and Industrial Engineering, The University of Iowa, Iowa City, USA
4 The Iowa Informatics Initiative, The University of Iowa, Iowa City, USA
Abstract The paper proposes a new variant of a decision tree, called an Extreme Learning Tree. It consists of an extremely random tree with nonlinear data transformation, and a linear observer that provides predictions based on the leaf index where the data samples fall. The proposed method outperforms linear models on a benchmark dataset, and may be a building block for a future variant of Random Forest.
1 Introduction Randomized methods are a recent trend in practical machine learning [1]. They enable the high performance of complex nonlinear methods without the high computational cost of their optimization. The most prominent current examples are randomized neural networks, in both feed-forward [2] and recurrent [3] forms. For the latter, the randomized approach provided an efficient training method for the first time, and enabled achieving state-of-the-art performance in multiple areas [4].


Machine Learning Basics with the K-Nearest Neighbors Algorithm

#artificialintelligence

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data. Imagine a computer is a child and we are its supervisor. We will show the child several different pictures, some of which are pigs, while the rest could be pictures of anything (cats, dogs, etc.). When we see a pig, we shout "pig!"
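Since the text notes that KNN handles regression as well as classification, here is a minimal sketch of the regression case: the prediction is the mean of the targets of the k nearest labeled points. The function name `knn_regress`, the use of Euclidean distance via `math.dist` (Python 3.8+), and the toy data are assumptions for illustration, not the article's code.

```python
import math


def knn_regress(train_x, train_y, query, k=3):
    """Predict a number for `query` as the mean target of its k nearest labeled points."""
    # Sort the labeled examples by Euclidean distance to the query.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    # Average the targets of the k closest examples.
    nearest_targets = [y for _, y in ranked[:k]]
    return sum(nearest_targets) / len(nearest_targets)


# Toy labeled data (made up): one input feature, target equal to the feature.
xs = [[1.0], [2.0], [3.0], [10.0], [11.0]]
ys = [1.0, 2.0, 3.0, 10.0, 11.0]

print(knn_regress(xs, ys, [2.1], k=3))  # prints 2.0
```

The labeled pairs in `xs` and `ys` play the role of the supervisor's examples: the algorithm never fits an explicit model and simply looks up the closest known cases at prediction time.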


A novel spike-and-wave automatic detection in EEG signals

arXiv.org Machine Learning

Spike-and-wave discharge (SWD) pattern classification in electroencephalography (EEG) signals is a key problem in signal processing. It is particularly important to develop an automatic SWD detection method for long-term EEG recordings, since the task of marking the patterns manually is time-consuming, difficult and error-prone. This paper presents a new detection method with a low computational complexity that can be easily trained if standard medical protocols are respected. The detection procedure is as follows: First, each EEG signal is divided into several time segments and, for each time segment, the Morlet 1-D decomposition is applied. Then three parameters are extracted from the wavelet coefficients of each segment: scale (using a generalized Gaussian statistical model), variance and median. This is followed by a k-nearest neighbors (k-NN) classifier to detect the spike-and-wave pattern in each EEG channel from these three parameters. A total of 106 spike-and-wave and 106 non-spike-and-wave were used for training, while 69 new annotated EEG segments from six subjects were used for classification. In these circumstances, the proposed methodology achieved 100% accuracy. These results generate new research opportunities for the underlying causes of the so-called absence epilepsy in long-term EEG recordings.