Nearest Neighbor Methods
Measuring Classifier Model Performance
This is Day 28 of the #100DaysOfPython challenge. This post will take the work that was done yesterday in the blog post "First Look At Supervised Learning With Classification" and introduce the concept of training/test sets and output a graph for us to interpret the accuracy of the k-nearest neighbors classifier. At this stage, we will need to bring across our initial code from yesterday's post. The above code was introduced previous. From here on out, we want to create a training and test set for our classifier.
Prediction of the Facial Growth Direction is Challenging
Kaźmierczak, Stanisław, Juszka, Zofia, Vandevska-Radunovic, Vaska, Maal, Thomas JJ, Fudalej, Piotr, Mańdziuk, Jacek
Facial dysmorphology or malocclusion is frequently associated with abnormal growth of the face. The ability to predict facial growth (FG) direction would allow clinicians to prepare individualized therapy to increase the chance for successful treatment. Prediction of FG direction is a novel problem in the machine learning (ML) domain. In this paper, we perform feature selection and point the attribute that plays a central role in the abovementioned problem. Then we successfully apply data augmentation (DA) methods and improve the previously reported classification accuracy by 2.81%. Finally, we present the results of two experienced clinicians that were asked to solve a similar task to ours and show how tough is solving this problem for human experts.
Synthetic Data Does Not Reliably Protect Privacy, Researchers Claim
A new research collaboration between France and the UK casts doubt on growing industry confidence that synthetic data can resolve the privacy, quality and availability issues (among other issues) that threaten progress in the machine learning sector. Among several key points addressed, the authors assert that synthetic data modeled from real data retains enough of the genuine information as to provide no reliable protection from inference and membership attacks, which seek to deanonymize data and re-associate it with actual people. Furthermore, the individuals most at risk from such attacks, including those with critical medical conditions or high hospital bills (in the case of medical record anonymization) are, through the'outlier' nature of their condition, most likely to be re-identified by these techniques. 'Given access to a synthetic dataset, a strategic adversary can infer, with high confidence, the presence of a target record in the original data.' The paper also notes that differentially private synthetic data, which obscures the signature of individual records, does indeed protect individuals' privacy, but only by significantly crippling the usefulness of the information retrieval systems that use it.
Incremental Learning Techniques for Online Human Activity Recognition
Vakili, Meysam, Rezaei, Masoumeh
Unobtrusive and smart recognition of human activities using smartphones inertial sensors is an interesting topic in the field of artificial intelligence acquired tremendous popularity among researchers, especially in recent years. A considerable challenge that needs more attention is the real-time detection of physical activities, since for many real-world applications such as health monitoring and elderly care, it is required to recognize users' activities immediately to prevent severe damages to individuals' wellness. In this paper, we propose a human activity recognition (HAR) approach for the online prediction of physical movements, benefiting from the capabilities of incremental learning algorithms. We develop a HAR system containing monitoring software and a mobile application that collects accelerometer and gyroscope data and send them to a remote server via the Internet for classification and recognition operations. Six incremental learning algorithms are employed and evaluated in this work and compared with several batch learning algorithms commonly used for developing offline HAR systems. The Final results indicated that considering all performance evaluation metrics, Incremental K-Nearest Neighbors and Incremental Naive Bayesian outperformed other algorithms, exceeding a recognition accuracy of 95% in real-time.
Weighted Graph-Based Signal Temporal Logic Inference Using Neural Networks
Baharisangari, Nasim, Hirota, Kazuma, Yan, Ruixuan, Julius, Agung, Xu, Zhe
Extracting spatial-temporal knowledge from data is useful in many applications. It is important that the obtained knowledge is human-interpretable and amenable to formal analysis. In this paper, we propose a method that trains neural networks to learn spatial-temporal properties in the form of weighted graph-based signal temporal logic (wGSTL) formulas. For learning wGSTL formulas, we introduce a flexible wGSTL formula structure in which the user's preference can be applied in the inferred wGSTL formulas. In the proposed framework, each neuron of the neural networks corresponds to a subformula in a flexible wGSTL formula structure. We initially train a neural network to learn the wGSTL operators and then train a second neural network to learn the parameters in a flexible wGSTL formula structure. We use a COVID-19 dataset and a rain prediction dataset to evaluate the performance of the proposed framework and algorithms. We compare the performance of the proposed framework with three baseline classification methods including K-nearest neighbors, decision trees, and artificial neural networks. The classification accuracy obtained by the proposed framework is comparable with the baseline classification methods.
Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms
Machine learning algorithms have enabled computers to predict things by learning from previous data. The data storage and processing power are increasing rapidly, thus increasing machine learning and Artificial intelligence applications. Much of the work is done to improve the accuracy of the models built in the past, with little research done to determine the computational costs of machine learning acquisitions. In this paper, I will proceed with this later research work and will make a performance comparison of multi-threaded machine learning clustering algorithms. I will be working on Linear Regression, Random Forest, and K-Nearest Neighbors to determine the performance characteristics of the algorithms as well as the computation costs to the obtained results. I will be benchmarking system hardware performance by running these multi-threaded algorithms to train and test the models on a dataset to note the differences in performance matrices of the algorithms. In the end, I will state the best performing algorithms concerning the performance efficiency of these algorithms on my system.
Under-bagging Nearest Neighbors for Imbalanced Classification
Hang, Hanyuan, Cai, Yuchao, Yang, Hanfang, Lin, Zhouchen
In this paper, we propose an ensemble learning algorithm called \textit{under-bagging $k$-nearest neighbors} (\textit{under-bagging $k$-NN}) for imbalanced classification problems. On the theoretical side, by developing a new learning theory analysis, we show that with properly chosen parameters, i.e., the number of nearest neighbors $k$, the expected sub-sample size $s$, and the bagging rounds $B$, optimal convergence rates for under-bagging $k$-NN can be achieved under mild assumptions w.r.t.~the arithmetic mean (AM) of recalls. Moreover, we show that with a relatively small $B$, the expected sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k$ can be reduced simultaneously, especially when the data are highly imbalanced, which leads to substantially lower time complexity and roughly the same space complexity. On the practical side, we conduct numerical experiments to verify the theoretical results on the benefits of the under-bagging technique by the promising AM performance and efficiency of our proposed algorithm.
A Framework for Supervised Heterogeneous Transfer Learning using Dynamic Distribution Adaptation and Manifold Regularization
Rahman, Md Geaur, Islam, Md Zahidul
Transfer learning aims to learn classifiers for a target domain by transferring knowledge from a source domain. However, due to two main issues: feature discrepancy and distribution divergence, transfer learning can be a very difficult problem in practice. In this paper, we present a framework called TLF that builds a classifier for the target domain having only few labeled training records by transferring knowledge from the source domain having many labeled records. While existing methods often focus on one issue and leave the other one for the further work, TLF is capable of handling both issues simultaneously. In TLF, we alleviate feature discrepancy by identifying shared label distributions that act as the pivots to bridge the domains. We handle distribution divergence by simultaneously optimizing the structural risk functional, joint distributions between domains, and the manifold consistency underlying marginal distributions. Moreover, for the manifold consistency we exploit its intrinsic properties by identifying k nearest neighbors of a record, where the value of k is determined automatically in TLF. Furthermore, since negative transfer is not desired, we consider only the source records that are belonging to the source pivots during the knowledge transfer. We evaluate TLF on seven publicly available natural datasets and compare the performance of TLF against the performance of eleven state-of-the-art techniques. We also evaluate the effectiveness of TLF in some challenging situations. Our experimental results, including statistical sign test and Nemenyi test analyses, indicate a clear superiority of the proposed framework over the state-of-the-art techniques.
Convolutional Neural Networks - AI Summary
Research by Hubel and Wiesel [2,3] analyzed the striate cortex of cats and monkeys, revealing two key findings that would come to heavily influence Fukushima's work [1]. The next significant implementation of a convolution neural network was LeNet-5 proposed in 1999 by Le Cun et al. in their work "Object Recognition with Gradient Based Learning'' [4]. Their proposed network, LeNet-5 performed well on the MNIST data set and was shown to do better than state of the art (at the time) SVMs and K-nearest neighbor based approaches. Their final implementation outperformed other state of the art image classification algorithms with error rates which were 10% lower than its competitors on the ImageNet dataset. This application of a discrete convolution precisely represents local receptive fields observed by Hubel and Wiesel [2,3] and implemented in early CNNs by Fukushima and Le Cun [1,4]. Research by Hubel and Wiesel [2,3] analyzed the striate cortex of cats and monkeys, revealing two key findings that would come to heavily influence Fukushima's work [1]. The next significant implementation of a convolution neural network was LeNet-5 proposed in 1999 by Le Cun et al. in their work "Object Recognition with Gradient Based Learning'' [4].
A New Interpolation Approach and Corresponding Instance-Based Learning
Starting from finding approximate value of a function, introduces the measure of approximation-degree between two numerical values, proposes the concepts of "strict approximation" and "strict approximation region", then, derives the corresponding one-dimensional interpolation methods and formulas, and then presents a calculation model called "sum-times-difference formula" for high-dimensional interpolation, thus develops a new interpolation approach, that is, ADB interpolation. ADB interpolation is applied to the interpolation of actual functions with satisfactory results. Viewed from principle and effect, the interpolation approach is of novel idea, and has the advantages of simple calculation, stable accuracy, facilitating parallel processing, very suiting for high-dimensional interpolation, and easy to be extended to the interpolation of vector valued functions. Applying the approach to instance-based learning, a new instance-based learning method, learning using ADB interpolation, is obtained. The learning method is of unique technique, which has also the advantages of definite mathematical basis, implicit distance weights, avoiding misclassification, high efficiency, and wide range of applications, as well as being interpretable, etc. In principle, this method is a kind of learning by analogy, which and the deep learning that belongs to inductive learning can complement each other, and for some problems, the two can even have an effect of "different approaches but equal results" in big data and cloud computing environment. Thus, the learning using ADB interpolation can also be regarded as a kind of "wide learning" that is dual to deep learning.