Nearest Neighbor Methods
pGMM Kernel Regression and Comparisons with Boosted Trees
In this work, we demonstrate the advantage of the pGMM (``powered generalized min-max'') kernel in the context of (ridge) regression. In recent prior studies, the pGMM kernel has been extensively evaluated for classification tasks with logistic regression, support vector machines, and deep neural networks. In this paper, we provide an experimental study on ridge regression, to compare the pGMM kernel regression with the ordinary ridge linear regression as well as the RBF kernel ridge regression. Perhaps surprisingly, even without a tuning parameter (i.e., $p=1$ for the power parameter of the pGMM kernel), the pGMM kernel already performs well. Furthermore, by tuning the parameter $p$, this (deceptively simple) pGMM kernel even performs quite comparably to boosted trees. Boosting and boosted trees are very popular in machine learning practice. For regression tasks, practitioners typically use $L_2$ boost, i.e., boosting that minimizes the $L_2$ loss. Sometimes, for the purpose of robustness, $L_1$ boost might be a choice. In this study, we implement $L_p$ boost for $p\geq 1$ and include it in the package of ``Fast ABC-Boost''. Perhaps also surprisingly, the best performance (in terms of $L_2$ regression loss) is often attained at $p>2$, in some cases at $p\gg 2$. This phenomenon has already been demonstrated by Li et al. (UAI 2010) in the context of k-nearest neighbor classification using $L_p$ distances. In summary, the implementation of $L_p$ boost provides practitioners the additional flexibility of tuning boosting algorithms for potentially achieving better accuracy in regression applications.
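To make the $L_p$ loss concrete, the sketch below computes the loss $\sum_i |y_i - F_i|^p$ and its negative gradient with respect to each prediction $F_i$, which is the quantity a gradient-boosting step would fit a tree to. This is an illustration only: the function names `lp_loss` and `lp_negative_gradient` are hypothetical, and the code does not reproduce the actual update rule implemented in Fast ABC-Boost.

```python
def lp_loss(y_true, y_pred, p=2.0):
    """Sum of |y - F|^p over all samples (assumes p >= 1)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(y - f) ** p for y, f in zip(y_true, y_pred))


def lp_negative_gradient(y_true, y_pred, p=2.0):
    """Negative gradient of the L_p loss with respect to each prediction.

    For a residual r = y - F, d/dF |r|^p = -p * |r|^(p-1) * sign(r),
    so the negative gradient is p * |r|^(p-1) * sign(r).
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    grads = []
    for y, f in zip(y_true, y_pred):
        r = y - f
        sign = (r > 0) - (r < 0)  # +1, 0, or -1
        grads.append(p * abs(r) ** (p - 1) * sign)
    return grads


y = [1.0, 2.0]
f = [0.0, 4.0]
print(lp_loss(y, f, p=2))               # squared-error loss
print(lp_negative_gradient(y, f, p=2))  # 2 * residual
print(lp_negative_gradient(y, f, p=1))  # sign of the residual
```

With $p=2$ the negative gradient is proportional to the plain residual, and with $p=1$ it reduces to the sign of the residual; larger $p$ puts progressively more weight on the samples with the largest errors.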
Generating Pseudo-labels Adaptively for Few-shot Model-Agnostic Meta-Learning
Liu, Guodong; Wang, Tongling; Zhang, Shuoxi; He, Kun
Model-Agnostic Meta-Learning (MAML) is a famous few-shot learning method that has inspired many follow-up efforts, such as ANIL and BOIL. However, as an inductive method, MAML is unable to fully utilize the information in the query set, limiting its potential to achieve higher generality. To address this issue, we propose a simple yet effective method that generates pseudo-labels adaptively and can boost the performance of the MAML family. The proposed methods, dubbed Generative Pseudo-label based MAML (GP-MAML), GP-ANIL and GP-BOIL, leverage statistics of the query set to improve the performance on new tasks. Specifically, we adaptively add pseudo-labels and pick samples from the query set, then re-train the model using the picked query samples together with the support set. The GP series can also use information from the pseudo query set to re-train the network during meta-testing, whereas some transductive methods, such as Transductive Propagation Network (TPN), struggle to achieve this. Experiments show that our methods, GP-MAML, GP-ANIL and GP-BOIL, boost the performance of the corresponding models considerably and achieve competitive performance compared to state-of-the-art baselines.
What is Model Performance measurement in machine learning?
As discussed in the previous post, the K-Nearest Neighbor (KNN) algorithm is a simple method for classifying new data based on known values. However, we need a way to measure how good its predictions are, so that we can decide whether it is the right algorithm for our problem or whether the model needs adjusting. In machine learning these are known as model performance measures. Besides measuring how well a model predicts, they can be used to compare the performance of two algorithms, or to assess whether performance gets worse or better when the model is evaluated on new data. Different performance measures apply depending on the type of data. For KNN and similar classification algorithms we use accuracy as the performance metric: the number of correct predictions divided by the total number of predictions made. To calculate accuracy, the classifier is first fitted on the training data set. Accuracy can only be computed on data whose true labels are known, since every prediction has to be compared with the correct answer; it cannot be computed on unlabeled, unseen data.
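The accuracy definition above can be written as a few lines of Python. This is a minimal sketch: the helper name `accuracy` and the example labels are made up for illustration, and the true labels are assumed to come from a labeled set that was not used to fit the classifier.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: the known answers and what a classifier predicted.
known_labels = ["cat", "dog", "cat", "cat"]
predictions = ["cat", "dog", "dog", "cat"]
print(accuracy(known_labels, predictions))  # 3 of 4 predictions are correct
```

Comparing this number for two classifiers on the same labeled examples is the simplest way to use it for model comparison.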
K-Nearest Neighbors, Naive Bayes, and Decision Tree in 10 Minutes
Unlike linear models and SVM (see Part 1), some machine learning models are really hard to learn from their mathematical formulation. Fortunately, they can be understood by following, step by step, the process they execute on a small dummy dataset. This way, you can see what machine learning models do under the hood without the "math bottleneck". You will learn three more models in this story after Part 1: K-Nearest Neighbors (KNN), Naive Bayes, and Decision Tree. KNN is a non-generalizing machine learning model since it simply "remembers" all of its training data.
Back to the Basics: Revisiting Out-of-Distribution Detection Baselines
We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).
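The K-nearest-neighbor scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses Euclidean distance on small hand-written feature vectors, and the names `knn_ood_score` and `flag_ood`, the value of `k`, and the threshold are all hypothetical. In practice the feature vectors would be representations extracted from the trained classifier.

```python
import math


def knn_ood_score(query, train_feats, k=3):
    """Average Euclidean distance from `query` to its k nearest training features."""
    if not 1 <= k <= len(train_feats):
        raise ValueError("k must be between 1 and the number of training features")
    dists = sorted(math.dist(query, x) for x in train_feats)
    return sum(dists[:k]) / k


def flag_ood(query, train_feats, k=3, threshold=1.0):
    """Flag the query as out-of-distribution if its score exceeds the threshold."""
    return knn_ood_score(query, train_feats, k) > threshold


# Toy in-distribution features (assumed to come from the classifier's embedding layer).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(flag_ood((0.5, 0.5), train, k=2, threshold=2.0))    # close to the training data
print(flag_ood((10.0, 10.0), train, k=2, threshold=2.0))  # far from the training data
```

The threshold would normally be chosen on held-out in-distribution data, for example as a high percentile of the scores observed there.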
Memory Safe Computations with XLA Compiler
Artemev, Artem; Roeder, Tilman; van der Wilk, Mark
Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.
Introduction to Machine Learning: K Nearest Neighbors (KNN) - PythonAlgos
K Nearest Neighbors or KNN is a standard Machine Learning algorithm used for classification. In KNN, we plot already labeled points with their label and then define decision boundaries based on the value of the hyperparameter "K". Hyperparameter just means a parameter that we control and can use for tuning. "K" is used to represent how many of the nearest neighbors we should take into account when determining the class of a new point. In this post we'll cover how to do KNN on two datasets, one contrived sample dataset and one more realistic dataset about wine from sklearn.
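As a rough illustration of how "K" enters the prediction, here is a minimal pure-Python sketch of KNN classification by majority vote. It assumes Euclidean distance and a tiny made-up dataset; the function name `knn_predict` is hypothetical, and this is not the sklearn-based code the post goes on to use.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Indices of the labeled points, ordered from nearest to farthest.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    # On a tie, most_common keeps the label that was seen first (the nearer neighbor).
    return votes.most_common(1)[0][0]


# Contrived sample dataset: class "A" near the origin, class "B" near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict((0.2, 0.1), points, labels, k=3))
print(knn_predict((5.0, 4.5), points, labels, k=3))
```

A small K makes the prediction sensitive to individual points, while a large K smooths the decision boundary, which is why K is usually tuned on labeled data.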
Enhancing Stochastic Petri Net-based Remaining Time Prediction using k-Nearest Neighbors
Vandenabeele, Jarne; Vermaut, Gilles; Peeperkorn, Jari; De Weerdt, Jochen
Reliable remaining time prediction of ongoing business processes is a highly relevant topic. One example is order delivery, a key competitive factor in e.g. retailing as it is a main driver of customer satisfaction. For realising timely delivery, an accurate prediction of the remaining time of the delivery process is crucial. Within the field of process mining, a wide variety of remaining time prediction techniques have already been proposed. In this work, we extend remaining time prediction based on stochastic Petri nets with generally distributed transitions with k-nearest neighbors. The k-nearest neighbors algorithm is performed on simple vectors storing the time passed to complete previous activities. By only taking a subset of instances, a more representative and stable stochastic Petri Net is obtained, leading to more accurate time predictions. We discuss the technique and its basic implementation in Python and use different real world data sets to evaluate the predictive power of our extension. These experiments show clear advantages in combining both techniques with regard to predictive power.
Predicting the Geoeffectiveness of CMEs Using Machine Learning
Pricopi, Andreea-Clara; Paraschiv, Alin Razvan; Besliu-Ionescu, Diana; Marginean, Anca-Nicoleta
ABSTRACT
Coronal mass ejections (CMEs) are the most geoeffective space weather phenomena, being associated with large geomagnetic storms that have the potential to cause telecommunication disturbances, satellite network disruptions, and power grid damage and failures. Thus, considering these storms' potential effects on human activities, accurate forecasts of the geoeffectiveness of CMEs are paramount. This work focuses on experimenting with different machine learning methods trained on white-light coronagraph datasets of close-to-Sun CMEs, to estimate whether such a newly erupting ejection has the potential to induce geomagnetic activity. We developed binary classification models using logistic regression, K-Nearest Neighbors, Support Vector Machines, feed forward artificial neural networks, as well as ensemble models. At this time, we limited our forecast to exclusively use solar onset parameters, to ensure extended warning times. We discuss the main challenges of this task, namely the extreme imbalance between the number of geoeffective and ineffective events in our dataset, along with their numerous similarities and the limited number of available variables. We show that even in such conditions, adequate hit rates can be achieved with these models.
INTRODUCTION
The purpose of this work is to develop a machine learning (ML) based model that can predict whether a coronal mass ejection (CME) will be geoeffective, using only numerical solar parameters as input. Coronal mass ejections are solar eruptive events whose magnetically charged particles can, directly or indirectly, under certain circumstances, reach Earth and cause geomagnetic storms (GSs), i.e., be geoeffective.
These storms represent perturbations in the Earth's magnetic field, which have the potential to lead to failure of and/or damage to electrical systems and grids, power outages, navigation errors, radio signal perturbations, significant exposure to dangerous radiation for astronauts during space missions, etc. Given the potential negative impacts of such storms, predicting their occurrence is paramount for safeguarding human technology (Schwenn 2006; Pulkkinen 2007; Council 2013; Vourlidas et al. 2019; Temmer 2021). The intensity of the storms can be measured by various geomagnetic indices such as Ap, Kp, AE, PC or Dst (see Lockwood 2013, and references therein). Herein, we have chosen to use the values of the Dst index (Sugiura 1964) to establish whether the magnetic field perturbations do, in fact, manifest as storms. This index is calculated using four geomagnetic stations situated at low latitudes. Depending on its value, it can be established whether these perturbations are associated with geomagnetic storms or not. In terms of storm intensity, one of the most popular classifications that takes into consideration the minimum value of the Dst index is that of Gonzalez et al. (1994).
Non-Parametric Model
Non-parametric machine learning algorithms make few assumptions about the data and instead base their predictions on patterns observed in similar instances. For example, a popular non-parametric machine learning algorithm is K-Nearest Neighbor, which looks up the training patterns most similar to a new instance. The only assumption it makes about the data set is that the most similar training patterns are most likely to have a similar result. While non-parametric machine learning algorithms are often slower and require large amounts of data, they are rather flexible because they minimize the assumptions they make about the data.