Nearest Neighbor Methods
DNA: Denoised Neighborhood Aggregation for Fine-grained Category Discovery
An, Wenbin, Tian, Feng, Shi, Wenkai, Chen, Yan, Zheng, Qinghua, Wang, QianYing, Chen, Ping
Discovering fine-grained categories from coarsely labeled data is a practical and challenging task, which can bridge the gap between the demand for fine-grained analysis and the high annotation cost. Previous works mainly focus on instance-level discrimination to learn low-level features, but ignore semantic similarities between data, which may prevent these models from learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes semantic structures of data into the embedding space. Specifically, we retrieve k-nearest neighbors of a query as its positive keys to capture semantic similarities between data and then aggregate information from the neighbors to learn compact cluster representations, which can make fine-grained categories more separable. However, the retrieved neighbors can be noisy and contain many false-positive keys, which can degrade the quality of learned embeddings. To cope with this challenge, we propose three principles to filter out these false neighbors for better representation learning. Furthermore, we theoretically justify that the learning objective of our framework is equivalent to a clustering loss, which can capture semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method can retrieve more accurate neighbors (21.31% accuracy improvement) and outperform state-of-the-art models by a large margin (average 9.96% improvement on three metrics). Our code and data are available at https://github.com/Lackel/DNA.
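The neighbor-retrieval step described in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity over precomputed embeddings; the function names `cosine` and `retrieve_neighbors` are hypothetical and are not taken from the DNA repository. The three filtering principles and the training objective are not specified in the abstract and are deliberately left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_neighbors(query_idx, embeddings, k):
    """Return the indices of the k embeddings most similar to the query.

    The query itself is excluded. Ties are broken by the lower index.
    Cosine similarity is an assumption made for this sketch.
    """
    query = embeddings[query_idx]
    scored = [
        (cosine(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_idx
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings: two points near the x-axis, two near the y-axis.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positive_keys = retrieve_neighbors(0, toy_embeddings, k=1)
```

In a contrastive setup, the returned indices would serve as the positive keys for the query; some of them may belong to a different fine-grained category, which is the noise the abstract refers to.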
Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs
In software development, code comments play a crucial role in enhancing code comprehension and collaboration. This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful." We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process. We address this task by incorporating generated code and comment pairs. The initial dataset comprised 9048 pairs of code and comments written in C, labeled as either Useful or Not Useful. To augment this dataset, we sourced an additional 739 lines of code-comment pairs and generated labels using a Large Language Model Architecture, specifically BERT. The primary objective was to build classification models that can effectively differentiate between useful and not useful code comments. Various machine learning algorithms were employed, including Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Gradient Boosting, Random Forest, and a Neural Network. Each algorithm was evaluated using precision, recall, and F1-score metrics, both with the original seed dataset and the augmented dataset. This study showcases the potential of generative AI for enhancing binary code comment quality classification models, providing valuable insights for software developers and researchers in the field of natural language processing and software engineering.
Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices
Voice conversion aims to convert source speech into a target voice using recordings of the target speaker as a reference. Newer models are producing increasingly realistic output. But what happens when models are fed with non-standard data, such as speech from a user with a speech impairment? We investigate how a recent voice conversion model performs on non-standard downstream voice conversion tasks. We use a simple but robust approach called k-nearest neighbors voice conversion (kNN-VC). We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion. The latter involves converting to a target voice specified through a text description, e.g. "a young man with a high-pitched voice". Compared to an established baseline, we find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion. Results are more mixed for the musical instrument and text-to-voice conversion tasks. E.g., kNN-VC works well on some instruments like drums but not on others. Nevertheless, this shows that voice conversion models - and kNN-VC in particular - are increasingly applicable in a range of non-standard downstream tasks. But there are still limitations when samples are very far from the training distribution. Code, samples, trained models: https://rf5.github.io/sacair2023-knnvc-demo/.
Divorce Prediction with Machine Learning: Insights and LIME Interpretability
Divorce is one of the most common social issues in developed countries such as the United States. Almost 50% of recent marriages end in involuntary divorce or separation. Although people vary, even over time, and an incident like divorce may not interrupt an individual's daily activities, divorce still has a severe effect on the individual's mental health and personal life. Within the scope of this research, divorce prediction was carried out on a dataset named the 'divorce predictor dataset' to classify married and divorced people using six different machine learning algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naïve Bayes (NB), and Support Vector Machines (SVM). Preliminary computational results show that algorithms such as SVM, KNN, and LDA can perform this task with an accuracy of 98.57%. An additional novel contribution of this work is the detailed and comprehensive explanation of prediction probabilities using Local Interpretable Model-Agnostic Explanations (LIME). Using LIME to analyze test results illustrates the possibility of differentiating between divorced and married couples. Finally, we have developed a divorce predictor app based on the ten most important features that potentially affect couples' decisions about divorce; such tools can be used by anyone to assess the condition of their relationship.
Flexible K Nearest Neighbors Classifier: Derivation and Application for Ion-mobility Spectrometry-based Indoor Localization
The K Nearest Neighbors (KNN) classifier is widely used in many fields such as fingerprint-based localization or medicine. It determines the class membership of an unlabelled sample based on the class memberships of the K labelled samples, the so-called nearest neighbors, that are closest to the unlabelled sample. The choice of K has been the topic of various studies and proposed KNN-variants. Yet no variant has been proven to outperform all other variants. In this paper a new KNN-variant is proposed which ensures that the K nearest neighbors are indeed close to the unlabelled sample and finds K along the way. The proposed algorithm is tested and compared to the standard KNN in theoretical scenarios and for indoor localization based on ion-mobility spectrometry fingerprints. It achieves a higher classification accuracy than the KNN in the tests, while having the same computational demand.
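For reference, the standard fixed-K rule that this abstract uses as its baseline can be sketched as follows. This is the textbook majority-vote classifier with Euclidean distance, not the flexible variant proposed in the paper, whose rule for choosing K is not given in the abstract. The name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, sample, k):
    """Classify `sample` by majority vote among its k nearest labelled points.

    Distances are Euclidean. If two classes tie on votes, the class whose
    vote was encountered first (i.e. closest to the sample) wins.
    """
    ranked = sorted(
        (math.dist(point, sample), idx) for idx, point in enumerate(train_points)
    )
    votes = Counter(train_labels[idx] for _, idx in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage: two well-separated groups of labelled points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["a", "a", "b", "b"]
predicted = knn_predict(points, labels, (0.2, 0.1), k=3)
```

Note that with a fixed K the vote can include points far from the sample (here the third neighbor lies in the other group), which is the weakness the abstract says the proposed variant addresses.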
Algebraic and Geometric Models for Space Networking
Bernardoni, William, Cardona, Robert, Cleveland, Jacob, Curry, Justin, Green, Robert, Heller, Brian, Hylton, Alan, Lam, Tung, Kassouf-Short, Robert
In this paper we introduce some new algebraic and geometric perspectives on networked space communications. Our main contribution is a novel definition of a time-varying graph (TVG), defined in terms of a matrix with values in subsets of the real line P(R). We leverage semi-ring properties of P(R) to model multi-hop communication in a TVG using matrix multiplication and a truncated Kleene star. This leads to novel statistics on the communication capacity of TVGs called lifetime curves, which we generate for large samples of randomly chosen STARLINK satellites, whose connectivity is modeled over day-long simulations. Determining when a large subsample of STARLINK is temporally strongly connected is further analyzed using novel metrics introduced here that are inspired by topological data analysis (TDA). To better model networking scenarios between the Earth and Mars, we introduce various semi-rings capable of modeling propagation delay as well as protocols common to Delay Tolerant Networking (DTN), such as store-and-forward. Finally, we illustrate the applicability of zigzag persistence for featurizing different space networks and demonstrate the efficacy of K-Nearest Neighbors (KNN) classification for distinguishing Earth-Mars and Earth-Moon satellite systems using time-varying topology alone.
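The matrix view of multi-hop communication can be illustrated with a small sketch. It makes several assumptions that the abstract does not confirm: union plays the role of addition and intersection the role of multiplication on P(R), finite sets of discrete time steps stand in for subsets of the real line, and the truncated Kleene star is taken as I + A + A^2 + ... + A^hops. All function names are hypothetical. Propagation delay is ignored, so a two-hop path exists only at times when both links are up simultaneously.

```python
def mat_mul(A, B):
    """Matrix product where 'times' is set intersection and 'plus' is set union."""
    n = len(A)
    return [
        [set().union(*(A[i][k] & B[k][j] for k in range(n))) for j in range(n)]
        for i in range(n)
    ]


def mat_add(A, B):
    """Entrywise union of two matrices of sets."""
    n = len(A)
    return [[A[i][j] | B[i][j] for j in range(n)] for i in range(n)]


def truncated_star(A, hops, universe):
    """Compute I + A + A^2 + ... + A^hops over the (union, intersection) semiring.

    The identity matrix has the whole time universe on the diagonal and the
    empty set elsewhere. Entry (i, j) of the result holds the times at which
    j is reachable from i using at most `hops` hops.
    """
    n = len(A)
    identity = [
        [set(universe) if i == j else set() for j in range(n)] for i in range(n)
    ]
    total, power = identity, identity
    for _ in range(hops):
        power = mat_mul(power, A)
        total = mat_add(total, power)
    return total


# Three nodes over time steps 0..3: link 0->1 is up at {0, 1, 2}, link 1->2 at {1, 2, 3}.
times = {0, 1, 2, 3}
links = [
    [set(), {0, 1, 2}, set()],
    [set(), set(), {1, 2, 3}],
    [set(), set(), set()],
]
reach = truncated_star(links, hops=2, universe=times)
```

In this toy example node 2 is reachable from node 0 only through node 1, and only at the times when both links overlap.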
5G Network Slicing: Analysis of Multiple Machine Learning Classifiers
Malkoc, Mirsad, Kholidy, Hisham A.
The division of one physical 5G communications infrastructure into several virtual network slices with distinct characteristics such as bandwidth, latency, reliability, security, and service quality is known as 5G network slicing. Each slice is a separate logical network that meets the requirements of specific services or use cases, such as virtual reality, gaming, autonomous vehicles, or industrial automation. The network slice can be adjusted dynamically to meet the changing demands of the service, resulting in a more cost-effective and efficient approach to delivering diverse services and applications over a shared infrastructure. This paper assesses various machine learning techniques, including the logistic regression model, linear discriminant model, k-nearest neighbors model, decision tree model, random forest model, SVC model, BernoulliNB model, and GaussianNB model, to investigate the accuracy and precision of each model in detecting network slices. The report also gives an overview of 5G network slicing.
Compressor-Based Classification for Atrial Fibrillation Detection
Markov, Nikita, Ushenin, Konstantin, Bozhko, Yakov, Solovyova, Olga
Atrial fibrillation (AF) is one of the most common arrhythmias with challenging public health implications. Therefore, automatic detection of AF episodes on ECG is one of the essential tasks in biomedical engineering. In this paper, we applied the recently introduced method of compressor-based text classification with the gzip algorithm for AF detection (binary classification between heart rhythms). We investigated the normalized compression distance applied to RR-interval and ΔRR-interval sequences (ΔRR-interval is the difference between subsequent RR-intervals). Here, the configuration of the k-nearest neighbour classifier, an optimal window length, and the choice of data types for compression were analyzed. We achieved good classification results while learning on the full MIT-BIH Atrial Fibrillation database, close to the best specialized AF detection algorithms (avg. sensitivity = 97.1%, avg. specificity = 91.7%, best sensitivity of 99.8%, best specificity of 97.6% with fivefold cross-validation). In addition, we evaluated the classification performance under the few-shot learning setting. Our results suggest that gzip compression-based classification, originally proposed for texts, is suitable for biomedical data and quantized continuous stochastic sequences in general.
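The compressor-based pipeline in this abstract combines the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, with a k-nearest-neighbour vote. A minimal sketch follows. The serialization of RR intervals as space-separated integers, the window contents, and the names `encode`, `ncd`, and `classify` are assumptions made for illustration; the paper's actual data types and window length are not given in the abstract.

```python
import gzip
from collections import Counter


def encode(intervals):
    """Serialize a window of RR intervals (hypothetical format: integers in ms)."""
    return " ".join(str(v) for v in intervals).encode("ascii")


def compressed_len(data):
    """Length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(sample, labelled_windows, k=1):
    """Label `sample` by majority vote among the k windows with the smallest NCD.

    `labelled_windows` is a list of (bytes, label) pairs.
    """
    ranked = sorted(labelled_windows, key=lambda item: ncd(sample, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Synthetic windows: a steady rhythm and a scattered one (not real ECG data).
steady = encode([800] * 100)
scattered = encode([(i * 940) % 997 + 300 for i in range(100)])
reference = [(steady, "normal"), (scattered, "irregular")]
label = classify(encode([800] * 90), reference, k=1)
```

The intuition is that concatenating two similar sequences adds little to the compressed length, so their NCD is small; a steady window therefore lands closer to the steady reference than to the scattered one.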
Twin Neural Network Improved k-Nearest Neighbor Regression
Twin neural network regression is trained to predict differences between regression targets rather than the targets themselves. A solution to the original regression problem can be obtained by ensembling predicted differences between the targets of an unknown data point and multiple known anchor data points. Choosing the anchors to be the nearest neighbors of the unknown data point leads to a neural network-based improvement of k-nearest neighbor regression. This algorithm is shown to outperform both neural networks and k-nearest neighbor regression on small to medium-sized data sets.
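The ensembling step described here can be sketched as follows: for each of the k nearest anchors, add the predicted difference to the anchor's known target, then combine the estimates. This sketch assumes one-dimensional inputs and a plain mean as the ensembling rule; `diff_model` is a hypothetical stand-in for the trained twin network and is replaced by a hand-written function in the toy usage.

```python
def tnn_knn_predict(x, anchor_inputs, anchor_targets, diff_model, k):
    """Estimate the target at x from its k nearest anchors.

    `diff_model(x, a)` is assumed to predict target(x) - target(a).
    Each anchor yields the estimate diff_model(x, a) + target(a);
    the estimates are averaged.
    """
    nearest = sorted(
        range(len(anchor_inputs)), key=lambda i: abs(anchor_inputs[i] - x)
    )[:k]
    estimates = [
        diff_model(x, anchor_inputs[i]) + anchor_targets[i] for i in nearest
    ]
    return sum(estimates) / len(estimates)


# Toy data following target = 2 * input, with an exact difference model.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [0.0, 2.0, 4.0, 6.0]


def exact_diff(x, a):
    """Stand-in for a trained twin network on this toy problem."""
    return 2.0 * (x - a)


def zero_diff(x, a):
    """A model that predicts no difference, which reduces to plain k-NN regression."""
    return 0.0


twin_estimate = tnn_knn_predict(1.2, inputs, targets, exact_diff, k=2)
plain_knn_estimate = tnn_knn_predict(1.2, inputs, targets, zero_diff, k=2)
```

With the zero difference model the function returns the ordinary k-nearest-neighbor average, which makes the relationship between the two methods explicit: the twin network corrects each neighbor's target by the predicted offset.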
Exploring Learned Representations of Neural Networks with Principal Component Analysis
Harlev, Amit, Engel, Andrew, Stinis, Panos, Chiang, Tony
Understanding feature representation for deep neural networks (DNNs) remains an open question within the general field of explainable AI. We use principal component analysis (PCA) to study the performance of a k-nearest neighbors classifier (k-NN), nearest class-centers classifier (NCC), and support vector machines on the learned layer-wise representations of a ResNet-18 trained on CIFAR-10. We show that in certain layers, as little as 20% of the intermediate feature-space variance is necessary for high-accuracy classification and that across all layers, the first ~100 PCs completely determine the performance of the k-NN and NCC classifiers. We relate our findings to neural collapse and provide partial evidence for the related phenomenon of intermediate neural collapse. Our preliminary work provides three distinct yet interpretable surrogate models for feature representation, with an affine linear model performing best. We also show that leveraging several surrogate models affords us a clever method to estimate where neural collapse may initially occur within the DNN.