Nearest Neighbor Methods
Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis
Mowri, Rawshan Ara, Siddula, Madhuri, Roy, Kaushik
Ransomware has appeared as one of the major global threats in recent days. The alarming increasing rate of ransomware attacks and new ransomware variants intrigue the researchers to constantly examine the distinguishing traits of ransomware and refine their detection strategies. Application Programming Interface (API) is a way for one program to collaborate with another; API calls are the medium by which they communicate. Ransomware uses this strategy to interact with the OS and makes a significantly higher number of calls in different sequences to ask for taking action. This research work utilizes the frequencies of different API calls to detect and classify ransomware families. First, a Web-Crawler is developed to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families. By extracting different frequencies of 68 API calls, we develop our dataset in the first phase of the two-phase feature engineering process. After selecting the most significant features in the second phase of the feature engineering process, we deploy six Supervised Machine Learning models: Na"ive Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. Then, the performances of all the classifiers are compared to select the best model. The results reveal that Logistic Regression can efficiently classify ransomware into their corresponding families securing 99.15% overall accuracy. Finally, instead of relying on the 'Black box' characteristic of the Machine Learning models, we present the post-hoc analysis of our best-performing model using 'SHapley Additive exPlanations' or SHAP values to ascertain the transparency and trustworthiness of the model's prediction.
CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification
Schoch, Stephanie, Xu, Haifeng, Ji, Yangfeng
Data valuation, or the valuation of individual datum contributions, has seen growing interest in machine learning due to its demonstrable efficacy for tasks such as noisy label detection. In particular, due to the desirable axiomatic properties, several Shapley value approximation methods have been proposed. In these methods, the value function is typically defined as the predictive accuracy over the entire development set. However, this limits the ability to differentiate between training instances that are helpful or harmful to their own classes. Intuitively, instances that harm their own classes may be noisy or mislabeled and should receive a lower valuation than helpful instances. In this work, we propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions. Our theoretical analysis shows the proposed value function is (essentially) the unique function that satisfies two desirable properties for evaluating data values in classification. Further, our experiments on two benchmark evaluation tasks (data removal and noisy label detection) and four classifiers demonstrate the effectiveness of CS-Shapley over existing methods. Lastly, we evaluate the "transferability" of data values estimated from one classifier to others, and our results suggest Shapley-based data valuation is transferable for application across different models.
Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning
Yuan, Mingqi, Li, Bo, Jin, Xin, Zeng, Wenjun
Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recent approaches proposed to leverage intrinsic rewards to improve exploration, such as novelty-based exploration and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computation-efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the R\'enyi divergence-based visitation discrepancy between episodes. To make efficient divergence estimation, a k-nearest neighbor estimator is utilized with a randomly-initialized state encoder. Finally, the REVD is tested on Atari games and PyBullet Robotics Environments. Extensive experiments demonstrate that REVD can significantly improves the sample efficiency of reinforcement learning algorithms and outperforms the benchmarking methods.
Hyperbolic Centroid Calculations for Text Classification
Gerek, Aydฤฑn, Ferahlar, Cรผneyt, Sert, Bilge ลipal, Yรผney, Mehmet Can, Taลdemir, Onur, Kalafat, Zeynep Billur, Kelkit, Mert, Ganiz, Murat Can
A new development in NLP is the construction of hyperbolic word embeddings. As opposed to their Euclidean counterparts, hyperbolic embeddings are represented not by vectors, but by points in hyperbolic space. This makes the most common basic scheme for constructing document representations, namely the averaging of word vectors, meaningless in the hyperbolic setting. We reinterpret the vector mean as the centroid of the points represented by the vectors, and investigate various hyperbolic centroid schemes and their effectiveness at text classification.
Non-Parametric Domain Adaptation for End-to-End Speech Translation
Du, Yichao, Wang, Weizhi, Zhang, Zhirui, Chen, Boxing, Xu, Tong, Xie, Jun, Chen, Enhong
End-to-End Speech Translation (E2E-ST) has received increasing attention due to the potential of its less error propagation, lower latency, and fewer parameters. However, the effectiveness of neural-based approaches to this task is severely limited by the available training corpus, especially for domain adaptation where in-domain triplet training data is scarce or nonexistent. In this paper, we propose a novel non-parametric method that leverages domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system. To this end, we first incorporate an additional encoder into the pre-trained E2E-ST model to realize text translation modelling, and then unify the decoder's output representation for text and speech translation tasks by reducing the correspondent representation mismatch in available triplet training data. During domain adaptation, a k-nearest-neighbor (kNN) classifier is introduced to produce the final translation distribution using the external datastore built by the domain-specific text translation corpus, while the universal output representation is adopted to perform a similarity search. Experiments on the Europarl-ST benchmark demonstrate that when in-domain text translation data is involved only, our proposed approach significantly improves baseline by 12.82 BLEU on average in all translation directions, even outperforming the strong in-domain fine-tuning method.
kNN-Prompt: Nearest Neighbor Zero-Shot Inference
Shi, Weijia, Michael, Julian, Gururangan, Suchin, Zettlemoyer, Luke
Retrieval-augmented language models (LMs) use non-parametric memory to substantially outperform their non-retrieval counterparts on perplexity-based evaluations, but it is an open question whether they achieve similar gains in few- and zero-shot end-task accuracy. We extensively study one such model, the k-nearest neighbor LM (kNN-LM), showing that the gains marginally transfer. The main challenge is to achieve coverage of the verbalizer tokens that define the different end-task class labels. To address this challenge, we also introduce kNN-Prompt, a simple and effective kNN-LM with automatically expanded fuzzy verbalizers (e.g. to expand terrible to also include silly and other task-specific synonyms for sentiment classification). Across nine diverse end-tasks, using kNN-Prompt with GPT-2 large yields significant performance boosts over strong zero-shot baselines (13.4% absolute improvement over the base LM on average). We also show that other advantages of non-parametric augmentation hold for end tasks; kNN-Prompt is effective for domain adaptation with no further training, and gains increase with the size of the retrieval model.
Applications of K Nearest Neighbor algorithm part1(Artificial Intelligence)
Abstract: Demands for minimum parameter setup in machine learning models are desirable to avoid time-consuming optimization processes. The k-Nearest Neighbors is one of the most effective and straightforward models employed in numerous problems. Despite its well-known performance, it requires the value of k for specific data distribution, thus demanding expensive computational efforts. This paper proposes a k-Nearest Neighbors classifier that bypasses the need to define the value of k. The model computes the k value adaptively considering the data distribution of the training set.
Applications of K Nearest Neighbor algorithm part2(Artificial Intelligence)
Abstract: Candidate generation is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. Since candidate generation is the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach for candidate generation is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach this can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user's multiple interests.
Quantum Semi-Supervised Learning with Quantum Supremacy
Quantum machine learning promises to efficiently solve important problems. There are two persistent challenges in classical machine learning: the lack of labeled data, and the limit of computational power. We propose a novel framework that resolves both issues: quantum semi-supervised learning. Moreover, we provide a protocol in systematically designing quantum machine learning algorithms with quantum supremacy, which can be extended beyond quantum semi-supervised learning. In the meantime, we show that naive quantum matrix product estimation algorithm outperforms the best known classical matrix multiplication algorithm. We showcase two concrete quantum semi-supervised learning algorithms: a quantum self-training algorithm named the propagating nearest-neighbor classifier, and the quantum semi-supervised K-means clustering algorithm. By doing time complexity analysis, we conclude that they indeed possess quantum supremacy.