Goto

Collaborating Authors

 Performance Analysis


Host-based anomaly detection using Eigentraces feature extraction and one-class classification on system call trace data

arXiv.org Machine Learning

This paper proposes a methodology for host-based anomaly detection using a semi-supervised algorithm namely one-class classifier combined with a PCA-based feature extraction technique called Eigentraces on system call trace data. The one-class classification is based on generating a set of artificial data using a reference distribution and combining the target class probability function with artificial class density function to estimate the target class density function through the Bayes formulation. The benchmark dataset, ADFA-LD, is employed for the simulation study. ADFA-LD dataset contains thousands of system call traces collected during various normal and attack processes for the Linux operating system environment. In order to pre-process and to extract features, windowing on the system call trace data followed by the principal component analysis which is named as Eigentraces is implemented. The target class probability function is modeled separately by Radial Basis Function neural network and Random Forest machine learners for performance comparison purposes. The simulation study showed that the proposed intrusion detection system offers high performance for detecting anomalies and normal activities with respect to a set of well-accepted metrics including detection rate, accuracy, and missed and false alarm rates.


Making Learners (More) Monotone

arXiv.org Machine Learning

Learning performance can show non-monotonic behavior. That is, more data does not necessarily lead to better models, even on average. We propose three algorithms that take a supervised learning model and make it perform more monotone. We prove consistency and monotonicity with high probability, and evaluate the algorithms on scenarios where non-monotone behaviour occurs. Our proposed algorithm $\text{MT}_{\text{HT}}$ makes less than $1\%$ non-monotone decisions on MNIST while staying competitive in terms of error rate compared to several baselines.


Matrix Normal PCA for Interpretable Dimension Reduction and Graphical Noise Modeling

arXiv.org Machine Learning

Principal component analysis (PCA) is one of the most widely used dimension reduction and multivariate statistical techniques. From a probabilistic perspective, PCA seeks a low-dimensional representation of data in the presence of independent identical Gaussian noise. Probabilistic PCA (PPCA) and its variants have been extensively studied for decades. Most of them assume the underlying noise follows a certain independent identical distribution. However, the noise in the real world is usually complicated and structured. To address this challenge, some non-linear variants of PPCA have been proposed. But those methods are generally difficult to interpret. To this end, we propose a powerful and intuitive PCA method (MN-PCA) through modeling the graphical noise by the matrix normal distribution, which enables us to explore the structure of noise in both the feature space and the sample space. MN-PCA obtains a low-rank representation of data and the structure of noise simultaneously. And it can be explained as approximating data over the generalized Mahalanobis distance. We develop two algorithms to solve this model: one maximizes the regularized likelihood, the other exploits the Wasserstein distance, which is more robust. Extensive experiments on various data demonstrate their effectiveness.


Machine Learning Model Best Practices

#artificialintelligence

The number of shiny models out there can be overwhelming, which means a lot of times people fallback on a few they trust the most, and use them on all new problems. This can lead to sub-optimal results. Today we're going to learn how to quickly and efficiently narrow down the space of available models to find those that are most likely to perform best on your problem type. We'll also see how we can keep track of our models' performances using Weights and Biases and compare them. Unlike Lord of the Rings, in machine learning there is no one ring (model) to rule them all.


Understanding Classification Thresholds Using Isocurves

#artificialintelligence

You are in a conference room, presenting your work on a classification problem. You demonstrate all the magic you performed with feature engineering, predictor selection, model selection, hyperparameter tuning, and ensembling. You conclude your presentation with the predicted probabilities and the ROC curve, and a fantastic AUC. You sit down confident in a job well done. And the manager says, "What am I supposed to do with all this probability and ROC stuff? I just want to know if I should do x or y."


Large expert-curated database for benchmarking document similarity detection in biomedical literature search

#artificialintelligence

Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations.


DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles

arXiv.org Machine Learning

Selecting and combining the outlier scores of different base detectors used within outlier ensembles can be quite challenging in the absence of ground truth. In this paper, an unsupervised outlier detector combination framework called DCSO is proposed, demonstrated and assessed for th e dynamic selection of most competent base detectors, with an emphasis on data locality. Th e proposed DCSO framework first defines the local region of a tes t instance by its k nearest neighbors and then identifies the top-performing base detectors within t he local region. Experimental results on ten benchmark datasets demonstrate that DCSO provides consistent performance i mprovement over existing stati c combination approaches in mining outlying objects. To facilitate interpretability and reliability of the proposed method, DCSO i s analyzed using both theoretica l frameworks and visualization techniques, and presented alongside empirical parameter setting instructions that can be used to improve the overall performance.


Schema Matching using Machine Learning

arXiv.org Artificial Intelligence

--Schema Matching is a method of finding attributes that are either similar to each other linguistically or represent the same information. In this project, we take a hybrid approach at solving this problem by making use of both the provided data and the schema name to perform one to one schema matching and introduce creation of a global dictionary to achieve one to many schema matching. We experiment with two methods of one to one matching and compare both based on their F-scores, precision and recall. We also compare our method with the ones previously suggested and highlight differences between them. The schema of a database is the skeleton that represents its logical view. In other words, a schema describes the data contained in a database, with the name of each attribute in a relation and its data type contained in the relation's schema. Any time the different tables maintained by a peer management system need to be linked to each other or when one branch of a company is closed down and all its data needs to be redistributed to the database maintained by other branches or when one company takes over another company and all data of the child comapny needs to be integrated with that of the parent company, the need to match schemas of multiple relations with each other arises. Consider the Tables I and II. Here, the ideal schema mappings would be: FName LName Name, Major Maj Stream and Address House No St Name .


Coordination Event Detection and Initiator Identification in Time Series Data

arXiv.org Artificial Intelligence

Behavior initiation is a form of leadership and is an important aspect of social organization that affects the processes of group formation, dynamics, and decision-making in human societies and other social animal species. In this work, we formalize the "Coordination Initiator Inference Problem" and propose a simple yet powerful framework for extracting periods of coordinated activity and determining individuals who initiated this coordination, based solely on the activity of individuals within a group during those periods. The proposed approach, given arbitrary individual time series, automatically (1) identifies times of coordinated group activity, (2) determines the identities of initiators of those activities, and (3) classifies the likely mechanism by which the group coordination occurred, all of which are novel computational tasks. We demonstrate our framework on both simulated and real-world data: trajectories tracking of animals as well as stock market data. Our method is competitive with existing global leadership inference methods but provides the first approaches for local leadership and coordination mechanism classification. Our results are consistent with ground-truthed biological data and the framework finds many known events in financial data which are not otherwise reflected in the aggregate NASDAQ index. Our method is easily generalizable to any coordinated time-series data from interacting entities.


How Do You #relax When You're #stressed? A Content Analysis and Infodemiology Study of Stress-Related Tweets

arXiv.org Artificial Intelligence

Background: Stress is a contributing factor to many major health problems in the United States, such as heart disease, depression, and autoimmune diseases. Relaxation is often recommended in mental health treatment as a frontline strategy to reduce stress, thereby improving health conditions. Objective: The objective of our study was to understand how people express their feelings of stress and relaxation through Twitter messages. Methods: We first performed a qualitative content analysis of 1326 and 781 tweets containing the keywords "stress" and "relax", respectively. We then investigated the use of machine learning algorithms to automatically classify tweets as stress versus non stress and relaxation versus non relaxation. Finally, we applied these classifiers to sample datasets drawn from 4 cities with the goal of evaluating the extent of any correlation between our automatic classification of tweets and results from public stress surveys. Results: Content analysis showed that the most frequent topic of stress tweets was education, followed by work and social relationships. The most frequent topic of relaxation tweets was rest and vacation, followed by nature and water. When we applied the classifiers to the cities dataset, the proportion of stress tweets in New York and San Diego was substantially higher than that in Los Angeles and San Francisco. Conclusions: This content analysis and infodemiology study revealed that Twitter, when used in conjunction with natural language processing techniques, is a useful data source for understanding stress and stress management strategies, and can potentially supplement infrequently collected survey-based stress data.