 Performance Analysis


Adversarial Transferability in Wearable Sensor Systems

arXiv.org Machine Learning

Machine learning has increasingly become the most used approach for inference and decision making in wearable sensor systems. However, recent studies have found that machine learning systems are easily fooled by the addition of adversarial perturbation to their inputs. What is more interesting is that the adversarial examples generated for one machine learning system can also degrade the performance of another. This property of adversarial examples is called transferability. In this work, we take the first strides in studying adversarial transferability in wearable sensor systems, from the following perspectives: 1) Transferability between machine learning models, 2) Transferability across subjects, 3) Transferability across sensor locations, and 4) Transferability across datasets. With Human Activity Recognition (HAR) as an example sensor system, we found strong untargeted transferability in all cases of transferability. Specifically, gradient-based attacks were able to achieve higher misclassification rates compared to non-gradient attacks. The misclassification rate of untargeted adversarial examples ranged from 20% to 98%. For targeted transferability between machine learning models, the success rate of adversarial examples was 100% for iterative attack methods. However, the success rate for other types of targeted transferability ranged from 20% to 0%. Our findings strongly suggest that adversarial transferability has serious consequences not only in sensor systems but also across the broad spectrum of ubiquitous computing.
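
The transferability idea can be illustrated with a minimal sketch. It is not the paper's experimental setup: the two linear classifiers, their weights, the input, and the step size `eps` are invented for illustration, and the fast gradient sign method (FGSM) stands in for the gradient-based attacks the abstract mentions without naming. The perturbation is computed only from the source model and then evaluated on the target model.

```python
import math

def score(weights, bias, x):
    """Linear decision score w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Binary prediction: class 1 if the score is positive, else class 0."""
    return 1 if score(weights, bias, x) > 0 else 0

def fgsm(weights, bias, x, label, eps):
    """Untargeted FGSM against a logistic model: step along the sign of the loss gradient."""
    prob = 1.0 / (1.0 + math.exp(-score(weights, bias, x)))
    # Gradient of the logistic loss with respect to the input is (p - y) * w.
    grad = [(prob - label) * w for w in weights]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical models: the attacker only has access to `source`.
source = ([1.0, -1.0], 0.0)
target = ([0.8, -1.2], 0.1)

x, label = [0.3, 0.1], 1
x_adv = fgsm(*source, x, label, eps=0.2)

clean_pred = predict(*target, x)    # target model on the clean input
adv_pred = predict(*target, x_adv)  # target model on the transferred example
transferred = clean_pred == label and adv_pred != label
```

In this toy case the example crafted on the source model also flips the target model's prediction, which is what untargeted transferability means; with real HAR models the outcome would depend on the models, data, and attack strength.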


ParKCa: Causal Inference with Partially Known Causes

arXiv.org Machine Learning

Causal Inference methods based on observational data are an alternative for applications where collecting the counterfactual data or realizing a more standard experiment is not possible. In this work, our goal is to combine several observational causal inference methods to learn new causes in applications where some causes are well known. We validate the proposed method on The Cancer Genome Atlas (TCGA) dataset to identify genes that potentially cause metastasis.


Key Phrase Classification in Complex Assignments

arXiv.org Artificial Intelligence

Complex assignments typically consist of open-ended questions with large and diverse content in the context of both classroom and online graduate programs. With the sheer scale of these programs comes a variety of problems in peer and expert feedback, including rogue reviews. With the hope of identifying the important content needed for review, we present a first work on key phrase classification, with a detailed empirical study of traditional and recent language modeling approaches. From this study, we find that the task of classifying key phrases is ambiguous at a human level, producing a Cohen's kappa of 0.77 on a new data set. Both pretrained language models and simple TFIDF SVM classifiers produce similar results, with the former producing on average 0.6 F1 higher than the latter. Finally, we derive practical advice from our extensive empirical and model interpretability results for those interested in key phrase classification from educational reports in the future.
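
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's kappa is computed for two annotators. The label names and the two annotation lists are invented for illustration and are not the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations of ten phrases.
annotator_1 = ["key", "key", "other", "other", "key", "other", "key", "other", "key", "key"]
annotator_2 = ["key", "other", "other", "other", "key", "other", "key", "key", "key", "key"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value of 1 means perfect agreement and 0 means agreement no better than chance.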


AutoCogniSys: IoT Assisted Context-Aware Automatic Cognitive Health Assessment

arXiv.org Artificial Intelligence

Cognitive impairment has become an epidemic in the older adult population. The recent advent of tiny wearable and ambient devices, a.k.a. the Internet of Things (IoT), provides ample platforms for continuous functional and cognitive health assessment of older adults. In this paper, we design, implement and evaluate AutoCogniSys, a context-aware automated cognitive health assessment system that combines the sensing powers of wearable physiological (Electrodermal Activity, Photoplethysmography) and physical (Accelerometer, Object) sensors with ambient sensors. We design appropriate signal processing and machine learning techniques, and develop an automatic cognitive health assessment system for a natural older-adult living environment. We validate our approaches using two datasets: (i) naturalistic sensor data streams related to Activities of Daily Living and mental arousal of 22 older adults recruited in a retirement community center, individually living in their own apartments, collected using a customized inexpensive IoT system (IRB #HP-00064387) and (ii) a publicly available dataset for emotion detection. AutoCogniSys attains a maximum accuracy of 93% in assessing the cognitive health of older adults.


Why understanding your fraud false-positive rate is key to growing your business

#artificialintelligence

'Ecommerce businesses have a problem - one that causes lost customer revenue, yet has been historically nearly impossible to solve' (Geoff Huang, VP of Product at Sift). The problem stems from the inability to know their false-positive rate, which is the percentage of orders from legitimate customers that are mistakenly blocked as fraud. According to a survey conducted by CNP, 42% of ecommerce merchants don't know their false-positive rate (also known as customer insult rate). That is a startling statistic: nearly half of online sellers have no visibility into the number of good orders they inadvertently block or the subsequent revenue lost from those orders. And the news, unfortunately, doesn't get much better. Sift polled 1,000 adult consumers and found that roughly 25% of insulted online shoppers (those who were falsely declined) will take their business to a competitor.
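
The definition above translates directly into a small calculation. The sketch below assumes each order record carries two hypothetical fields, `is_fraud` (the ground truth, typically known only after the fact) and `blocked` (the merchant's decision); the sample orders are invented.

```python
def false_positive_rate(orders):
    """Share of legitimate orders that were blocked as fraud."""
    legitimate = [o for o in orders if not o["is_fraud"]]
    if not legitimate:
        raise ValueError("no legitimate orders to measure against")
    wrongly_blocked = sum(1 for o in legitimate if o["blocked"])
    return wrongly_blocked / len(legitimate)

# Hypothetical order log: five legitimate orders (one blocked) and two fraudulent ones.
orders = [
    {"is_fraud": False, "blocked": False},
    {"is_fraud": False, "blocked": False},
    {"is_fraud": False, "blocked": True},   # an "insulted" customer
    {"is_fraud": False, "blocked": False},
    {"is_fraud": False, "blocked": False},
    {"is_fraud": True, "blocked": True},
    {"is_fraud": True, "blocked": True},
]
rate = false_positive_rate(orders)
```

Note that the denominator is the number of legitimate orders, not the number of blocked orders, so correctly blocked fraud does not affect the rate.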


DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift

arXiv.org Machine Learning

When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate and require training a new model. We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process. The advantage of our approach is that we can use aggressive drift detection in the stable state to achieve a high detection rate, but mitigate the false positive rate of standalone drift detection via a reactive state that reacts quickly to true drifts while eliminating most false positives. The algorithm is generic in its base learner and can be applied across a variety of supervised learning problems. Our theoretical analysis shows that the risk of the algorithm is competitive to an algorithm with oracle knowledge of when (abrupt) drifts occur. Experiments on synthetic and real datasets with concept drifts confirm our theoretical analysis.


Towards a Resilient Machine Learning Classifier -- a Case Study of Ransomware Detection

arXiv.org Machine Learning

The damage caused by crypto-ransomware, due to encryption, is difficult to revert and causes data loss. In this paper, a machine learning (ML) classifier was built for early detection of ransomware that uses cryptography (called crypto-ransomware) based on program behavior. If signature-based detection misses a sample, a behavior-based detector can be the last line of defense to detect and contain the damage. We find that the input/output activities of ransomware and file-content entropy are unique traits for detecting crypto-ransomware. A deep-learning (DL) classifier can detect ransomware with high accuracy and a low false positive rate. We conduct adversarial research against the generated models. We use simulated ransomware programs to launch a gray-box analysis that probes the weaknesses of ML classifiers and improves model robustness. In addition to accuracy and resiliency, trustworthiness is the other key criterion for a quality detector. Making sure that the correct information is used for inference is important for a security application. The Integrated Gradient method was used to explain the deep learning model and also to reveal why false negatives evade detection. Approaches to building and evaluating a real-world detector are demonstrated and discussed.
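
File-content entropy, one of the traits named above, is commonly measured as the Shannon entropy of the byte distribution. The sketch below shows that general measure under that assumption; the abstract does not say how the authors compute or window their entropy feature, and the function name is illustrative.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

repetitive = shannon_entropy(b"aaaaaaaa")    # a single repeated byte value
uniform = shannon_entropy(bytes(range(256))) # every byte value exactly once
```

Encrypted or compressed content tends to have a near-uniform byte distribution and therefore entropy close to 8 bits per byte, while plain text and many structured file formats score lower, which is why a jump in the entropy of written files can hint at encryption activity.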


Automating Botnet Detection with Graph Neural Networks

arXiv.org Machine Learning

Botnets are now a major source for many network attacks, such as DDoS attacks and spam. However, most traditional detection methods heavily rely on heuristically designed multi-stage detection criteria. In this paper, we consider the neural network design challenges of using modern deep learning techniques to learn policies for botnet detection automatically. To generate training data, we synthesize botnet connections with different underlying communication patterns overlaid on large-scale real networks as datasets. To capture the important hierarchical structure of centralized botnets and the fast-mixing structure for decentralized botnets, we tailor graph neural networks (GNN) to detect the properties of these structures. Experimental results show that GNNs are better able to capture botnet structure than previous non-learning methods when trained with appropriate data, and that deeper GNNs are crucial for learning difficult botnet topologies. We believe our data and studies can be useful for both the network security and graph learning communities.


TF-IDFC-RF: A Novel Supervised Term Weighting Scheme

arXiv.org Machine Learning

Sentiment Analysis is a branch of Affective Computing usually considered a binary classification task. In this line of reasoning, Sentiment Analysis can be applied in several contexts to classify the attitude expressed in text samples, for example, movie reviews and sarcasm, among others. A common approach to representing text samples is to use the Vector Space Model to compute numerical feature vectors consisting of the weights of terms. The most popular term weighting scheme is TF-IDF (Term Frequency - Inverse Document Frequency). It is an Unsupervised Weighting Scheme (UWS) since it does not consider the class information in the weighting of terms. Apart from that, there are Supervised Weighting Schemes (SWS), which consider the class information in the term weighting calculation. Several SWS have been recently proposed, demonstrating better results than TF-IDF. In this scenario, this work presents a comparative study of different term weighting schemes and proposes a novel supervised term weighting scheme, named TF-IDFC-RF (Term Frequency - Inverse Document Frequency in Class - Relevance Frequency). The effectiveness of TF-IDFC-RF is validated with SVM (Support Vector Machine) and NB (Naive Bayes) classifiers on four commonly used Sentiment Analysis datasets. TF-IDFC-RF outperforms all other weighting schemes and achieves F1 results of more than 99.9% on all datasets with the SVM classifier.
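
As background for the baseline mentioned above, here is a minimal sketch of plain, unsupervised TF-IDF. It assumes one common variant (raw term count multiplied by the natural log of N/df); other variants add smoothing or normalization. It does not implement the proposed TF-IDFC-RF scheme, whose formula is not given in the abstract, and the toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already tokenized review snippets.
docs = [
    ["great", "movie", "great"],
    ["bad", "movie"],
    ["great", "acting"],
]
weights = tf_idf(docs)
```

Class labels never enter this calculation, which is what makes the scheme unsupervised; a supervised scheme would additionally use how a term is distributed across the positive and negative classes.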


Inline Detection of DGA Domains Using Side Information

arXiv.org Machine Learning

Malware applications typically use a command and control (C&C) server to manage bots to perform malicious activities. Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names that can be used to establish communication between an infected bot and the C&C server. In recent years, machine learning based systems have been widely used to detect DGAs. There are several well known state-of-the-art classifiers in the literature that can detect DGA domain names in real-time applications with high predictive performance. However, these DGA classifiers are highly vulnerable to adversarial attacks in which adversaries purposely craft domain names to evade DGA detection classifiers. In our work, we focus on hardening DGA classifiers against adversarial attacks. To this end, we train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself. Additionally, the side information features are selected such that they are easily obtainable in practice to perform inline DGA detection. The performance and robustness of these models are assessed by exposing them to one day of real-traffic data as well as domains generated by adversarial attack algorithms. We found that the DGA classifiers that rely on both the domain name and side information have high performance and are more robust against adversaries.