Goto

Collaborating Authors

 Accuracy


Interpretable Machine Learning Approaches to Prediction of Chronic Homelessness

arXiv.org Artificial Intelligence

A 2016 report claims that annually upwards of 235 000 Canadians endure periods of homelessness, with approximately 35 000 individuals lacking a place to stay each night [1]. Between 2005 and 2014, there was a downward trend in the total number of Canadians using shelters; however, the occupancy rates of shelters has been increasing [1]. One factor accounting for this ongoing decrease in the number of homeless individuals paired with an increase in shelter occupancy is an increase in chronic homelessness. London's Homeless Prevention division identifies an individual as chronically homelessness if they have spent 6 or more months ( 180 days) of the last year in a shelter, which was based on the definition of chronic homelessness outlined by the Canadian government's homelessness strategy directives [2]. In addition to this trend, the demographics of homelessness are changing in Canada. In preceding decades, older, single males are over-represented in the homeless population; in contrast, the homeless population of today is increasingly diverse, with families, women, and youth comprising a greater fraction [1].


DART: Data Addition and Removal Trees

arXiv.org Machine Learning

How can we update data for a machine learning model after it has already trained on that data? In this paper, we introduce DART, a variant of random forests that supports adding and removing training data with minimal retraining. Data updates in DART are exact, meaning that adding or removing examples from a DART model yields exactly the same model as retraining from scratch on updated data. DART uses two techniques to make updates efficient. The first is to cache data statistics at each node and training data at each leaf, so that only the necessary subtrees are retrained. The second is to choose the split variable randomly at the upper levels of each tree, so that the choice is completely independent of the data and never needs to change. At the lower levels, split variables are chosen to greedily maximize a split criterion such as Gini index or mutual information. By adjusting the number of random-split levels, DART can trade off between more accurate predictions and more efficient updates. In experiments on ten real-world datasets and one synthetic dataset, we find that DART is orders of magnitude faster than retraining from scratch while sacrificing very little in terms of predictive performance.


Learning Interpretable Characteristic Kernels via Decision Forests

arXiv.org Machine Learning

Decision forests are popular tools for classification and regression. These forests naturally produce proximity matrices measuring how often each pair of observations lies in the same leaf node. It has been demonstrated that these proximity matrices can be thought of as kernels, connecting the decision forest literature to the extensive kernel machine literature. While other kernels are known to have strong theoretical properties such as being characteristic, no similar result is available for any decision forest based kernel. In this manuscript, we prove that the decision forest induced proximity can be made characteristic, which can be used to yield a universally consistent statistic for testing independence. We demonstrate the performance of the induced kernel on a suite of 20 high-dimensional independence test settings. We also show how this learning kernel offers insights into relative feature importance. The decision forest induced kernel typically achieves substantially higher testing power than existing popular methods in statistical tests.


Response to Comment on "No consistent ENSO response to volcanic forcing over the last millennium"

Science

Robock claims that our analysis fails to acknowledge that pan-tropical surface cooling caused by large volcanic eruptions may mask El Niรฑo warming at our central Pacific site, potentially obscuring a volcanoโ€“El Niรฑo connection suggested in previous studies. Although observational support for a dynamical response linking volcanic cooling to El Niรฑo remains ambiguous, Robock raises some important questions about our study that we address here. Modeling studies suggest that the El Niรฑoโ€“Southern Oscillation (ENSO) is sensitive to sulfate aerosol forcing associated with explosive volcanism, yet observational support for a dynamical chain of events linking large volcanic cooling to El Niรฑo occurrences remains inconclusive. In Dee et al. (1), we used absolutely dated fossil corals from the central tropical Pacific to test ENSO's response to large volcanic eruptions. Superposed epoch analysis reveals a weak tendency for an El Niรฑoโ€“like response in the year after an eruption, but this response is not statistically significant, nor does it appear after the outsized 1257 Samalas eruption.


Machine Learning Applications in Misuse and Anomaly Detection

arXiv.org Artificial Intelligence

Machine learning and data mining algorithms play important roles in designing intrusion detection systems. Based on their approaches toward the detection of attacks in a network, intrusion detection systems can be broadly categorized into two types. In the misuse detection systems, an attack in a system is detected whenever the sequence of activities in the network matches with a known attack signature. In the anomaly detection approach, on the other hand, anomalous states in a system are identified based on a significant difference in the state transitions of the system from its normal states. This chapter presents a comprehensive discussion on some of the existing schemes of intrusion detection based on misuse detection, anomaly detection and hybrid detection approaches. Some future directions of research in the design of algorithms for intrusion detection are also identified.


Network Traffic Analysis based IoT Device Identification

arXiv.org Artificial Intelligence

Device identification is the process of identifying a device on Internet without using its assigned network or other credentials. The sharp rise of usage in Internet of Things (IoT) devices has imposed new challenges in device identification due to a wide variety of devices, protocols and control interfaces. In a network, conventional IoT devices identify each other by utilizing IP or MAC addresses, which are prone to spoofing. Moreover, IoT devices are low power devices with minimal embedded security solution. To mitigate the issue in IoT devices, fingerprint (DFP) for device identification can be used. DFP identifies a device by using implicit identifiers, such as network traffic (or packets), radio signal, which a device used for its communication over the network. These identifiers are closely related to the device hardware and software features. In this paper, we exploit TCP/IP packet header features to create a device fingerprint utilizing device originated network packets. We present a set of three metrics which separate some features from a packet which contribute actively for device identification. To evaluate our approach, we used publicly accessible two datasets. We observed the accuracy of device genre classification 99.37% and 83.35% of accuracy in the identification of an individual device from IoT Sentinel dataset. However, using UNSW dataset device type identification accuracy reached up to 97.78%.


Large-scale nonlinear Granger causality: A data-driven, multivariate approach to recovering directed networks from short time-series data

arXiv.org Machine Learning

To gain insight into complex systems it is a key challenge to infer nonlinear causal directional relations from observational time-series data. Specifically, estimating causal relationships between interacting components in large systems with only short recordings over few temporal observations remains an important, yet unresolved problem. Here, we introduce a large-scale Nonlinear Granger Causality (lsNGC) approach for inferring directional, nonlinear, multivariate causal interactions between system components from short high-dimensional time-series recordings. By modeling interactions with nonlinear state-space transformations from limited observational data, lsNGC identifies casual relations with no explicit a priori assumptions on functional interdependence between component time-series in a computationally efficient manner. Additionally, our method provides a mathematical formulation revealing statistical significance of inferred causal relations. We extensively study the ability of lsNGC to recovering network structure from two-node to thirty-four node chaotic time-series systems. Our results suggest that lsNGC captures meaningful interactions from limited observational data, where it performs favorably when compared to traditionally used methods. Finally, we demonstrate the applicability of lsNGC to estimating causality in large, real-world systems by inferring directional nonlinear, multivariate causal relationships among a large number of relatively short time-series acquired from functional Magnetic Resonance Imaging (fMRI) data of the human brain.


A Gentle Introduction to Self-Training and Semi-Supervised Learning

#artificialintelligence

When it comes to machine learning classification tasks, the more data available to train algorithms, the better. In supervised learning, this data must be labeled with respect to the target class -- otherwise, these algorithms wouldn't be able to learn the relationships between the independent and target variables. So, what if we only have enough time and money to label some of a large data set, and choose to leave the rest unlabeled? Can this unlabeled data somehow be used in a classification algorithm? This is where semi-supervised learning comes in.


On the Identification of Fair Auditors to Evaluate Recommender Systems based on a Novel Non-Comparative Fairness Notion

arXiv.org Artificial Intelligence

Decision-support systems are information systems that offer support to people's decisions in various applications such as judiciary, real-estate and banking sectors. Lately, these support systems have been found to be discriminatory in the context of many practical deployments. In an attempt to evaluate and mitigate these biases, algorithmic fairness literature has been nurtured using notions of comparative justice, which relies primarily on comparing two/more individuals or groups within the society that is supported by such systems. However, such a fairness notion is not very useful in the identification of fair auditors who are hired to evaluate latent biases within decision-support systems. As a solution, we introduce a paradigm shift in algorithmic fairness via proposing a new fairness notion based on the principle of non-comparative justice. Assuming that the auditor makes fairness evaluations based on some (potentially unknown) desired properties of the decision-support system, the proposed fairness notion compares the system's outcome with that of the auditor's desired outcome. We show that the proposed fairness notion also provides guarantees in terms of comparative fairness notions by proving that any system can be deemed fair from the perspective of comparative fairness (e.g. individual fairness and statistical parity) if it is non-comparatively fair with respect to an auditor who has been deemed fair with respect to the same fairness notions. We also show that the converse holds true in the context of individual fairness. A brief discussion is also presented regarding how our fairness notion can be used to identify fair and reliable auditors, and how we can use them to quantify biases in decision-support systems.


Machine Intelligence for Outcome Predictions of Trauma Patients During Emergency Department Care

arXiv.org Artificial Intelligence

Trauma mortality results from a multitude of non-linear dependent risk factors including patient demographics, injury characteristics, medical care provided, and characteristics of medical facilities; yet traditional approach attempted to capture these relationships using rigid regression models. We hypothesized that a transfer learning based machine learning algorithm could deeply understand a trauma patient's condition and accurately identify individuals at high risk for mortality without relying on restrictive regression model criteria. Anonymous patient visit data were obtained from years 2007-2014 of the National Trauma Data Bank. Patients with incomplete vitals, unknown outcome, or missing demographics data were excluded. All patient visits occurred in U.S. hospitals, and of the 2,007,485 encounters that were retrospectively examined, 8,198 resulted in mortality (0.4%). The machine intelligence model was evaluated on its sensitivity, specificity, positive and negative predictive value, and Matthews Correlation Coefficient. Our model achieved similar performance in age-specific comparison models and generalized well when applied to all ages simultaneously. While testing for confounding factors, we discovered that excluding fall-related injuries boosted performance for adult trauma patients; however, it reduced performance for children. The machine intelligence model described here demonstrates similar performance to contemporary machine intelligence models without requiring restrictive regression model criteria or extensive medical expertise.