Accuracy
Identifying Contributing Factors of Occupant Thermal Discomfort in a Smart Building
Basak, Aniruddha (Carnegie Mellon University, Silicon Valley Campus) | Mengshoel, Ole (Carnegie Mellon University, Silicon Valley Campus) | Hosein, Stefan (University of the West Indies, St. Augustine) | Martin, Rodney (NASA Ames Research Center) | Jayakumaran, Jayasudha (Carnegie Mellon University, Silicon Valley Campus) | Morga, Mario Gurrola (Zapopan's Superior Institute of Technology) | Aghav, Ishwari (Carnegie Mellon University, Silicon Valley Campus)
Modeling occupant behavior in smart buildings more accurately, with the aim of reducing energy usage, has garnered much recent attention in the literature. Predicting occupant comfort in buildings is a related and challenging problem. In some smart buildings, such as NASA Ames Sustainability Base, there are discrepancies between occupants' actual thermal discomfort and the readings of sensors that are intended to characterize thermal comfort and are based upon a weighted average of wet bulb, dry bulb, and mean radiant temperature. In this paper we attempt to find other contributing factors to occupant discomfort. For our experiment we use a dataset from a Building Automation System (BAS) in NASA Sustainability Base. We choose one conference room for our experiment and empirically establish the thermal discomfort level for the room's temperature sensor. We use various causality metrics and causal graphs to isolate candidate causes of the target room temperature. We then compare these feature sets according to their ability to predict future instances of discomfort. Moreover, we establish a trade-off between computational and statistical performance of adverse event prediction.
A Novel Method for Mining Semantics from Patterns over ECG Data
Qiu, Zhen (Peking University) | Li, Feifei (Peking University) | Hong, Shenda (Peking University) | Li, Hongyan (Peking University)
In intensive care units (ICU), electrocardiogram (ECG) waveforms show diverse variations under different patients' physical conditions. In general, physicians can diagnose patients efficiently by detecting any disorder of heart rate or rhythm and any change in the morphological pattern of ECG data, which contain underlying semantics. To help physicians better analyze ECG data in a fairly short time, it is essential to develop a novel method for mining semantics from ECG patterns. This paper is the first to characterize ECG patterns using a Prefix Scalable Pattern Tree (PSP-Tree). Compared with similar existing methods, PSP-Tree can mine significant semantics, such as scalability, temporality and hierarchy, over ECG patterns. We conduct extensive experiments on real ECG data sets obtained from PhysioBank Community and Beijing No.3 People Hospital. The experimental results show that our method is more feasible and effective than other related work.
Adaptive Ensemble Learning with Confidence Bounds for Personalized Diagnosis
Tekin, Cem (Bilkent University) | Yoon, Jinsung (University of California, Los Angeles) | Schaar, Mihaela van der (University of California, Los Angeles)
With the advances in the field of medical informatics, automated clinical decision support systems are becoming the de facto standard in personalized diagnosis. In order to establish high accuracy and confidence in personalized diagnosis, massive amounts of distributed, heterogeneous, correlated and high-dimensional patient data from different sources such as wearable sensors, mobile applications, Electronic Health Record (EHR) databases etc. need to be processed. This requires learning both locally and globally due to privacy constraints and/or distributed nature of the multi-modal medical data. In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally-collected data instances, and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, these guarantees are asymptotic. None of these existing works provide confidence estimates about the issued predictions or rate of learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long run (asymptotic) and short run (rate of learning) performance guarantees. Moreover, we show that our proposed method outperforms all existing ensemble learning techniques, even in the presence of concept drift.
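The fusion step described in this abstract, where local learners feed predictions to an ensemble learner, can be illustrated with a textbook exponential-weights (Hedge-style) weighted majority vote. This is only a sketch of that general idea and not the Hedged Bandits method itself. The function name `hedge_ensemble`, the learning rate `eta`, and the toy predictions are all assumptions made for illustration.

```python
import math


def hedge_ensemble(expert_predictions, labels, eta=0.5):
    """Fuse binary predictions from local learners with exponential weights.

    expert_predictions: list of rounds, each a list of 0/1 predictions
                        (one entry per local learner).
    labels: true 0/1 labels, revealed after each round.
    Returns (ensemble_predictions, final_weights).
    """
    n_learners = len(expert_predictions[0])
    weights = [1.0] * n_learners
    ensemble = []
    for preds, y in zip(expert_predictions, labels):
        total = sum(weights)
        vote_for_one = sum(w for w, p in zip(weights, preds) if p == 1)
        # Weighted majority vote; ties go to class 1.
        ensemble.append(1 if vote_for_one >= total / 2 else 0)
        # Shrink the weight of every learner that was wrong this round.
        weights = [
            w * math.exp(-eta) if p != y else w
            for w, p in zip(weights, preds)
        ]
    return ensemble, weights


# Hypothetical toy data: three local learners, four rounds.
rounds = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [1, 0, 0]]
truth = [1, 0, 1, 1]
predictions, final_weights = hedge_ensemble(rounds, truth)
print(predictions)
print(final_weights)
```

In this toy run the first learner is never wrong, so its weight stays at 1.0 while the other two decay, and the vote increasingly follows the reliable learner.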
Predicting 30-Day Risk and Cost of "All-Cause" Hospital Readmissions
Sushmita, Shanu (University of Washington, Tacoma) | Khulbe, Garima (University of Washington, Tacoma) | Hasan, Aftab (University of Washington, Tacoma) | Newman, Stacey (University of Washington, Tacoma) | Ravindra, Padmashree (University of Washington, Tacoma) | Roy, Senjuti Basu (University of Washington, Tacoma) | Cock, Martine De (University of Washington, Tacoma) | Teredesai, Ankur (University of Washington, Tacoma)
The hospital readmission rate of patients within 30 days after discharge is broadly accepted as a healthcare quality measure and cost driver in the United States. The ability to estimate hospitalization costs alongside 30-day risk-stratification for such readmissions provides additional benefit for accountable care, now a global issue and foundation for the U.S. government mandate under the Affordable Care Act. Recent data mining efforts either predict healthcare costs or risk of hospital readmission, but not both. In this paper we present a dual predictive modeling effort that utilizes healthcare data to predict the risk and cost of any hospital readmission ("all-cause"). For this purpose, we explore machine learning algorithms to accurately predict healthcare costs and risk of 30-day readmission. Results on risk prediction for "all-cause" readmission compared to the standardized readmission tool (LACE) are promising, and the proposed techniques for cost prediction consistently outperform baseline models and demonstrate substantially lower mean absolute error (MAE).
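As a small illustration of the MAE comparison mentioned above, the sketch below scores a cost model against a predict-the-mean baseline. The cost figures are invented for the example and are not taken from the paper, and the choice of a mean baseline is an assumption, since the abstract does not say which baselines were used.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical readmission costs (in dollars) and model estimates.
actual_costs = [1000.0, 2500.0, 400.0, 8000.0]
model_costs = [1200.0, 2300.0, 500.0, 7500.0]

# Assumed baseline: always predict the mean of the actual costs.
mean_cost = sum(actual_costs) / len(actual_costs)
baseline_costs = [mean_cost] * len(actual_costs)

model_mae = mean_absolute_error(actual_costs, model_costs)
baseline_mae = mean_absolute_error(actual_costs, baseline_costs)
print(model_mae, baseline_mae)
```

A lower MAE means the predicted costs are, on average, closer to the actual costs in the same units as the target.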
Automatic Label Correction and Appliance Prioritization in Single Household Electricity Disaggregation
Valovage, Mark (University of Minnesota) | Gini, Maria (University of Minnesota)
Electricity disaggregation focuses on classification of individual appliances by monitoring aggregate electrical signals. In this paper we present a novel algorithm to automatically correct labels, discard contaminated training samples, and boost signal to noise ratio through high frequency noise reduction. We also propose a method for prioritized classification which classifies appliances with the most intense signals first. When tested on four houses in Kaggle's Belkin dataset, these methods automatically relabel over 77% of all training samples and decrease error rate by an average of 45% in both real power and high frequency noise classification.
Active Perception for Cyber Intrusion Detection and Defense
Benton, J. (Smart Information Flow Technologies, LLC) | Goldman, Robert P. (Smart Information Flow Technologies, LLC) | Burstein, Mark (Smart Information Flow Technologies, LLC) | Mueller, Joseph (Smart Information Flow Technologies, LLC) | Robertson, Paul (DOLL Labs) | Cerys, Dan (DOLL Labs) | Hoffman, Andreas (DOLL Labs) | Bobrow, Rusty (Bobrow Computational Intelligence, LLC)
Most modern network-based intrusion detection systems (IDSs) passively monitor network traffic to identify possible attacks through known vectors. Though useful, this approach has widely known high false positive rates, often causing administrators to suffer from a "cry wolf effect," where they ignore all warnings because so many have been false. In this paper, we focus on a method to reduce this effect using an idea borrowed from computer vision and neuroscience called active perception. Our approach is informed by theoretical ideas from decision theory and recent research results in neuroscience. The active perception agent allocates computational and sensing resources to (approximately) optimize its Value of Information. To do this, it draws on models to direct sensors towards phenomena of greatest interest to inform decisions about cyber defense actions. By identifying critical network assets, the organization's mission measures self-interest (and value of information). This model enables the system to follow leads from inexpensive, inaccurate alerts with targeted use of expensive, accurate sensors. This allows the deployment of sensors to build structured interpretations of situations. From these, an organization can meet mission-centered decision-making requirements with calibrated responses proportional to the likelihood of true detection and degree of threat.
Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply
Bonatti, Rogerio (Universidade de Sao Paulo) | Paula, Arthur G. de (Universidade de Sao Paulo) | Lamarca, Victor S. (Universidade de Sao Paulo) | Cozman, Fabio G. (Universidade de Sao Paulo)
We study automatic replies to business email messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with non-lemmatized selection of verbs and nouns, adjectives and adverbs was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
Discovering Human and Machine Readable Descriptions of Malware Families
Anderson, Blake (Cisco Systems, Inc.) | McGrew, David (Cisco Systems, Inc.) | Paul, Subharthi (Cisco Systems, Inc.)
While an immense amount of work has gone into novel clustering algorithms, little work has focused on developing compact, domain-specific explanations for the results of the clustering algorithms. Attaching semantic meaning to a cluster has numerous benefits, including the ability for such a description to be both human and machine readable. In this paper, we assume that the clusters are given to us, and find the minimal set of features that can differentiate one cluster from the remaining set of samples. We formulate this problem as an integer linear program. Because samples not belonging to the cluster are used in the optimization formulation, the resulting description will be minimal and contain no false positives. The efficacy of this method is demonstrated on simulation data and real-world malware data run in a sandbox that collects behavioral characteristics. In the case of malware, a clustered sample would traditionally be sent to a reverse engineer tasked with determining the actual meaning of the clustering results and disseminating this information through signatures or indicators of compromise. This is a time-consuming process that can take hours to weeks depending on the complexity of the malware family. The methods presented in this paper automatically generate optimal signatures, which can then be quickly propagated to help contain the spread of a malware family.
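To make the idea of a minimal, false-positive-free cluster description concrete, here is a small brute-force sketch. It is not the integer linear program from the paper: it simply tries feature subsets in order of size, which is only practical for tiny inputs. It also assumes one particular reading of the problem, namely that a signature is a set of features present in every cluster sample and never present all together in any outside sample. The function name and the feature names are hypothetical.

```python
from itertools import combinations


def minimal_signature(cluster, others):
    """Smallest set of shared cluster features that no outside sample fully has.

    cluster: non-empty list of feature sets for samples inside the cluster.
    others:  list of feature sets for samples outside the cluster.
    Returns a set of features, or None if no such signature exists.
    """
    # Only features present in every cluster sample can be in the signature.
    shared = sorted(set.intersection(*cluster))
    for size in range(1, len(shared) + 1):
        for candidate in combinations(shared, size):
            chosen = set(candidate)
            # Reject the candidate if any outside sample matches it entirely
            # (that would be a false positive).
            if not any(chosen <= sample for sample in others):
                return chosen
    return None


# Hypothetical behavioral features collected from a sandbox.
cluster = [
    {"writes_registry", "opens_socket", "drops_dll"},
    {"writes_registry", "opens_socket", "drops_dll", "reads_hosts"},
]
others = [
    {"writes_registry", "opens_socket"},
    {"drops_dll", "writes_registry"},
    {"reads_hosts"},
]
signature = minimal_signature(cluster, others)
print(signature)
```

Because subsets are tried from smallest to largest, the first one returned has minimum size. An ILP solver reaches the same kind of answer without enumerating every subset.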
Validation of Matching
Le, Ya | Bax, Eric | Barbieri, Nicola | Soriano, David Garcia | Mehta, Jitesh | Li, James
Our matching problem setting is similar to the transductive setting for classification, from Vapnik [9]. In that setting there is a set of training examples with known inputs and class labels and a set of working examples with known inputs and unknown class labels, and the goal is to use the available training and working data to develop a classifier that classifies the working examples with a low error rate. For results on validation of network classifiers (rather than reconciliation algorithms) in transductive settings, refer to [10] and [11]. For theory and insight on why collective classification succeeds in general settings and validation methods for it, refer to [12]. For network reconciliation, we assume that we know some network data, consisting of some node data and the links, for both networks involved in the matching, and our goal is to use that network data to match nodes as accurately as possible between the networks. This paper presents a technique to compute probably approximately correct (PAC) bounds on the precision and recall of matching algorithms.
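For readers unfamiliar with PAC-style bounds, the sketch below computes a one-sided Hoeffding lower bound on precision from a manually checked sample of matches. It is a generic textbook bound, not the technique developed in the paper, and it assumes the checked matches were drawn independently and uniformly at random from the matches the algorithm produced. The function name and the numbers are hypothetical.

```python
import math


def hoeffding_lower_bound(correct, sampled, delta=0.05):
    """Lower bound on true precision holding with probability >= 1 - delta.

    correct: number of sampled matches verified as correct.
    sampled: number of matches checked (assumed i.i.d. uniform draws).
    """
    if sampled <= 0 or not 0 <= correct <= sampled:
        raise ValueError("need 0 <= correct <= sampled and sampled > 0")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    # Hoeffding: P(observed - true >= eps) <= exp(-2 * n * eps**2).
    epsilon = math.sqrt(math.log(1 / delta) / (2 * sampled))
    return max(0.0, correct / sampled - epsilon)


# Hypothetical check: 180 of 200 sampled matches were correct.
bound = hoeffding_lower_bound(180, 200, delta=0.05)
print(round(bound, 4))
```

The gap between the observed precision and the bound shrinks in proportion to one over the square root of the sample size, so quadrupling the number of checked matches halves it.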
7 Important Model Evaluation Error Metrics Everyone should know
Predictive modeling works on a constructive feedback principle: get feedback from metrics, make improvements, and continue until you achieve a desirable accuracy. Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results. Many practitioners, once they finish building a model, hurriedly map predicted values on unseen data. This is an incorrect approach. Simply building a predictive model is not your goal; the goal is to create and select a model that gives high accuracy on out-of-sample data.
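The difference between in-sample and out-of-sample accuracy can be seen in a minimal sketch like the one below. It uses a one-nearest-neighbour rule on a single numeric feature because such a model memorizes its training data. The data points, labels, and function names are made up for illustration.

```python
def accuracy(model, examples):
    """Fraction of (x, y) examples for which model(x) equals y."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


def fit_one_nearest_neighbour(train):
    """Return a predictor that copies the label of the closest training point."""
    def predict(x):
        nearest = min(train, key=lambda example: abs(example[0] - x))
        return nearest[1]
    return predict


# Hypothetical data: (feature, label). The point at 3.0 has a noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (6.0, 1), (7.0, 1)]
held_out = [(1.5, 0), (2.9, 0), (3.4, 0), (5.5, 1)]

model = fit_one_nearest_neighbour(train)
train_accuracy = accuracy(model, train)
held_out_accuracy = accuracy(model, held_out)
print(train_accuracy, held_out_accuracy)
```

The model scores perfectly on the data it has seen and much worse on the held-out points, which is why a metric should be computed on data that played no part in fitting.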