 Performance Analysis


Tuning Parameters for Boosting/Bagging/Random Forest • /r/MachineLearning

@machinelearnbot

Random forests usually perform quite well with the default settings: a bootstrap resampling scheme, unpruned trees, as many trees as you can afford while still getting results in a reasonable amount of time, and sqrt(#features) candidate features tried per split (the mtry parameter). You can then tune these choices by checking the results on out-of-bag data (the samples each tree didn't train on because of the resampling scheme). If you have very unbalanced classes, decide on a measure of interest (such as the true positive rate) and tune the related parameter. Out-of-bag estimates can be trusted almost as much as a proper cross-validation if you use enough trees and bootstrap resampling.
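The out-of-bag tuning loop described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available (the comment does not name a library); the synthetic dataset, the candidate values for `max_features` (scikit-learn's name for mtry), and the tree count are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only so the sketch is self-contained.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, random_state=0
)

def oob_accuracy(max_features):
    """Fit a forest and return its accuracy on out-of-bag samples."""
    forest = RandomForestClassifier(
        n_estimators=200,           # enough trees for a stable OOB estimate
        max_features=max_features,  # features tried per split (mtry)
        bootstrap=True,             # OOB scoring requires bootstrap resampling
        oob_score=True,
        random_state=0,
    )
    forest.fit(X, y)
    return forest.oob_score_

# Hypothetical candidate grid: the sqrt default, two fixed counts, all features.
candidates = ["sqrt", 2, 8, None]
scores = {str(m): oob_accuracy(m) for m in candidates}
best = max(scores, key=scores.get)
print(scores, "best max_features:", best)
```

No separate validation split is needed here because every sample is scored only by the trees that did not see it during training.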


Simple one-pass algorithm for penalized linear regression with cross-validation on MapReduce

arXiv.org Machine Learning

In this paper, we propose a one-pass algorithm on MapReduce for penalized linear regression \[f_\lambda(\alpha, \beta) = \|Y - \alpha\mathbf{1} - X\beta\|_2^2 + p_{\lambda}(\beta)\] where $\alpha$ is the intercept, which can be omitted depending on the application; $\beta$ is the coefficient vector; and $p_{\lambda}$ is the penalty function with penalty parameter $\lambda$. $f_\lambda(\alpha, \beta)$ includes interesting classes such as Lasso, Ridge regression and Elastic-net. Compared to the latest iterative distributed algorithms, which require multiple MapReduce jobs, our algorithm achieves a huge performance improvement; moreover, our algorithm is exact, unlike approximate algorithms such as parallel stochastic gradient descent. What further distinguishes our algorithm from others is that it trains the model with cross-validation to choose the optimal $\lambda$ instead of relying on a user-specified one. Key words: penalized linear regression, lasso, elastic-net, ridge, MapReduce
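The abstract does not spell out the algorithm, so the following is only a sketch of one way a single pass can suffice, not the authors' implementation. It assumes NumPy, handles only the ridge case $p_\lambda(\beta) = \lambda\|\beta\|_2^2$ (which has a closed-form solution), and simulates mappers with in-memory chunks. All function names and the fold rule (row index modulo the number of folds) are hypothetical. The idea is that the sums $n$, $\sum x$, $\sum y$, $X^\top X$, $X^\top Y$ and $Y^\top Y$ are additive, so per-fold sums collected in one pass give both the training fit and the held-out error for every $\lambda$.

```python
import numpy as np

def empty_stats(p):
    return {"n": 0.0, "sx": np.zeros(p), "sy": 0.0,
            "xx": np.zeros((p, p)), "xy": np.zeros(p), "yy": 0.0}

def combine(a, b, sign=1.0):
    # Sufficient statistics are plain sums, so they can be added or subtracted.
    return {k: a[k] + sign * b[k] for k in a}

def one_pass_fold_stats(X, y, n_folds, chunk_size=128):
    """Read every row once; each chunk stands in for one mapper's input split."""
    folds = [empty_stats(X.shape[1]) for _ in range(n_folds)]
    for start in range(0, len(y), chunk_size):
        rows = np.arange(start, min(start + chunk_size, len(y)))
        for k in range(n_folds):
            sel = rows[rows % n_folds == k]  # hypothetical fold assignment
            Xk, yk = X[sel], y[sel]
            part = {"n": float(len(sel)), "sx": Xk.sum(axis=0), "sy": yk.sum(),
                    "xx": Xk.T @ Xk, "xy": Xk.T @ yk, "yy": yk @ yk}
            folds[k] = combine(folds[k], part)
    return folds

def ridge_from_stats(s, lam):
    """Exact ridge solution with an unpenalized intercept, from sums only."""
    xbar, ybar = s["sx"] / s["n"], s["sy"] / s["n"]
    sxx = s["xx"] - s["n"] * np.outer(xbar, xbar)  # centered X^T X
    sxy = s["xy"] - s["n"] * xbar * ybar           # centered X^T Y
    beta = np.linalg.solve(sxx + lam * np.eye(len(xbar)), sxy)
    return ybar - xbar @ beta, beta

def sse_from_stats(s, alpha, beta):
    # Expansion of ||Y - alpha*1 - X beta||^2 in terms of the stored sums.
    return (s["yy"] - 2 * alpha * s["sy"] - 2 * beta @ s["xy"]
            + s["n"] * alpha ** 2 + 2 * alpha * (beta @ s["sx"])
            + beta @ s["xx"] @ beta)

def cross_validate(folds, lambdas):
    total = folds[0]
    for f in folds[1:]:
        total = combine(total, f)
    cv_mse = {}
    for lam in lambdas:
        err = 0.0
        for held_out in folds:
            # Training statistics = all data minus the held-out fold.
            alpha, beta = ridge_from_stats(combine(total, held_out, -1.0), lam)
            err += sse_from_stats(held_out, alpha, beta)
        cv_mse[lam] = err / total["n"]
    best = min(cv_mse, key=cv_mse.get)
    return best, ridge_from_stats(total, best), cv_mse

# Made-up data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 + X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=500)
folds = one_pass_fold_stats(X, y, n_folds=5)
best_lam, (alpha, beta), cv_mse = cross_validate(folds, [0.01, 0.1, 1.0, 10.0, 100.0])
print(best_lam, alpha, beta)
```

For Lasso or Elastic-net there is no closed form; an iterative solver such as coordinate descent would have to run on the same aggregated statistics, which this sketch does not attempt.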


[Question] help in Ridge regression • /r/MachineLearning

@machinelearnbot

This is why Ridge regression is a linear model: the prediction is a linear combination of the input variables, weighted by the model's coefficients.
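To make the point concrete, here is a short sketch in assumed notation (the thread itself gives no formulas): $x$ is a feature vector, $\alpha$ the intercept, $\beta$ the weights, and $\lambda$ the ridge penalty.

\[\hat{y}(x) = \alpha + x^\top \beta = \alpha + \sum_{j} \beta_j x_j, \qquad (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \; \|Y - \alpha\mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2\]

The penalty only changes which weights are selected; the prediction itself stays linear in both the features and the weights.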


Identifying Contributing Factors of Occupant Thermal Discomfort in a Smart Building

AAAI Conferences

Modeling occupant behavior in smart buildings to reduce energy usage in a more accurate fashion has garnered much recent attention in the literature. Predicting occupant comfort in buildings is a related and challenging problem. In some smart buildings, such as NASA Ames Sustainability Base, there are discrepancies between occupants' actual thermal discomfort and sensors based upon a weighted average of wet bulb, dry bulb, and mean radiant temperature intended to characterize thermal comfort. In this paper we attempt to find other contributing factors to occupant discomfort. For our experiment we use a dataset from a Building Automation System (BAS) in NASA Sustainability Base. We choose one conference room for our experiment and empirically establish the thermal discomfort level for the room's temperature sensor. We use various causality metrics and causal graphs to isolate candidate causes of the target room temperature. We then compare these feature sets according to how well they predict future instances of discomfort. Moreover, we establish a trade-off between computational and statistical performance of adverse event prediction.


A Novel Method for Mining Semantics from Patterns over ECG Data

AAAI Conferences

In intensive care units (ICU), electrocardiogram (ECG) waveforms show diverse variations under different patients' physical conditions. In general, physicians can diagnose patients efficiently by detecting any disorder of heart rate or rhythm and any change in the morphological pattern of ECG data, which contain underlying semantics. To help physicians better analyze ECG data in a fairly short time, it is essential to develop a novel method for mining semantics from ECG patterns. This paper is the first to characterize ECG patterns by using a Prefix Scalable Pattern Tree (PSP-Tree). Compared with similar existing methods, PSP-Tree can mine significant semantics, such as scalability, temporality and hierarchy over ECG patterns. We conduct extensive experiments on real ECG data sets obtained from PhysioBank Community and Beijing No.3 People Hospital. The experimental results show that our method performs more feasibly and effectively than other related work.


Adaptive Ensemble Learning with Confidence Bounds for Personalized Diagnosis

AAAI Conferences

With the advances in the field of medical informatics, automated clinical decision support systems are becoming the de facto standard in personalized diagnosis. In order to establish high accuracy and confidence in personalized diagnosis, massive amounts of distributed, heterogeneous, correlated and high-dimensional patient data from different sources such as wearable sensors, mobile applications, Electronic Health Record (EHR) databases, etc. need to be processed. This requires learning both locally and globally due to privacy constraints and/or the distributed nature of the multi-modal medical data. In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally collected data instances, and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, these guarantees are asymptotic. None of these existing works provide confidence estimates about the issued predictions or rate of learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long run (asymptotic) and short run (rate of learning) performance guarantees. Moreover, we show that our proposed method outperforms all existing ensemble learning techniques, even in the presence of concept drift.


Predicting 30-Day Risk and Cost of "All-Cause" Hospital Readmissions

AAAI Conferences

The hospital readmission rate of patients within 30 days after discharge is broadly accepted as a healthcare quality measure and cost driver in the United States. The ability to estimate hospitalization costs alongside 30-day risk stratification for such readmissions provides additional benefit for accountable care, now a global issue and foundation for the U.S. government mandate under the Affordable Care Act. Recent data mining efforts either predict healthcare costs or risk of hospital readmission, but not both. In this paper we present a dual predictive modeling effort that utilizes healthcare data to predict the risk and cost of any hospital readmission ("all-cause"). For this purpose, we explore machine learning algorithms to accurately predict healthcare costs and risk of 30-day readmission. Results on risk prediction for "all-cause" readmission compared to the standardized readmission tool (LACE) are promising, and the proposed techniques for cost prediction consistently outperform baseline models and demonstrate substantially lower mean absolute error (MAE).


Automatic Label Correction and Appliance Prioritization in Single Household Electricity Disaggregation

AAAI Conferences

Electricity disaggregation focuses on classification of individual appliances by monitoring aggregate electrical signals. In this paper we present a novel algorithm to automatically correct labels, discard contaminated training samples, and boost signal to noise ratio through high frequency noise reduction. We also propose a method for prioritized classification which classifies appliances with the most intense signals first. When tested on four houses in Kaggle's Belkin dataset, these methods automatically relabel over 77% of all training samples and decrease error rate by an average of 45% in both real power and high frequency noise classification.


Active Perception for Cyber Intrusion Detection and Defense

AAAI Conferences

Most modern network-based intrusion detection systems (IDSs) passively monitor network traffic to identify possible attacks through known vectors. Though useful, this approach has widely known high false positive rates, often causing administrators to suffer from a "cry wolf effect," where they ignore all warnings because so many have been false. In this paper, we focus on a method to reduce this effect using an idea borrowed from computer vision and neuroscience called active perception. Our approach is informed by theoretical ideas from decision theory and recent research results in neuroscience. The active perception agent allocates computational and sensing resources to (approximately) optimize its Value of Information. To do this, it draws on models to direct sensors towards phenomena of greatest interest to inform decisions about cyber defense actions. By identifying critical network assets, the organization's mission measures self-interest (and value of information). This model enables the system to follow leads from inexpensive, inaccurate alerts with targeted use of expensive, accurate sensors. This allows the deployment of sensors to build structured interpretations of situations. From these, an organization can meet mission-centered decision-making requirements with calibrated responses proportional to the likelihood of true detection and degree of threat.


Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply

AAAI Conferences

We study the automatic reply of email business messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with non-lemmatized selection of verbs and nouns, adjectives and adverbs was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.