Accuracy
[-1,1]: Random Forests and Decision Trees * BioinformationX
Here we will build a Random Forest in Python. Because Python makes it easy to assemble very complex models from one or two libraries, it is better to cover the basics before tackling untameable beasts. Let us start with a single "decision tree" (a simple problem). After that we will extend what we have learned to build a Random Forest and apply it to a real problem. To warm up, we will start with a toy problem that has only two features and two classes.
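As a rough illustration of the toy setting described above (two features, two classes), here is a minimal sketch of a single decision tree in plain Python. It is not the tutorial's own code: the data, the Gini-impurity split criterion, the midpoint thresholds, and the names gini, best_split, build_tree and predict are all assumptions chosen for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold, weighted impurity) of the best axis-aligned split.

    feature is None when no split strictly improves on the unsplit node.
    """
    best = (None, None, gini(y))
    n = len(y)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Grow a tree recursively; a leaf is a class label, a node is a tuple."""
    feature, threshold, _ = best_split(X, y)
    if feature is None or depth == max_depth:
        return Counter(y).most_common(1)[0][0]  # majority class
    left = [(r, l) for r, l in zip(X, y) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(X, y) if r[feature] > threshold]
    return (
        feature,
        threshold,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

# Hypothetical toy data: class 0 sits in the lower-left corner, class 1 elsewhere.
X = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 1), (5, 2), (1, 4), (2, 5), (4, 4), (5, 5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
tree = build_tree(X, y)
print(tree)
print(predict(tree, (1.5, 1.5)), predict(tree, (4.5, 1.5)))
```

A Random Forest would train many such trees on bootstrap samples of the data, each considering a random subset of features at every split, and combine their votes.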
BeSense: Leveraging WiFi Channel Data and Computational Intelligence for Behavior Analysis
Gu, Yu, Zhang, Xiang, Liu, Zhi, Ren, Fuji
Ever-evolving information technology has gradually bound humans and computers closely together. Understanding user behavior has become a key enabler in many fields such as sedentary-related healthcare, human-computer interaction (HCI) and affective computing. Traditional sensor-based and vision-based approaches to user behavior analysis are generally obtrusive, which hinders their use in the real world. Therefore, in this article, we first introduce the WiFi signal as a new source, instead of sensors and vision, for unobtrusive user behavior analysis. Then we design BeSense, a contactless behavior analysis system leveraging signal processing and computational intelligence over WiFi channel state information (CSI). We prototype BeSense on commodity low-cost WiFi devices and evaluate its performance in real-world environments. Experimental results have verified its effectiveness in recognizing user behaviors.
Deep Bayesian Gaussian Processes for Uncertainty Estimation in Electronic Health Records
Li, Yikuan, Rao, Shishir, Hassaine, Abdelaali, Ramakrishnan, Rema, Zhu, Yajie, Canoy, Dexter, Salimi-Khorshidi, Gholamreza, Lukasiewicz, Thomas, Rahimi, Kazem
One major impediment to the wider use of deep learning for clinical decision making is the difficulty of assigning a level of confidence to model predictions. Currently, deep Bayesian neural networks and sparse Gaussian processes are the two main scalable uncertainty estimation methods. However, deep Bayesian neural networks suffer from a lack of expressiveness, and more expressive models such as deep kernel learning, which is an extension of sparse Gaussian processes, capture only the uncertainty from the higher-level latent space. Therefore, the deep learning model underlying it lacks interpretability and ignores uncertainty from the raw data. In this paper, we merge features of the deep Bayesian learning framework with deep kernel learning to leverage the strengths of both methods for more comprehensive uncertainty estimation. Through a series of experiments on predicting the first incidence of heart failure, diabetes and depression applied to large-scale electronic medical records, we demonstrate that our method is better at capturing uncertainty than both Gaussian processes and deep Bayesian neural networks in terms of indicating data insufficiency and distinguishing true positive and false positive predictions, with a comparable generalisation performance. Furthermore, by assessing the accuracy and area under the receiver operating characteristic curve over the predictive probability, we show that our method is less susceptible to making overconfident predictions, especially for the minority class in imbalanced datasets. Finally, we demonstrate how uncertainty information derived by the model can inform risk factor analysis towards model interpretability.
robROSE: A robust approach for dealing with imbalanced data in fraud detection
Baesens, Bart, Höppner, Sebastiaan, Ortner, Irene, Verdonck, Tim
A major challenge when trying to detect fraud is that the fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than 0.5% of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that solve the problem of imbalanced data by creating synthetic samples that mimic the minority class. A frequent problem when analyzing real data is the presence of anomalies or outliers. When such atypical observations are present in the data, most oversampling techniques are prone to creating synthetic samples that distort the detection algorithm and spoil the resulting analysis. A useful tool for anomaly detection is robust statistics, which aims to find the outliers by first fitting the majority of the data and then flagging observations that deviate from it. In this paper, we present a robust version of ROSE, called robROSE, which combines several promising approaches to cope simultaneously with the problem of imbalanced data and the presence of outliers. The proposed method succeeds in enhancing the presence of the fraud cases while ignoring anomalies. The good performance of our new sampling technique is illustrated on simulated and real data sets, and it is shown that robROSE can provide better insight into the structure of the data. The source code of the robROSE algorithm is made freely available.
Anomaly Detection with MIDAS
Anomaly detection in graphs is a critical problem for finding strange behaviors in systems, with applications such as intrusion detection, fake ratings, and financial fraud. To minimize the effect of malicious activities as soon as possible, we need to detect anomalies in real time, identifying each incoming edge and deciding whether it is anomalous. Existing methods process edge streams in an online manner and can miss a large amount of suspicious activity; in contrast, MIDAS detects microcluster anomalies in edge streams using constant time and memory, providing theoretical bounds on the false positive probability. The main contributions of MIDAS are: 1. Streaming Microcluster Detection, a novel streaming approach for detecting microcluster anomalies; 2. Theoretical Guarantee, on the false positive probability of MIDAS; 3. Effectiveness, experimental results show that MIDAS outperforms the baseline approaches by 42%-48% in accuracy and processes the data 162–644 times faster. Compared with previous approaches that detect anomalies in edge streams, MIDAS adds features such as Microcluster Detection and a Guarantee on the false positive probability, while keeping the other elements of those approaches.
Unlocking the Power of Artificial Intelligence and Big Data in Medicine
Most of the daily news and recently published scientific papers on research, innovations, and applications in artificial intelligence (AI) refer to what is known as machine learning: algorithms using massive amounts of data and various methodologies to find patterns, support decisions, make predictions, or, for the deep learning part, self-identify important features in data. However, AI is a complex concept to grasp, and most people have little understanding of what it really is. AI was founded as an academic discipline in 1956 and, despite its youth, already has a rich history [1,2]. In more than 60 years of exploration and progress, AI has become a large field of research and development involving multidisciplinary approaches to address many challenges, from theoretical frameworks, methods, and tools to real implementations, risk analysis, and impact measures. The definition of AI is a moving target and changes over time with the evolution of the field. Since its early days, the field of AI has allowed the development of many techniques supporting decisions and predictions of the kind usually made by humans. As early as 1958, a perceptron was expected to be able "to walk, talk, see, write, reproduce itself and be conscious of its existence," which led to a large scientific controversy between neural network and symbolic reasoning approaches [3].
Deep Synthetic Minority Over-Sampling Technique
Mansourifar, Hadi, Shi, Weidong
Synthetic Minority Over-sampling Technique (SMOTE) is the most popular over-sampling method. However, its random nature makes the synthesized data, and even the imbalanced classification results, unstable. This means that running SMOTE n different times yields n different sets of synthesized instances with n different classification results. To address this problem, we adapt the SMOTE idea to a deep learning architecture. In this method, a deep neural network regression model is trained on the inputs and outputs of traditional SMOTE. The inputs of the proposed deep regression model are two randomly chosen data points, which are concatenated to form a double-size vector. The outputs of this model are the corresponding randomly interpolated data points between the two randomly chosen vectors, with the original dimension. The experimental results show that Deep SMOTE can outperform traditional SMOTE in terms of precision, F1 score and Area Under Curve (AUC) in the majority of test cases.
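The input/output construction described in this abstract can be sketched as follows. This is only an illustration of how such training pairs might be generated, not the authors' implementation: the function names, the toy minority data, and the uniform choice of both points (classic SMOTE usually picks the second point among the nearest neighbours of the first) are assumptions, and the regression network itself is not shown.

```python
import random

def smote_interpolate(a, b, rng):
    """Classic SMOTE step: a random point on the segment between a and b."""
    gap = rng.random()  # uniform in [0, 1)
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

def make_training_pairs(minority, n_pairs, seed=0):
    """Build (input, target) pairs for a regression model.

    input  : two minority points concatenated into one double-size vector
    target : a randomly interpolated point between them (original dimension)
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for _ in range(n_pairs):
        a, b = rng.sample(minority, 2)  # two distinct, randomly chosen points
        inputs.append(list(a) + list(b))
        targets.append(smote_interpolate(a, b, rng))
    return inputs, targets

# Hypothetical minority-class samples with two features each.
minority = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.95), (0.3, 0.7)]
inputs, targets = make_training_pairs(minority, n_pairs=5)
for vec, tgt in zip(inputs, targets):
    print(vec, "->", tgt)
```

Once a regression model has been fitted to such pairs, it maps a given pair of points to a synthetic point deterministically, which is the stability property the abstract is after.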
Scaling up Kernel Ridge Regression via Locality Sensitive Hashing
Kapralov, Michael, Nouri, Navid, Razenshteyn, Ilya, Velingker, Ameya, Zandieh, Amir
Random binning features, introduced in the seminal paper of Rahimi and Recht (2007), are an efficient method for approximating a kernel matrix using locality sensitive hashing. Random binning features provide a very simple and efficient way of approximating the Laplace kernel but unfortunately do not apply to many important classes of kernels, notably ones that generate smooth Gaussian processes, such as the Gaussian kernel and Matern kernel. In this paper, we introduce a simple weighted version of random binning features and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression. Experiments on large-scale regression datasets show that our method outperforms the random Fourier features method in accuracy.
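For readers unfamiliar with the baseline, the following is a minimal sketch of the original (unweighted) random binning idea for the Laplace kernel exp(-gamma * ||x - y||_1); it does not implement the weighted variant this paper proposes. Each random grid draws a per-dimension pitch from a Gamma(shape 2, scale 1/gamma) distribution and a uniform shift, and the kernel value is estimated as the fraction of grids in which two points land in the same bin. The function names and test points are assumptions made for illustration.

```python
import math
import random

def sample_grids(dim, n_grids, gamma, rng):
    """Draw n_grids random grids: per-dimension pitch and shift for each."""
    grids = []
    for _ in range(n_grids):
        # Pitch ~ Gamma(shape=2, scale=1/gamma) matches the Laplace kernel.
        pitches = [rng.gammavariate(2.0, 1.0 / gamma) for _ in range(dim)]
        shifts = [rng.uniform(0.0, p) for p in pitches]
        grids.append((pitches, shifts))
    return grids

def bin_ids(x, grids):
    """Hash x to one bin per grid (a tuple of integer cell coordinates)."""
    return [
        tuple(math.floor((xi - s) / p) for xi, p, s in zip(x, pitches, shifts))
        for pitches, shifts in grids
    ]

def approx_kernel(x, y, grids):
    """Fraction of grids in which x and y share a bin."""
    bx, by = bin_ids(x, grids), bin_ids(y, grids)
    return sum(a == b for a, b in zip(bx, by)) / len(grids)

def laplace_kernel(x, y, gamma):
    """Exact Laplace kernel, for comparison."""
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y)))

rng = random.Random(0)
grids = sample_grids(dim=2, n_grids=5000, gamma=1.0, rng=rng)
x, y = (0.2, 0.5), (0.6, 0.1)
print("approx:", approx_kernel(x, y, grids))
print("exact: ", laplace_kernel(x, y, 1.0))
```

Because points that share a bin are treated as similar, the bin identifiers act as a locality sensitive hash, and the approximation tightens as the number of grids grows.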
BoostTree and BoostForest for Ensemble Learning
Zhao, Changming, Wu, Dongrui, Huang, Jian, Yuan, Ye, Zhang, Hai-Tao
Bootstrap aggregation (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite learner. This article proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree by gradient boosting, which trains a linear or nonlinear model at each node. When a new sample comes in, BoostTree first sorts it down to a leaf, then computes the final prediction by summing up the outputs of all models along the path from the root node to that leaf. BoostTree achieves high randomness (diversity) by sampling its parameters randomly from a parameter pool, and selecting a subset of features randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest is compared with four classical ensemble learning approaches on 30 classification and regression datasets, demonstrating that it can generate more accurate and more robust composite learners.
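The prediction rule described in this abstract (sort the sample down to a leaf, then sum the outputs of all models along the root-to-leaf path) can be sketched as below. The Node structure, its field names, and the hand-written node models are hypothetical; fitting the node models by gradient boosting, and the random parameter and feature sampling, are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Node:
    """One tree node holding its own small model (hypothetical structure)."""
    model: Callable[[Sequence[float]], float]  # output of this node's model
    feature: Optional[int] = None              # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def predict(root: Node, x: Sequence[float]) -> float:
    """Sum the model outputs of every node on the path from root to leaf."""
    total = 0.0
    node = root
    while node is not None:
        total += node.model(x)
        if node.feature is None:  # reached a leaf
            break
        node = node.left if x[node.feature] <= node.threshold else node.right
    return total

# Hand-built example tree: a constant model at the root, one model per leaf.
root = Node(
    model=lambda x: 1.0,
    feature=0,
    threshold=0.5,
    left=Node(model=lambda x: 0.5 * x[1]),  # linear model in feature 1
    right=Node(model=lambda x: -2.0),       # constant correction
)
print(predict(root, [0.2, 4.0]))  # root (1.0) + left leaf (0.5 * 4.0)
print(predict(root, [0.9, 4.0]))  # root (1.0) + right leaf (-2.0)
```

A forest in this style would average (or vote over) the predictions of many such trees, each built on a bootstrap sample of the training data.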
Understanding Voting Outcomes through Data Science
After the surprising results of the 2016 presidential election, I wanted to better understand the socio-economic and cultural factors that played a role in voting behavior. With the election results in the books, I thought it would be fun to reverse-engineer a predictive model of voting behavior based on some of the widely available county-level data sets. For example, if you want to answer the question "how could the election have been different if the percentage of people with at least a bachelor's degree had been 2% higher nationwide?" you can simply toggle that parameter up to 1.02 and click "Submit" to find out. The predictions are driven by a random forest classification model that has been tuned and trained on 71 distinct county-level attributes. Using real data, the model has a predictive accuracy of 94.6% and an ROC AUC score of 96%.