Goto

Collaborating Authors

 Decision Tree Learning


Network Activities Recognition and Analysis Based on Supervised Machine Learning Classification Methods Using J48 and Na\"ive Bayes Algorithm

arXiv.org Artificial Intelligence

Network activities recognition has always been a significant component of intrusion detection. However, with the increasing network traffic flow and complexity of network behavior, it is becoming more and more difficult to identify the specific behavior quickly and accurately by user network monitoring software. It also requires the system security staff to pay close attention to the latest intrusion monitoring technology and methods. All of these greatly increase the difficulty and complexity of intrusion detection tasks. The application of machine learning methods based on supervised classification technology would help to liberate the network security staff from the heavy and boring tasks. A finetuned model would accurately recognize user behavior, which could provide persistent monitoring with a relative high accuracy and good adaptability. Finally, the results of network activities recognition by J48 and Na\"ive Bayes algorithms are introduced and evaluated.


Lattice partition recovery with dyadic CART

arXiv.org Machine Learning

We study piece-wise constant signals corrupted by additive Gaussian noise over a $d$-dimensional lattice. Data of this form naturally arise in a host of applications, and the tasks of signal detection or testing, de-noising and estimation have been studied extensively in the statistical and signal processing literature. In this paper we consider instead the problem of partition recovery, i.e.~of estimating the partition of the lattice induced by the constancy regions of the unknown signal, using the computationally-efficient dyadic classification and regression tree (DCART) methodology proposed by \citep{donoho1997cart}. We prove that, under appropriate regularity conditions on the shape of the partition elements, a DCART-based procedure consistently estimates the underlying partition at a rate of order $\sigma^2 k^* \log (N)/\kappa^2$, where $k^*$ is the minimal number of rectangular sub-graphs obtained using recursive dyadic partitions supporting the signal partition, $\sigma^2$ is the noise variance, $\kappa$ is the minimal magnitude of the signal difference among contiguous elements of the partition and $N$ is the size of the lattice. Furthermore, under stronger assumptions, our method attains a sharper estimation error of order $\sigma^2\log(N)/\kappa^2$, independent of $ k^*$, which we show to be minimax rate optimal. Our theoretical guarantees further extend to the partition estimator based on the optimal regression tree estimator (ORT) of \cite{chatterjee2019adaptive} and to the one obtained through an NP-hard exhaustive search method. We corroborate our theoretical findings and the effectiveness of DCART for partition recovery in simulations.


Intellige: A User-Facing Model Explainer for Narrative Explanations

arXiv.org Machine Learning

Predictive machine learning models often lack interpretability, resulting in low trust from model end users despite having high predictive performance. While many model interpretation approaches return top important features to help interpret model predictions, these top features may not be well-organized or intuitive to end users, which limits model adoption rates. In this paper, we propose Intellige, a user-facing model explainer that creates user-digestible interpretations and insights reflecting the rationale behind model predictions. Intellige builds an end-to-end pipeline from machine learning platforms to end user platforms, and provides users with an interface for implementing model interpretation approaches and for customizing narrative insights. Intellige is a platform consisting of four components: Model Importer, Model Interpreter, Narrative Generator, and Narrative Exporter. We describe these components, and then demonstrate the effectiveness of Intellige through use cases at LinkedIn. Quantitative performance analyses indicate that Intellige's narrative insights lead to lifts in adoption rates of predictive model recommendations, as well as to increases in downstream key metrics such as revenue when compared to previous approaches, while qualitative analyses indicate positive feedback from end users.


Random forest-based prediction of stroke outcome

#artificialintelligence

We research into the clinical, biochemical and neuroimaging factors associated with the outcome of stroke patients to generate a predictive model using machine learning techniques for prediction of mortality and morbidity 3-months after admission. The dataset consisted of patients with ischemic stroke (IS) and non-traumatic intracerebral hemorrhage (ICH) admitted to Stroke Unit of a European Tertiary Hospital prospectively registered. We identified the main variables for machine learning Random Forest (RF), generating a predictive model that can estimate patient mortality/morbidity according to the following groups: (1) IS + ICH, (2) IS, and (3) ICH. A total of 6022 patients were included: 4922 (mean age 71.9 ± 13.8 years) with IS and 1100 (mean age 73.3 ± 13.1 years) with ICH. NIHSS at 24, 48 h and axillary temperature at admission were the most important variables to consider for evolution of patients at 3-months. IS + ICH group was the most stable for mortality prediction [0.904 ± 0.025 of area under the receiver operating characteristics curve (AUC)]. IS group presented similar results, although variability between experiments was slightly higher (0.909 ± 0.032 of AUC). ICH group was the one in which RF had more problems to make adequate predictions (0.9837 vs. 0.7104 of AUC). There were no major differences between IS and IS + ICH groups according to morbidity prediction (0.738 and 0.755 of AUC) but, after checking normality with a Shapiro Wilk test with the null hypothesis that the data follow a normal distribution, it was rejected with W = 0.93546 (p-value < 2.2e−16). Conditions required for a parametric test do not hold, and we performed a paired Wilcoxon Test assuming the null hypothesis that all the groups have the same performance. The null hypothesis was rejected with a value < 2.2e−16, so there are statistical differences between IS and ICH groups. In conclusion, machine learning algorithms RF can be effectively used in stroke patients for long-term outcome prediction of mortality and morbidity.


SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

arXiv.org Machine Learning

Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.


3 Key Differences Between Random Forests and GBDT

#artificialintelligence

Random forest and gradient boosted decision trees (GBDT) are the two most commonly used machine learning algorithms. Both are ensemble models which means they combine many weak learners to get a strong one. Although both random forest and GBDT use the same weak learner, they are highly different algorithms. In this article, we will focus on 3 key differences between these ensemble techniques. Decision trees are used as the weak learner in both algorithms.


On Explaining Random Forests with SAT

arXiv.org Artificial Intelligence

Random Forest (RFs) are among the most widely used Machine Learning (ML) classifiers. Even though RFs are not interpretable, there are no dedicated non-heuristic approaches for computing explanations of RFs. Moreover, there is recent work on polynomial algorithms for explaining ML models, including naive Bayes classifiers. Hence, one question is whether finding explanations of RFs can be solved in polynomial time. This paper answers this question negatively, by proving that computing one PI-explanation of an RF is D^P-complete. Furthermore, the paper proposes a propositional encoding for computing explanations of RFs, thus enabling finding PI-explanations with a SAT solver. This contrasts with earlier work on explaining boosted trees (BTs) and neural networks (NNs), which requires encodings based on SMT/MILP. Experimental results, obtained on a wide range of publicly available datasets, demontrate that the proposed SAT-based approach scales to RFs of sizes common in practical applications. Perhaps more importantly, the experimental results demonstrate that, for the vast majority of examples considered, the SAT-based approach proposed in this paper significantly outperforms existing heuristic approaches.


Segmentation of high dimensional means over multi-dimensional change points and connections to regression trees

arXiv.org Machine Learning

This article is motivated by the objective of providing a new analytically tractable and fully frequentist framework to characterize and implement regression trees while also allowing a multivariate (potentially high dimensional) response. The connection to regression trees is made by a high dimensional model with dynamic mean vectors over multi-dimensional change axes. Our theoretical analysis is carried out under a single two dimensional change point setting. An optimal rate of convergence of the proposed estimator is obtained, which in turn allows existence of limiting distributions. Distributional behavior of change point estimates are split into two distinct regimes, the limiting distributions under each regime is then characterized, in turn allowing construction of asymptotically valid confidence intervals for $2d$-location of change. All results are obtained under a high dimensional scaling $s\log^2 p=o(T_wT_h),$ where $p$ is the response dimension, $s$ is a sparsity parameter, and $T_w,T_h$ are sampling periods along change axes. We characterize full regression trees by defining a multiple multi-dimensional change point model. Natural extensions of the single $2d$-change point estimation methodology are provided. Two applications, first on segmentation of {\it Infra-red astronomy satellite (IRAS)} data and second to segmentation of digital images are provided. Methodology and theoretical results are supported with monte-carlo simulations.


Fed-EINI: An Efficient and Interpretable Inference Framework for Decision Tree Ensembles in Federated Learning

arXiv.org Artificial Intelligence

The increasing concerns about data privacy and security drives the emergence of a new field of studying privacy-preserving machine learning from isolated data sources, i.e., \textit{federated learning}. Vertical federated learning, where different parties hold different features for common users, has a great potential of driving a more variety of business cooperation among enterprises in different fields. Decision tree models especially decision tree ensembles are a class of widely applied powerful machine learning models with high interpretability and modeling efficiency. However, the interpretability are compromised in these works such as SecureBoost since the feature names are not exposed to avoid possible data breaches due to the unprotected decision path. In this paper, we shall propose Fed-EINI, an efficient and interpretable inference framework for federated decision tree models with only one round of multi-party communication. We shall compute the candidate sets of leaf nodes based on the local data at each party in parallel, followed by securely computing the weight of the only leaf node in the intersection of the candidate sets. We propose to protect the decision path by the efficient additively homomorphic encryption method, which allows the disclosure of feature names and thus makes the federated decision trees interpretable. The advantages of Fed-EINI will be demonstrated through theoretical analysis and extensive numerical results. Experiments show that the inference efficiency is improved by over $50\%$ in average.


Decision Tree -- Implemented from scratch

#artificialintelligence

I want to expand this work into a series of µ-tutorial videos. If you're interested, please subscribe to my newsletter to stay in touch. It is not hard to be under an impression that the world is all about neural networks these days when it comes to making models. Many teams seem to brag about super-cool architectures as if getting enough quality data was straightforward, GPU racks were open 24/7 (for free), and their customer's patience was set to infinity. In this article, we will present one of the most basic machine-learning algorithms known as a Decision Tree.