Decision Tree Learning
An Open-Source Tool for Classification Models in Resource-Constrained Hardware
da Silva, Lucas Tsutsui, Souza, Vinicius M. A., Batista, Gustavo E. A. P. A.
Applications that need to sense, measure, and gather real-time information from the environment frequently face three main restrictions: power consumption, cost, and lack of infrastructure. Most of the challenges imposed by these limitations can be better addressed by embedding Machine Learning (ML) classifiers in the hardware that senses the environment, creating smart sensors able to interpret the low-level data stream. These smart sensors are also more power-efficient, since they eliminate the need to communicate all the raw data. However, for this approach to be cost-effective, we need highly efficient classifiers suitable for execution on resource-constrained hardware, such as low-power microcontrollers. In this paper, we present an open-source tool named EmbML (Embedded Machine Learning) that implements a pipeline to develop classifiers for resource-constrained hardware. We describe its implementation details and provide a comprehensive analysis of its classifiers considering accuracy, classification time, and memory usage. Moreover, we compare the performance of its classifiers with classifiers produced by related tools to demonstrate that our tool provides a diverse set of classification algorithms that are both compact and accurate.
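To make the idea of embedding a classifier in a microcontroller concrete, the following is a minimal sketch of turning a trained decision tree into plain C source made of nested if/else statements. It is an illustration only, not EmbML's actual pipeline or API: the nested-dictionary tree format and the names `tree_to_c`, `export_classifier`, and `predict` are assumptions introduced here.

```python
# Hypothetical tree format (an assumption, not EmbML's): internal nodes hold a
# feature index, a threshold, and two subtrees; leaves hold a class label.


def tree_to_c(node, depth=1):
    """Return C source lines that implement the tree as nested if/else."""
    pad = "    " * depth
    if "label" in node:  # leaf: emit the predicted class
        return [f"{pad}return {node['label']};"]
    threshold = repr(float(node["threshold"]))  # e.g. 2 -> "2.0"
    lines = [f"{pad}if (x[{node['feature']}] <= {threshold}f) {{"]
    lines += tree_to_c(node["left"], depth + 1)
    lines.append(f"{pad}}} else {{")
    lines += tree_to_c(node["right"], depth + 1)
    lines.append(f"{pad}}}")
    return lines


def export_classifier(tree, name="classify"):
    """Wrap the generated branches in a C function taking a feature array."""
    body = "\n".join(tree_to_c(tree))
    return f"int {name}(const float *x) {{\n{body}\n}}\n"


def predict(node, x):
    """Reference prediction in Python, useful for checking the exported code."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


toy_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 2,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}

c_source = export_classifier(toy_tree)
print(c_source)
```

Generating branch code rather than storing the tree in a data structure keeps the model in program memory and avoids any runtime dependency, which is one plausible reason such exports suit low-power hardware.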
Comparing interpretability and explainability for feature selection
Dunn, Jack, Mingardi, Luca, Zhuo, Ying Daisy
A common approach for feature selection is to examine the variable importance scores for a machine learning model, as a way to understand which features are the most relevant for making predictions. Given the significance of feature selection, it is crucial for the calculated importance scores to reflect reality. Falsely overestimating the importance of irrelevant features can lead to false discoveries, while underestimating the importance of relevant features may lead us to discard important features, resulting in poor model performance. Additionally, black-box models like XGBoost provide state-of-the-art predictive performance, but cannot be easily understood by humans, and thus we rely on variable importance scores or methods for explainability like SHAP to offer insight into their behavior. In this paper, we investigate the performance of variable importance as a feature selection method across various black-box and interpretable machine learning methods. We compare the ability of CART, Optimal Trees, XGBoost and SHAP to correctly identify the relevant subset of variables across a number of experiments. The results show that regardless of whether we use the native variable importance method or SHAP, XGBoost fails to clearly distinguish between relevant and irrelevant features. On the other hand, the interpretable methods are able to correctly and efficiently identify irrelevant features, and thus offer significantly better performance for feature selection.
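As a rough illustration of impurity-based variable importance used for feature selection, the sketch below scores each feature by the largest Gini impurity decrease that a single threshold split on it can achieve, then ranks the features. This is a depth-one simplification of CART-style importance on synthetic data invented here; it does not reproduce the paper's experiments, and the function names are assumptions.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(values, labels):
    """Largest impurity decrease from any single threshold split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


# Synthetic data (an assumption for illustration): only feature 0 is relevant,
# features 1 and 2 are pure noise.
rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

importances = [best_gini_decrease([row[j] for row in X], y) for j in range(3)]
ranking = sorted(range(3), key=lambda j: importances[j], reverse=True)
print("importances:", importances)
print("ranking (most important first):", ranking)
```

Note that even the noise features receive a small positive score, because the best of many candidate splits always reduces impurity a little on a finite sample. A selection rule therefore needs a margin between relevant and irrelevant scores, which is the property the abstract evaluates.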
Learning stochastic decision trees
Blanc, Guy, Lange, Jane, Tan, Li-Yang
We give a quasipolynomial-time algorithm for learning stochastic decision trees that is optimally resilient to adversarial noise. Given an $\eta$-corrupted set of uniform random samples labeled by a size-$s$ stochastic decision tree, our algorithm runs in time $n^{O(\log(s/\varepsilon)/\varepsilon^2)}$ and returns a hypothesis with error within an additive $2\eta + \varepsilon$ of the Bayes optimal. An additive $2\eta$ is the information-theoretic minimum. Previously no non-trivial algorithm with a guarantee of $O(\eta) + \varepsilon$ was known, even for weaker noise models. Our algorithm is furthermore proper, returning a hypothesis that is itself a decision tree; previously no such algorithm was known even in the noiseless setting.
An Extensive Analytical Approach on Human Resources using Random Forest Algorithm
Papineni, Swarajya Lakshmi V., Reddy, A. Mallikarjuna, Yarlagadda, Sudeepti, Yarlagadda, Snigdha, Akkinen, Haritha
A current job survey shows that most software employees plan to change their job role, attracted by the high pay of newer roles such as data scientist, business analyst, and positions in artificial intelligence. The survey also indicates that work-life imbalance, low pay, uneven shifts, and many other factors lead employees to consider changing jobs. In this paper, to help a company organise its human resources efficiently, the proposed system builds a model with a random forest algorithm that considers different employee parameters. The model helps the HR department retain employees by identifying gaps, so that the organisation can run smoothly with a good employee retention ratio. This combination of HR and data science can improve the productivity, collaboration, and well-being of the organisation's employees. It also helps to develop strategies that affect employee performance in terms of external and social factors.
Accelerating Entrepreneurial Decision-Making Through Hybrid Intelligence
AI: Artificial Intelligence; AGI: Artificial General Intelligence; ANN: Artificial Neural Network; ANOVA: Analysis of Variance; ANT: Actor Network Theory; API: Application Programming Interface; APX: Amsterdam Power Exchange; AVE: Average Variance Extracted; BU: Business Unit; CART: Classification and Regression Tree; CBMV: Crowd-based Business Model Validation; CR: Composite Reliability; CT: Computed Tomography; CVC: Corporate Venture Capital; DR: Design Requirement; DP: Design Principle; DSR: Design Science Research; DSS: Decision Support System; EEX: European Energy Exchange; FsQCA: Fuzzy-Set Qualitative Comparative Analysis; GUI: Graphical User Interface; HI-DSS: Hybrid Intelligence Decision Support System; HIT: Human Intelligence Task; IoT: Internet of Things; IS: Information System; IT: Information Technology; MCC: Matthews Correlation Coefficient; ML: Machine Learning; OCT: Opportunity Creation Theory; OGEMA 2.0: Open Gateway Energy Management 2.0; OS: Operating System; R&D: Research & Development; RE: Renewable Energies; RQ: Research Question; SVM: Support Vector Machine; SSD: Solid-State Drive; SDK: Software Development Kit; TCP/IP: Transmission Control Protocol/Internet Protocol; TCT: Transaction Cost Theory; UI: User Interface; VaR: Value at Risk; VC: Venture Capital; VPP: Virtual Power Plant
Universal Consistency of Decision Trees in High Dimensions
This paper shows that decision trees constructed with Classification and Regression Trees (CART) methodology are universally consistent in an additive model context, even when the number of predictor variables scales exponentially with the sample size, under certain $1$-norm sparsity constraints. The consistency is universal in the sense that there are no a priori assumptions on the distribution of the predictor variables. Amazingly, this adaptivity to (approximate or exact) sparsity is achieved with a single tree, as opposed to what might be expected for an ensemble. Finally, we show that these qualitative properties of individual trees are inherited by Breiman's random forests. Another surprise is that consistency holds even when the "mtry" tuning parameter vanishes as a fraction of the number of predictor variables, thus speeding up computation of the forest. A key step in the analysis is the establishment of an oracle inequality, which precisely characterizes the goodness-of-fit and complexity tradeoff for a misspecified model.
Learning Linear Temporal Properties from Noisy Data: A MaxSAT Approach
Gaglione, Jean-Raphaël, Neider, Daniel, Roy, Rajarshi, Topcu, Ufuk, Xu, Zhe
We address the problem of inferring descriptions of system behavior using Linear Temporal Logic (LTL) from a finite set of positive and negative examples. Most of the existing approaches for solving such a task rely on predefined templates for guiding the structure of the inferred formula. The approaches that can infer arbitrary LTL formulas, on the other hand, are not robust to noise in the data. To alleviate such limitations, we devise two algorithms for inferring concise LTL formulas even in the presence of noise. Our first algorithm infers minimal LTL formulas by reducing the inference problem to a problem in maximum satisfiability and then using off-the-shelf MaxSAT solvers to find a solution. To the best of our knowledge, we are the first to incorporate the usage of MaxSAT solvers for inferring formulas in LTL. Our second learning algorithm relies on the first algorithm to derive a decision tree over LTL formulas based on a decision tree learning algorithm. We have implemented both our algorithms and verified that our algorithms are efficient in extracting concise LTL descriptions even in the presence of noise.
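The noise-tolerant learning objective described above can be sketched with a small example: evaluate candidate LTL formulas on labeled traces, count misclassifications, and keep the smallest formula whose error rate stays within a tolerance. This sketch assumes finite-trace semantics and replaces the MaxSAT encoding with brute-force search over a hand-written candidate list. The tuple encoding of formulas and the names `holds`, `size`, `errors`, and `kappa` are assumptions made for illustration, not taken from the paper.

```python
def holds(f, trace, i=0):
    """Evaluate formula f at position i of a non-empty finite trace.

    A trace is a list of sets of atomic propositions. Formulas are tuples:
    ("ap", name), ("not", f), ("and", f, g), ("or", f, g),
    ("X", f), ("F", f), ("G", f), ("U", f, g).
    """
    op = f[0]
    if op == "ap":
        return f[1] in trace[i]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "or":
        return holds(f[1], trace, i) or holds(f[2], trace, i)
    if op == "X":  # next: false at the last position of a finite trace
        return i + 1 < len(trace) and holds(f[1], trace, i + 1)
    if op == "F":  # eventually
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "G":  # always
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":  # until
        return any(
            holds(f[2], trace, j)
            and all(holds(f[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


def size(f):
    """Number of nodes in the formula's syntax tree."""
    return 1 if f[0] == "ap" else 1 + sum(size(g) for g in f[1:])


def errors(f, examples):
    """Count examples whose label disagrees with the formula."""
    return sum(holds(f, trace) != label for trace, label in examples)


# Toy sample: positives eventually see "q"; the last positive is mislabeled noise.
examples = [
    ([{"p"}, {"p"}, {"q"}], True),
    ([{"q"}], True),
    ([{"p"}, {"q"}], True),
    ([{"p"}, {"p"}], True),   # noisy label
    ([{"p"}], False),
    ([{"p"}, {"p"}, {"p"}], False),
]
candidates = [("ap", "q"), ("G", ("ap", "p")), ("F", ("ap", "q"))]

kappa = 0.2  # tolerated fraction of misclassified examples (assumed value)
within = [f for f in candidates if errors(f, examples) / len(examples) <= kappa]
best = min(within, key=size)
print(best, errors(best, examples))
```

With zero tolerance no candidate would fit this sample, since the mislabeled trace contradicts the intended property. Allowing a bounded number of violated examples is what lets a concise formula survive noisy data.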
Land Cover Classification
Earth Engine, also referred to as Google Earth Engine, provides a cloud-computing platform for remote sensing tasks such as satellite image processing. We can write Earth Engine code in JavaScript or Python. Many kinds of remote sensing analyses are available to run. In this article, we will focus on Machine Learning for land cover classification based on satellite images. Before we get into the details, I want to cover some common remote sensing knowledge, because I assume some readers come from Data Science, Machine Learning, or Statistics backgrounds.
Feature Inference Attack on Model Predictions in Vertical Federated Learning
Luo, Xinjian, Wu, Yuncheng, Xiao, Xiaokui, Ooi, Beng Chin
Federated learning (FL) is an emerging paradigm for facilitating multiple organizations' data collaboration without revealing their private data to each other. Recently, vertical FL, where the participating organizations hold the same set of samples but with disjoint features and only one organization owns the labels, has received increased attention. This paper presents several feature inference attack methods to investigate the potential privacy leakages in the model prediction stage of vertical FL. The attack methods consider the most stringent setting that the adversary controls only the trained vertical FL model and the model predictions, relying on no background information. We first propose two specific attacks on the logistic regression (LR) and decision tree (DT) models, according to individual prediction output. We further design a general attack method based on multiple prediction outputs accumulated by the adversary to handle complex models, such as neural networks (NN) and random forest (RF) models. Experimental evaluations demonstrate the effectiveness of the proposed attacks and highlight the need for designing private mechanisms to protect the prediction outputs in vertical FL.
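To see why a single prediction output can leak features, consider a deliberately simplified case: a binary logistic regression model whose weights the adversary knows, where the adversary holds all features except one. The confidence score then determines the missing feature exactly, by inverting the sigmoid. This is a minimal sketch under those assumptions, not the paper's attack implementation; the function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def infer_target_feature(p, weights, bias, known, target_index):
    """Solve logit(p) = bias + sum_i w_i * x_i for the one unknown x_target.

    `known` maps feature index -> value for the features the adversary holds.
    """
    partial = bias + sum(weights[i] * value for i, value in known.items())
    return (logit(p) - partial) / weights[target_index]


# Toy model and sample (assumed values for illustration).
weights = [0.8, -1.5, 2.0]
bias = 0.1
x = [1.0, 0.5, -0.3]  # feature 2 belongs to the other party

# The prediction output the adversary observes.
p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

recovered = infer_target_feature(p, weights, bias, {0: x[0], 1: x[1]}, 2)
print(recovered)
```

With more than one unknown feature a single equation no longer pins down the values, which is one reason an adversary may need to accumulate multiple prediction outputs, as the abstract describes for more complex models.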
XAI-N: Sensor-based Robot Navigation using Expert Policies and Decision Trees
Roth, Aaron M., Liang, Jing, Manocha, Dinesh
We present a novel sensor-based learning navigation algorithm to compute a collision-free trajectory for a robot in dense and dynamic environments with moving obstacles or targets. Our approach uses a deep reinforcement learning-based expert policy that is trained using a sim2real paradigm. In order to increase the reliability and handle the failure cases of the expert policy, we combine it with a policy extraction technique to transform the resulting policy into a decision tree format. The resulting decision tree has properties which we use to analyze and modify the policy and improve performance on navigation metrics including smoothness, frequency of oscillation, frequency of immobilization, and obstruction of target. We are able to modify the policy to address these imperfections without retraining, combining the learning power of deep learning with the control of domain-specific algorithms. We highlight the benefits of our algorithm in simulated environments and in navigating a Clearpath Jackal robot among moving pedestrians.
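The benefit of a decision tree policy that can be modified without retraining can be illustrated with a small sketch: a tree maps sensor readings to actions, and a problematic leaf is replaced by a hand-written subtree. The tree format, the sensor names `front_dist` and `target_angle`, the action names, and the helper `replace_subtree` are all hypothetical and chosen only for illustration; the paper's extracted trees and modification rules are not reproduced here.

```python
import copy


def act(node, obs):
    """Walk the tree using the observation dict and return the leaf's action."""
    while "action" not in node:
        branch = "left" if obs[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["action"]


def replace_subtree(tree, path, new_subtree):
    """Return a copy of the tree with the node at `path` replaced.

    `path` is a list of "left"/"right" steps from the root.
    """
    if not path:
        return copy.deepcopy(new_subtree)
    edited = copy.deepcopy(tree)
    parent = edited
    for branch in path[:-1]:
        parent = parent[branch]
    parent[path[-1]] = copy.deepcopy(new_subtree)
    return edited


# Extracted policy (toy): stop whenever an obstacle is close, else drive forward.
policy = {
    "feature": "front_dist", "threshold": 0.5,
    "left": {"action": "stop"},
    "right": {"action": "forward"},
}

# Hand-written fix for immobilization: turn toward the target instead of stopping.
patch = {
    "feature": "target_angle", "threshold": 0.0,
    "left": {"action": "turn_left"},
    "right": {"action": "turn_right"},
}

patched = replace_subtree(policy, ["left"], patch)

obs = {"front_dist": 0.3, "target_angle": -0.4}
print(act(policy, obs), "->", act(patched, obs))
```

Because the edit touches only one branch, behavior on every other path through the tree is unchanged, which makes the effect of the modification easy to audit.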