Decision Tree Learning


Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems

arXiv.org Artificial Intelligence

The end of the Dennard scaling law -- which stipulated a continuous increase in processor clock frequency by transistor miniaturization -- in conjunction with the continuation of Moore's law -- which expects the number of CMOS transistors within a microchip to double every two years -- shifted the technology trend towards parallel architectures. In the early 2000s, parallel computer system architectures focused on multi-core CPU architectures. Later, the introduction of GPGPU paradigms pivoted technology trends to heterogeneous systems composed of both multi-core CPUs and GPUs. This heterogeneity unveiled the challenge of software performance portability. Software performance portability seeks to achieve equivalent performance regardless of the underlying hardware architecture using a single application implementation. Programming models, such as OmpSs [9], OpenMP, Kokkos [10], and RAJA [15], provide abstractions to hide the vendor-specific interfaces required to develop applications on all these heterogeneous parallel architectures and offer unified interfaces to express parallelism. Although these programming models provide a single and convenient layer to implement portable code, the performance of the same application can vary when executed on different architectures and systems. Thus, these programming models efficiently express portable code, but the application performance portability is unspecified for application executions on different heterogeneous systems. For example, HPC programmers have found that a single version of source code, with an associated static definition of ex...

arXiv:2303.08873v1


Predicting Individualized Effects of Internet-Based Treatment for Genito-Pelvic Pain/Penetration Disorder: Development and Internal Validation of a Multivariable Decision Tree Model

arXiv.org Machine Learning

Genito-Pelvic Pain/Penetration Disorder (GPPPD) is a common disorder but is rarely treated in routine care. Previous research documents that GPPPD symptoms can be treated effectively using internet-based psychological interventions. However, non-response remains common for all state-of-the-art treatments, and it is unclear which patient groups are expected to benefit most from an internet-based intervention. Multivariable prediction models are increasingly used to identify predictors of heterogeneous treatment effects, and to allocate treatments with the greatest expected benefits. In this study, we developed and internally validated a multivariable decision tree model that predicts effects of an internet-based treatment on a multidimensional composite score of GPPPD symptoms. Data from a randomized controlled trial comparing the internet-based intervention to a waitlist control group ($N$=200) were used to develop a decision tree model using model-based recursive partitioning. Model performance was assessed by examining the apparent and bootstrap bias-corrected performance. The final pruned decision tree consisted of one splitting variable, joint dyadic coping, based on which two response clusters emerged. No effect was found for patients with low dyadic coping ($n$=33; $d$=0.12; 95% CI: -0.57 to 0.80), while large effects ($d$=1.00; 95% CI: 0.68 to 1.32; $n$=167) are predicted for those with high dyadic coping at baseline. The bootstrap bias-corrected performance of the model was $R^2$=27.74% (RMSE=13.22).


Are Models Trained on Indian Legal Data Fair?

arXiv.org Artificial Intelligence

Recent advances and applications of language technology and artificial intelligence have enabled much success across multiple domains like law, medicine, and mental health. AI-based language models for tasks such as judgement prediction have recently been proposed for the legal sector. However, these models are rife with encoded social biases picked up from the training data. While bias and fairness have been studied across NLP, most studies primarily locate themselves within a Western context. In this work, we present an initial investigation of fairness from the Indian perspective in the legal domain. We highlight the propagation of learnt algorithmic biases in the bail prediction task for models trained on Hindi legal documents. We evaluate the fairness gap using demographic parity and show that a decision tree model trained for the bail prediction task has an overall fairness disparity of 0.237 between input features associated with Hindus and Muslims. Additionally, we highlight the need for further research and studies in the avenues of fairness/bias in applying AI in the legal sector with a specific focus on the Indian context.
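
For readers unfamiliar with demographic parity, the following is a minimal sketch of how such a gap can be computed: the absolute difference in positive-prediction rates between two groups. The function names, group labels, and toy data are hypothetical and are not taken from the paper, whose exact evaluation procedure may differ.

```python
def positive_rate(predictions):
    """Fraction of predictions equal to 1 (e.g. 'bail granted')."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(predictions, groups, group_a, group_b):
    """Absolute difference in positive-prediction rates between two groups."""
    preds_a = [p for p, g in zip(predictions, groups) if g == group_a]
    preds_b = [p for p, g in zip(predictions, groups) if g == group_b]
    return abs(positive_rate(preds_a) - positive_rate(preds_b))


# Hypothetical toy predictions (1 = positive outcome) and group labels.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap = demographic_parity_gap(predictions, groups, "A", "B")
print(f"demographic parity gap: {gap:.3f}")
```

A gap of 0 would mean both groups receive positive predictions at the same rate; larger values indicate a larger disparity.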


Adversarial random forests for density estimation and generative modeling

arXiv.org Artificial Intelligence

We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying $\texttt{R}$ package, $\texttt{arf}$, is available on $\texttt{CRAN}$.


Detection of DDoS Attacks in Software Defined Networking Using Machine Learning Models

arXiv.org Artificial Intelligence

The concept of Software Defined Networking (SDN) represents a modern approach to networking that separates the control plane from the data plane through network abstraction, resulting in a flexible, programmable and dynamic architecture compared to traditional networks. The separation of control and data planes has led to a high degree of network resilience, but has also given rise to new security risks, including the threat of distributed denial-of-service (DDoS) attacks, which pose a new challenge in the SDN environment. In this paper, the effectiveness of using machine learning algorithms to detect DDoS attacks in SDN environments is investigated. Four algorithms, namely Random Forest, Decision Tree, Support Vector Machine, and XGBoost, were tested on the CICDDoS2019 dataset, with the timestamp feature dropped among others. Performance was assessed by measures of accuracy, recall, and F1 score, with the Random Forest algorithm having the highest accuracy, at 68.9%. The results indicate that ML-based detection is a more accurate and effective method for identifying DDoS attacks in SDN, despite the computational requirements of non-parametric algorithms.
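
As an illustration of the evaluation measures named above, here is a minimal sketch that derives accuracy, recall, and F1 score from binary labels. Precision is computed as well because F1 depends on it. The function name and the toy labels are hypothetical; the paper's own evaluation code is not shown in the abstract.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 1 = attack traffic, 0 = benign traffic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```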


Credit Card Fraud Detection Using Enhanced Random Forest Classifier for Imbalanced Data

arXiv.org Artificial Intelligence

The credit card has become the most popular payment method for both online and offline transactions. The need for a fraud detection algorithm that precisely identifies and stops fraudulent activity arises from both the development of technology and the rise in fraud cases. This paper implements the random forest (RF) algorithm to solve the issue at hand. A dataset of credit card transactions was used in this study. The main problem when dealing with credit card fraud detection is the imbalanced dataset, in which most of the transactions are non-fraudulent. To overcome the problem of the imbalanced dataset, the synthetic minority over-sampling technique (SMOTE) was used. Hyperparameter tuning was applied to enhance the performance of the random forest classifier. The results showed that the RF classifier achieved an accuracy of 98% and an F1-score of about 98%, which is promising. We also believe that our model is relatively easy to apply and can overcome the issue of imbalanced data for fraud detection applications.
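
The core idea behind SMOTE can be sketched as follows: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. This is a simplified standard-library illustration, not the paper's pipeline; the function name, parameters, and toy points are hypothetical, and practical work would more likely rely on an established implementation such as the `SMOTE` class in the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating towards neighbours.

    Simplified illustration of the SMOTE idea; assumes `minority` holds at
    least two distinct points given as equal-length tuples of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class (fraud) points in a 2-D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays within the region the original minority samples occupy.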


NFL Career Success as Predicted by NFL Scouting Combine

arXiv.org Artificial Intelligence

The National Football League (NFL) Scouting Combine serves as a tool to evaluate the skills of prospective players and assess their readiness to play in the NFL. The development of machine learning brings new opportunities in assessing the utility of the Scouting Combine. Using machine and statistical learning, it may be possible to predict future success of prospective athletes, as well as predict which Scouting Combine tests are the most important. Results from statistical learning research have been contradictory as to whether the Scouting Combine is a useful metric for player success. In this study, we investigate if machine learning can be used to determine matriculation and future success in the NFL. Using Scouting Combine data, we evaluate six different algorithms' ability to predict whether a potential draft pick will play a single NFL snap (matriculation). If a player is drafted, we predict how many snaps they go on to play (success). We are able to predict matriculation with 83% accuracy; however, we are unable to predict later success. Our best performing algorithm returns large error and low explained variance (RMSE=1,210 snaps; ${R}^2$=0.17). These findings indicate that while the Scouting Combine can predict NFL matriculation, it may not be a reliable predictor of long-term player success.
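
To make the two reported regression measures concrete, the sketch below computes RMSE and $R^2$ using the usual definitions (root of the mean squared residual, and one minus the ratio of residual to total sum of squares). The function names and the snap counts are hypothetical toy values, not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical snap counts: observed vs. predicted by some model.
observed = [100, 200, 300, 400]
predicted = [150, 150, 350, 350]

error = rmse(observed, predicted)
explained = r_squared(observed, predicted)
print(f"RMSE = {error:.1f} snaps, R^2 = {explained:.2f}")
```

An $R^2$ near 0, as reported in the abstract, means the model explains little more of the variation than always predicting the mean would.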


Lexical Complexity Prediction: An Overview

arXiv.org Artificial Intelligence

Understanding the meaning of words in context is fundamental for reading comprehension. The perceived difficulty, hereafter referred to as complexity, of a target word within a given text varies widely among readers. With an increased demand for distance learning and educational technologies [107], automatically predicting which words are likely to cause comprehension problems is becoming a popular area of research [115, 147, 185]. Systems have been created to identify complex words that are difficult to acquire, reproduce, or understand for children [79], second-language learners [89], people suffering from a reading disability, such as dyslexia [131] or aphasia [35, 53], or more generally, individuals with low literacy [59, 175]. In Computational Linguistics and Natural Language Processing (NLP), the task of automatically recognizing complex words is most often achieved by training machine learning (ML) models. These ML models assign a complexity value to each target word within an inputted extract, sentence, or text that allows for the identification of complex words. This information can then be used to improve downstream lexical and text simplification systems that provide simpler alternatives to aid reading comprehension. Take the extract shown in Table 1 for example.


Forecasting the movements of Bitcoin prices: an application of machine learning algorithms

arXiv.org Artificial Intelligence

Cryptocurrencies, such as Bitcoin, are among the most controversial and complex technological innovations in today's financial system. This study aims to forecast the movements of Bitcoin prices at a high degree of accuracy. To this end, four different Machine Learning (ML) algorithms are applied, namely, the Support Vector Machines (SVM), the Artificial Neural Network (ANN), the Naive Bayes (NB) and the Random Forest (RF), with the logistic regression (LR) as a benchmark model. In order to test these algorithms, a discrete dataset was created and used in addition to the existing continuous dataset. For the evaluations of algorithm performances, the F statistic, accuracy statistic, the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE) and the Root Absolute Error (RAE) metrics were used. The t test was used to compare the performances of the SVM, ANN, NB and RF with the performance of the LR. Empirical findings reveal that, while the RF has the highest forecasting performance in the continuous dataset, the NB has the lowest. In the discrete dataset, on the other hand, the ANN has the highest and the NB the lowest performance. Furthermore, the discrete dataset improves the overall forecasting performance in all algorithms (models) estimated.


Optimal Sparse Recovery with Decision Stumps

arXiv.org Artificial Intelligence

Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though often used in practice for feature selection, the theoretical guarantees of these methods are not well understood. Here we obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these "decision stumps" for the recovery of the $s$ active features from $p$ total features, where $s \ll p$. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems which align with the finite sample bound of $O(s \log p)$ as obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as arbitrary sub-Gaussian distributions, demonstrating that tree-based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that we can provably guarantee recovery even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.
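
The general idea of feature selection with decision stumps can be sketched as follows: fit one single-split tree per feature, score each feature by the impurity (variance) reduction its split achieves, and keep the $s$ highest-scoring features. The sketch below uses a median split and synthetic data; the function names, data-generating model, and constants are assumptions made for illustration and do not reproduce the paper's exact algorithm or experiments.

```python
import random
import statistics


def variance_reduction(x_column, y):
    """Impurity reduction from one split of y at the median of one feature."""
    threshold = statistics.median(x_column)
    left = [yi for xi, yi in zip(x_column, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x_column, y) if xi > threshold]
    if not left or not right:
        return 0.0
    weighted_child_var = (
        len(left) * statistics.pvariance(left)
        + len(right) * statistics.pvariance(right)
    ) / len(y)
    return statistics.pvariance(y) - weighted_child_var


def select_features(X, y, s):
    """Return the indices of the s features with the largest stump score."""
    p = len(X[0])
    scores = [variance_reduction([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:s])


# Synthetic sparse linear model (assumed for illustration): only features
# 2 and 7 out of p = 20 influence the response.
rng = random.Random(0)
n, p = 400, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[2] - 2.0 * row[7] + rng.gauss(0, 0.1) for row in X]

selected = select_features(X, y, s=2)
print("selected features:", selected)
```

Each stump looks at a single feature in isolation, so the whole procedure costs one pass over the data per feature.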