Ensemble Learning
C-XGBoost: A tree boosting model for causal effect estimation
Kiriakidou, Niki, Livieris, Ioannis E., Diou, Christos
Causal effect estimation aims at estimating the Average Treatment Effect as well as the Conditional Average Treatment Effect of a treatment on an outcome from the available data. This knowledge is important in many safety-critical domains, where it often needs to be extracted from observational data. In this work, we propose a new causal inference model, named C-XGBoost, for the prediction of potential outcomes. The motivation of our approach is to exploit the superiority of tree-based models for handling tabular data together with the notable property of causal inference neural network-based models to learn representations that are useful for estimating the outcome for both the treatment and non-treatment cases. The proposed model also inherits the considerable advantages of the XGBoost model, such as efficiently handling features with missing values with minimal preprocessing effort, and it is equipped with regularization techniques to avoid overfitting/bias. Furthermore, we propose a new loss function for efficiently training the proposed causal inference model. The experimental analysis, which is based on the performance profiles of Dolan and Moré as well as on post-hoc and non-parametric statistical tests, provides strong evidence of the effectiveness of the proposed approach.
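To make the two estimands concrete, the following minimal sketch computes per-unit effects and the Average Treatment Effect from predicted potential outcomes using the standard plug-in estimator. It is not the C-XGBoost model or its loss function; the function name `estimate_effects` and the numbers are hypothetical, and the predictions are assumed to come from some already-trained outcome model.

```python
from statistics import mean


def estimate_effects(mu0, mu1):
    """Return (ATE, per-unit effects) from predicted potential outcomes.

    mu0[i] is the predicted outcome for unit i without treatment,
    mu1[i] the predicted outcome for the same unit with treatment.
    Both lists are assumed to come from an already-trained model.
    """
    if len(mu0) != len(mu1):
        raise ValueError("mu0 and mu1 must have the same length")
    # Per-unit effect estimate: difference of the two potential outcomes.
    unit_effects = [y1 - y0 for y0, y1 in zip(mu0, mu1)]
    # ATE estimate: average of the per-unit effects.
    return mean(unit_effects), unit_effects


# Hypothetical predictions for three units.
mu0 = [1.0, 2.0, 3.0]
mu1 = [2.0, 2.5, 5.0]
ate, unit_effects = estimate_effects(mu0, mu1)
```

The per-unit differences play the role of Conditional Average Treatment Effect estimates at each unit's covariates, and their mean is the Average Treatment Effect estimate.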
Evaluating Explanatory Capabilities of Machine Learning Models in Medical Diagnostics: A Human-in-the-Loop Approach
Bobes-Bascarán, José, Mosqueira-Rey, Eduardo, Fernández-Leal, Ángel, Hernández-Pereira, Elena, Alonso-Ríos, David, Moret-Bonillo, Vicente, Figueirido-Arnoso, Israel, Vidal-Ínsua, Yolanda
Explainable AI (XAI) [1] is a research field focused on making Artificial Intelligence (AI) systems in general, and Machine Learning (ML) systems in particular, more understandable to humans. Explainable AI offers several advantages, to name a few: it fosters confidence in the prediction of the model by making the decision-making process more transparent, promotes responsible AI development, aids in debugging and identifying issues, and allows auditing of AI models and checking whether they adhere to regulatory standards. The inherent explainability of AI systems has not remained static but has changed considerably as a result of technological progress. In fact, explainability has become an increasingly difficult issue to tackle, as the internal functioning of AI systems has become less intelligible as they have become more complex [2]. Initially, symbolic AI models were explainable per se, e.g., rule-based expert systems could easily show their users which rules they had followed to make a given decision, even when the rules incorporated measures of uncertainty and imprecision as, for example, in fuzzy systems. These types of AI models are considered transparent, which means that the model itself is understandable [3], where understandability is the characteristic of a model that lets a human understand its function without any need for explaining its internal structure or the algorithmic means by which the model processes data internally [4].
Dealing with Imbalanced Classes in Bot-IoT Dataset
Atuhurra, Jesse, Hara, Takanori, Zhang, Yuanyu, Sasabe, Masahiro, Kasahara, Shoji
With the rapidly spreading usage of Internet of Things (IoT) devices, a network intrusion detection system (NIDS) plays an important role in detecting and protecting against various types of attacks in the IoT network. To evaluate the robustness of the NIDS in the IoT network, existing work proposed a realistic botnet dataset in the IoT network (Bot-IoT dataset) and applied it to machine learning-based anomaly detection. This dataset is imbalanced because the number of normal packets is much smaller than that of attack packets. Such imbalance may make it difficult to identify the minority class correctly. In this thesis, to address the class imbalance problem in the Bot-IoT dataset, we propose a binary classification method with the synthetic minority over-sampling technique (SMOTE). The proposed classifier aims to detect attack packets and overcome the class imbalance problem using the SMOTE algorithm. Through numerical results, we demonstrate the proposed classifier's fundamental characteristics and the impact of imbalanced data on its performance.
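The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines. This is an illustrative pure-Python version under simplifying assumptions (numeric features, Euclidean distance, a hypothetical function name `smote`); it is not the classifier or pipeline used in the thesis.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating minority samples.

    minority: list of equal-length tuples of floats (minority class only).
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two features each.
normal_packets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(normal_packets, n_synthetic=6)
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.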
Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection
Li, Zhenglin, Zhu, Haibei, Liu, Houze, Song, Jintong, Cheng, Qishuo
This study conducts a thorough examination of malware detection using machine learning techniques, focusing on the evaluation of various classification models using the Mal-API-2019 dataset. The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively. Both ensemble and non-ensemble machine learning methods, such as Random Forest, XGBoost, K Nearest Neighbor (KNN), and Neural Networks, are explored. Special emphasis is placed on the importance of data pre-processing techniques, particularly TF-IDF representation and Principal Component Analysis, in improving model performance. Results indicate that ensemble methods, particularly Random Forest and XGBoost, exhibit superior accuracy, precision, and recall compared to others, highlighting their effectiveness in malware detection. The paper also discusses limitations and potential future directions, emphasizing the need for continuous adaptation to address the evolving nature of malware. This research contributes to ongoing discussions in cybersecurity and provides practical insights for developing more robust malware detection systems in the digital era.
Beyond Quantities: Machine Learning-based Characterization of Inequality in Infrastructure Quality Provision in Cities
The objective of this study is to characterize inequality in infrastructure quality across urban areas. While a growing body of literature has recognized the importance of characterizing infrastructure inequality in cities and provided quantified metrics to inform urban development plans, the majority of the existing approaches focus primarily on measuring the quantity of infrastructure, assuming that more infrastructure is better. Also, the existing research focuses primarily on index-based approaches in which the status of infrastructure provision in urban areas is determined based on assumed subjective weights. The focus on infrastructure quantity and the use of indices obtained from subjective weights have hindered the ability to properly examine infrastructure inequality as it pertains to urban inequality and environmental justice considerations. Recognizing this gap, we propose a machine learning-based approach in which infrastructure features that shape environmental hazard exposure are identified, and we use the weights obtained by the model to calculate an infrastructure quality provision for spatial areas of cities and, accordingly, quantify the extent of inequality in infrastructure quality. The implementation of the model in five metropolitan areas in the U.S. demonstrates the capability of the proposed approach in characterizing inequality in infrastructure quality and capturing city-specific differences in the weights of infrastructure features. The results also show that areas in which low-income populations reside have lower infrastructure quality provision, suggesting that lower infrastructure quality provision is a determinant of urban disparities. Accordingly, the proposed approach can be effectively used to inform integrated urban design strategies to promote infrastructure equity and environmental justice based on data-driven and machine intelligence-based insights.
Utilizing the LightGBM Algorithm for Operator User Credit Assessment Research
Li, Shaojie, Dong, Xinqi, Ma, Danqing, Dang, Bo, Zang, Hengyi, Gong, Yulu
Mobile Internet user credit assessment is an important way for communication operators to establish decisions and formulate measures, and it is also a guarantee for operators to obtain expected benefits. However, credit evaluation methods have long been monopolized by financial industries such as banking and credit. As supporters and providers of platform network technology and network resources, communication operators are also builders and maintainers of communication networks. Internet data improves the user's credit evaluation strategy. This paper uses the massive data provided by communication operators to carry out research on the operator's user credit evaluation model based on the fusion LightGBM algorithm. First, for the massive data related to user evaluation provided by operators, key features are extracted by data preprocessing and feature engineering methods, and a multi-dimensional feature set with statistical significance is constructed; then, linear regression, decision tree, LightGBM, and other machine learning algorithms are used to build multiple basic models and find the best one; finally, Averaging, Voting, Blending, Stacking, and other ensemble algorithms are integrated to refine multiple fusion models and establish the most suitable fusion model for operator user evaluation.
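Two of the fusion strategies named above, Averaging and Voting, are simple enough to sketch directly. The example below assumes each basic model has already produced a score or a class label per user; the function names and the numbers are hypothetical and do not come from the paper.

```python
from collections import Counter
from statistics import mean


def average_fusion(scores_per_model):
    """Averaging: mean of the models' scores for each sample."""
    # zip(*...) regroups the data from per-model rows to per-sample columns.
    return [mean(column) for column in zip(*scores_per_model)]


def voting_fusion(labels_per_model):
    """Voting: most common predicted label for each sample."""
    return [
        Counter(column).most_common(1)[0][0]
        for column in zip(*labels_per_model)
    ]


# Hypothetical outputs of three basic models for three users.
scores = [[0.2, 0.8, 0.5], [0.4, 0.6, 0.7], [0.3, 0.7, 0.9]]
labels = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
fused_scores = average_fusion(scores)
fused_labels = voting_fusion(labels)
```

Using an odd number of models for binary voting avoids ties. Blending and Stacking differ in that they train a further model on the basic models' outputs instead of applying a fixed rule.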
Function Trees: Transparent Machine Learning
A fundamental exercise in machine learning is the approximation of a function of several to many variables given values of the function, often contaminated with noise, at observed joint values of the input variables. The result can then be used to estimate unknown function values given corresponding inputs. The goal is to accurately estimate the underlying (non-noisy) outcome values, since the noise is by definition unpredictable. To the extent that this is successful, the estimated function may, in addition, be used to try to understand underlying phenomena giving rise to the data. Even when prediction accuracy is the dominant concern, being able to comprehend the way in which the input variables jointly combine to produce predictions may lead to important sanity checks on the validity of the function estimate. Besides accuracy, the success of this latter exercise requires that the structure of the function estimate be represented in a comprehensible form.
Machine Learning and Vision Transformers for Thyroid Carcinoma Diagnosis: A review
Habchi, Yassine, Kheddar, Hamza, Himeur, Yassine, Boukabou, Abdelkrim, Chouchane, Ammar, Ouamane, Abdelmalik, Atalla, Shadi, Mansoor, Wathiq
The growing interest in developing smart diagnostic systems to help medical experts process extensive data for treating incurable diseases has been notable. In particular, the challenge of identifying thyroid cancer (TC) has seen progress with the use of machine learning (ML) and big data analysis, incorporating transformers to evaluate TC prognosis and determine the risk of malignancy in individuals. This review article presents a summary of various studies on AI-based approaches, especially those employing transformers, for diagnosing TC. It introduces a new categorization system for these methods based on artificial intelligence (AI) algorithms, the goals of the framework, and the computing environments used. Additionally, it scrutinizes and contrasts the available TC datasets by their features. The paper highlights the importance of AI instruments in aiding the diagnosis and treatment of TC through supervised, unsupervised, or mixed approaches, with a special focus on the ongoing importance of transformers in medical diagnostics and disease management. It further discusses the progress made and the continuing obstacles in this area. Lastly, it explores future directions and focuses within this research field.
A New Random Forest Ensemble of Intuitionistic Fuzzy Decision Trees
Ren, Yingtao, Zhu, Xiaomin, Bai, Kaiyuan, Zhang, Runtong
Classification is essential to applications in the fields of data mining, artificial intelligence, and fault detection. There is a strong need to develop accurate, suitable, and efficient classification methods and algorithms with broad applicability. Random forest is a general algorithm that is often used for classification under complex conditions. Although it has been widely adopted, its combination with various fuzzy theories is still worth exploring. In this paper, we propose the intuitionistic fuzzy random forest (IFRF), a new random forest ensemble of intuitionistic fuzzy decision trees (IFDT). The trees in the forest use intuitionistic fuzzy information gain to select features and consider hesitation in information transmission. The proposed method enjoys the power of the randomness from bootstrapped sampling and feature selection, the flexibility of fuzzy logic and fuzzy sets, and the robustness of multiple classifier systems. Extensive experiments demonstrate that the IFRF has competitive and superior performance compared to other state-of-the-art fuzzy and ensemble algorithms. IFDT is more suitable for ensemble learning with outstanding classification accuracy. This study is the first to propose a random forest ensemble based on the intuitionistic fuzzy theory.
Uncertainty estimation in spatial interpolation of satellite precipitation with ensemble learning
Papacharalampous, Georgia, Tyralis, Hristos, Doulamis, Nikolaos, Doulamis, Anastasios
Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions on a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in the contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond.
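As an illustration of the scoring and the simple combiners mentioned above, the sketch below implements the standard quantile (pinball) score and mean/median combination of quantile predictions. It assumes the pinball loss is the quantile scoring function of interest; the function names and numbers are hypothetical, and the stacking methods are not reproduced here.

```python
from statistics import mean, median


def pinball_loss(y_true, q_pred, tau):
    """Mean quantile (pinball) score at level tau; lower is better."""
    losses = []
    for y, q in zip(y_true, q_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by 1 - tau.
        losses.append(tau * diff if diff >= 0 else (tau - 1) * diff)
    return mean(losses)


def combine(preds_per_learner, how="mean"):
    """Combine base learners' quantile predictions sample by sample."""
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    aggregate = mean if how == "mean" else median
    return [aggregate(column) for column in zip(*preds_per_learner)]


# Hypothetical 0.9-quantile predictions from three base learners.
base_predictions = [[1.0, 4.0], [2.0, 5.0], [6.0, 6.0]]
observed = [3.0, 5.5]
combined = combine(base_predictions, how="median")
score = pinball_loss(observed, combined, tau=0.9)
```

At a high quantile level such as 0.9, predicting too low is penalized nine times more heavily than predicting too high by the same amount, which is what pushes a well-scored prediction toward the upper tail.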