 Accuracy


Data-driven Prediction of Relevant Scenarios for Robust Combinatorial Optimization

arXiv.org Artificial Intelligence

Optimization under uncertainty is an important research field, especially due to its relevance in practical applications from operations research. In the real world many parameters of an optimization problem can be uncertain, e.g. demands, returns, traffic situations, or any other parameters that are not precisely known due to measurement or rounding errors. It was shown that hedging against possible perturbations in the problem parameters is essential, since even small perturbations can lead to a large violation of the constraints [BTEGN09]. Driven by the seminal works [Soy73, KY96, BTN98, BTN99, BS04], robust optimization evolved into one of the most popular approaches to tackle uncertainty in optimization problems by finding solutions which are worst-case optimal and feasible for all parameters of a pre-defined uncertainty set; see [BBC11, BK18, GMT14] for a literature overview. Later the classical robust optimization approach was extended to the two-stage robust optimization approach (also called adaptive robust optimization) in [BTGGN04], which has been extensively studied since; see e.g.


Self-Optimizing Feature Transformation

arXiv.org Artificial Intelligence

Feature transformation aims to extract a good representation (feature) space by mathematically transforming existing features. It is crucial to address the curse of dimensionality, enhance model generalization, overcome data sparsity, and expand the availability of classic models. Current research focuses on domain knowledge-based feature engineering or learning latent representations; nevertheless, these methods are not entirely automated and cannot produce a traceable and optimal representation space. When rebuilding a feature space for a machine learning task, can these limitations be addressed concurrently? In this extension study, we present a self-optimizing framework for feature transformation. To achieve a better performance, we improved the preliminary work by (1) obtaining an advanced state representation for enabling reinforced agents to comprehend the current feature set better; and (2) resolving Q-value overestimation in reinforced agents for learning unbiased and effective policies. Finally, to make experiments more convincing than the preliminary work, we conclude by adding the outlier detection task with five datasets, evaluating various state representation approaches, and comparing different training strategies. Extensive experiments and case studies show that our work is more effective and superior.


Calibrated Multiple-Output Quantile Regression with Representation Learning

arXiv.org Artificial Intelligence

We develop a method to generate predictive regions that cover a multivariate response variable with a user-specified probability. Our work is composed of two components. First, we use a deep generative model to learn a representation of the response that has a unimodal distribution. Existing multiple-output quantile regression approaches are effective in such cases, so we apply them on the learned representation, and then transform the solution to the original space of the response. This process results in a flexible and informative region that can have an arbitrary shape, a property that existing methods lack. Second, we propose an extension of conformal prediction to the multivariate response setting that modifies any method to return sets with a pre-specified coverage level. The desired coverage is theoretically guaranteed in the finite-sample case for any distribution. Experiments conducted on both real and synthetic data show that our method constructs regions that are significantly smaller compared to existing techniques.
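The coverage guarantee mentioned above can be illustrated with a minimal sketch of split conformal prediction. This is a simplified, single-output version and not the multivariate extension the abstract proposes; the predictor `predict`, the synthetic data, and all names are assumptions made for illustration only.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the interval must be unbounded to keep the guarantee.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def predict(x):
    # Hypothetical pre-fitted point predictor (assumed, not from the paper).
    return 2.0 * x

rng = random.Random(0)

def sample(n):
    # Synthetic data: y = 2x + standard normal noise.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 1)) for x in xs]

calibration = sample(500)
test_points = sample(2000)
alpha = 0.1  # target miscoverage, i.e. 90% coverage

# Nonconformity score: absolute residual on held-out calibration data.
scores = [abs(y - predict(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha)

# The prediction interval for a new x is [predict(x) - q, predict(x) + q].
covered = sum(abs(y - predict(x)) <= q for x, y in test_points)
coverage = covered / len(test_points)
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

The empirical coverage should land near the requested 90% regardless of how good `predict` is; a poorer predictor only makes the intervals wider.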


The choice of scaling technique matters for classification performance

arXiv.org Artificial Intelligence

Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It is aimed at adjusting attribute scales so that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is not generally made carefully. In this paper, we execute a broad experiment comparing the impact of 5 scaling techniques on the performances of 20 classification algorithms among monolithic and ensemble models, applying them to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance and provide insights into its applicability on different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository (https://github.com/amorimlb/scaling_matters).
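To make the idea of bringing attributes into the same range concrete, here is a minimal sketch of two widely used scaling techniques, min-max scaling and standardization. They are chosen for illustration and are not necessarily among the five techniques the paper compares; the function names and toy values are assumptions.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def standardize(values):
    """Shift to zero mean and rescale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two toy attributes on very different scales.
incomes = [20000.0, 40000.0, 60000.0, 100000.0]
ages = [20.0, 30.0, 40.0, 60.0]

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)
z_incomes = standardize(incomes)

print(scaled_incomes)
print(scaled_ages)
print(z_incomes)
```

After min-max scaling both attributes occupy the same [0, 1] range, so a distance-based classifier no longer lets income dominate age purely because of its units.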


Predicting Survival of Tongue Cancer Patients by Machine Learning Models

arXiv.org Artificial Intelligence

Tongue cancer is a common oral cavity malignancy that originates in the mouth and throat. Much effort has been invested in improving its diagnosis, treatment, and management. Surgical removal, chemotherapy, and radiation therapy remain the major treatments for tongue cancer. The survival of patients determines the treatment effect. Previous studies have identified certain survival and risk factors based on descriptive statistics, ignoring the complex, nonlinear relationship among clinical and demographic variables. In this study, we utilize five cutting-edge machine learning models and clinical data to predict the survival of tongue cancer patients after treatment. Five-fold cross-validation, bootstrap analysis, and permutation feature importance are applied to estimate and interpret model performance. The prognostic factors identified by our method are consistent with previous clinical studies. Our method is accurate, interpretable, and thus usable as additional evidence in tongue cancer treatment and management.
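Permutation feature importance, one of the interpretation tools named above, can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's accuracy drops. The toy data, the rule-based `model`, and all names below are hypothetical stand-ins, not the study's clinical data or models.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

def model(row):
    # Stand-in for a trained classifier; it only looks at feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(model, rows, labels)
print(importances)
```

A feature the model ignores gets an importance of exactly zero, while shuffling the decisive feature pushes accuracy toward chance.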


Benchmarking Machine Learning Models to Predict Corporate Bankruptcy

arXiv.org Artificial Intelligence

The risk of bankruptcy in a publicly traded firm is of major interest to shareholders, creditors, and employees. Prior literature has investigated the predictive performance of different forecasting models, mainly the discriminant analysis with accounting information (Altman, 1968), the distance to default structural model (Bharath and Shumway, 2008), and the hazard model with accounting and market information (Shumway, 2001; Chava and Jarrow, 2004). In this paper we investigate the benefits of applying high-dimensional machine learning (ML) methods to bankruptcy prediction. We use a comprehensive sample of bankruptcies for U.S. publicly traded companies from 1969 to 2019 with financial, market, macro, and text-based predictors. We study the performance of eight ML algorithms: the hazard model of Shumway (2001) and Chava and Jarrow (2004) enhanced with a penalty function (LASSO and Ridge), bagged trees (random forest and survival random forest), gradient boosted trees (XGBoost and LightGBM), and two specifications of neural networks (one shallower and one deeper).


Security and Interpretability in Automotive Systems

arXiv.org Artificial Intelligence

The lack of any sender authentication mechanism makes CAN (Controller Area Network) vulnerable to security threats. For instance, an attacker can impersonate an ECU (Electronic Control Unit) on the bus and send spoofed messages unobtrusively with the identifier of the impersonated ECU. To address the insecure nature of the system, this thesis demonstrates a sender authentication technique that uses power consumption measurements of the ECUs and a classification model to determine the transmitting states of the ECUs. The method's evaluation in real-world settings shows that the technique applies in a broad range of operating conditions and achieves good accuracy. A key challenge of machine learning-based security controls is the potential for false positives. A false-positive alert may induce panic in operators, lead to incorrect reactions, and in the long run cause alarm fatigue. For reliable decision-making in such circumstances, knowing the cause of unusual model behavior is essential. However, the black-box nature of these models makes them uninterpretable. Therefore, another contribution of this thesis explores explanation techniques for image and time-series inputs that (1) assign weights to individual inputs based on their sensitivity toward the target class, and (2) quantify the variations in the explanation by reconstructing the sensitive regions of the inputs using a generative model. In summary, this thesis (https://uwspace.uwaterloo.ca/handle/10012/18134) presents methods for addressing security and interpretability in automotive systems, which can also be applied in other settings where safe, transparent, and reliable decision-making is crucial.


A Study of Left Before Treatment Complete Emergency Department Patients: An Optimized Explanatory Machine Learning Framework

arXiv.org Artificial Intelligence

The issue of left before treatment complete (LBTC) patients is common in emergency departments (EDs). This issue represents a medico-legal risk and may cause a revenue loss. Thus, understanding the factors that cause patients to leave before treatment is complete is vital to mitigate and potentially eliminate these adverse effects. This paper proposes a framework for studying the factors that affect LBTC outcomes in EDs. The framework integrates machine learning, metaheuristic optimization, and model interpretation techniques. Metaheuristic optimization is used for hyperparameter optimization--one of the main challenges of machine learning model development. Three metaheuristic optimization algorithms are employed for optimizing the parameters of extreme gradient boosting (XGB): simulated annealing (SA), adaptive simulated annealing (ASA), and adaptive tabu simulated annealing (ATSA). The optimized XGB models are used to predict the LBTC outcomes for the patients under treatment in the ED. The designed algorithms are trained and tested using four data groups resulting from the feature selection phase. The model with the best predictive performance is interpreted using the SHapley Additive exPlanations (SHAP) method. The findings show that ATSA-XGB outperformed other model configurations with an accuracy, area under the curve (AUC), sensitivity, specificity, and F1-score of 86.61%, 87.50%, 85.71%, 87.51%, and 86.60%, respectively. The degree and the direction of effects of each feature were determined and explained using the SHAP method.
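As a rough illustration of using simulated annealing for hyperparameter search, the sketch below runs plain SA (not the adaptive or tabu variants described above) over two XGB-style hyperparameters. The `validation_loss` function is a hypothetical stand-in for a cross-validated model score, and the parameter ranges, cooling schedule, and names are all assumptions.

```python
import math
import random

def validation_loss(params):
    # Hypothetical smooth surrogate for a cross-validated loss; its minimum
    # is at learning_rate = 0.1 and max_depth = 6 by construction.
    learning_rate, max_depth = params
    return (math.log10(learning_rate) + 1.0) ** 2 + 0.05 * (max_depth - 6) ** 2

def neighbor(params, rng):
    """Propose a nearby configuration, clipped to the allowed ranges."""
    learning_rate, max_depth = params
    learning_rate = min(1.0, max(1e-3, learning_rate * 10 ** rng.uniform(-0.3, 0.3)))
    max_depth = min(12, max(1, max_depth + rng.choice([-1, 0, 1])))
    return (learning_rate, max_depth)

def simulated_annealing(start, steps=2000, t0=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current, current_loss = start, validation_loss(start)
    best, best_loss = current, current_loss
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_loss = validation_loss(candidate)
        delta = candidate_loss - current_loss
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = current, current_loss
        temperature *= cooling  # geometric cooling schedule
    return best, best_loss

start = (0.9, 2)
best_params, best_loss = simulated_annealing(start)
print(best_params, best_loss)
```

In a real pipeline `validation_loss` would train and evaluate a model for each candidate, which is why the number of annealing steps is usually the dominant cost.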


Actionable Auditing Revisited

Communications of the ACM

Non-target corporations Kairos and Amazon have overall error rates of 6.60% and 8.66%, respectively. These are the worst current performances of the companies analyzed in the follow-up audit. Nonetheless, when compared with the previous May 2017 performance of target corporations, the Kairos and Amazon error rates are lower than the former error rates of IBM (12.1%) and Face++ (9.9%) and only slightly higher than Microsoft's performance (6.2%) from the initial study.


PABAU: Privacy Analysis of Biometric API Usage

arXiv.org Artificial Intelligence

Biometric data privacy is becoming a major concern for many organizations in the age of big data, particularly in the ICT sector, because it may be easily exploited in apps. Most apps utilize biometrics by accessing common application programming interfaces (APIs); hence, we aim to categorize their usage. The categorization based on behavior may be closely correlated with the sensitive processing of a user's biometric data, hence highlighting crucial biometric data privacy assessment concerns. We propose PABAU, Privacy Analysis of Biometric API Usage. PABAU learns semantic features of methods in biometric APIs and uses them to detect and categorize the usage of biometric API implementation in the software according to their privacy-related behaviors. This technique bridges the communication and background knowledge gap between technical and non-technical individuals in organizations by providing an automated method for both parties to acquire a rapid understanding of the essential behaviors of biometric API in apps, as well as future support to data protection officers (DPO) with legal documentation, such as conducting a Data Protection Impact Assessment (DPIA).