Goto

Collaborating Authors

 Regression


Automated Assessment of Residual Plots with Computer Vision Models

arXiv.org Machine Learning

Plotting the residuals is a recommended procedure to diagnose deviations from linear model assumptions, such as non-linearity, heteroscedasticity, and non-normality. The presence of structure in residual plots can be tested using the lineup protocol to do visual inference. There are a variety of conventional residual tests, but the lineup protocol, used as a statistical test, performs better for diagnostic purposes because it is less sensitive and applies more broadly to different types of departures. However, the lineup protocol relies on human judgment which limits its scalability. This work presents a solution by providing a computer vision model to automate the assessment of residual plots. It is trained to predict a distance measure that quantifies the disparity between the residual distribution of a fitted classical normal linear regression model and the reference distribution, based on Kullback-Leibler divergence. From extensive simulation studies, the computer vision model exhibits lower sensitivity than conventional tests but higher sensitivity than human visual tests. It is slightly less effective on non-linearity patterns. Several examples from classical papers and contemporary data illustrate the new procedures, highlighting its usefulness in automating the diagnostic process and supplementing existing methods.


Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models

arXiv.org Artificial Intelligence

This paper introduces Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models. These priors are extensions of traditional mixtures of $g$ priors that allow for differential shrinkage for various (data-selected) blocks of parameters while fully accounting for the predictors' correlation structure, providing a bridge between the literatures on model selection and continuous shrinkage priors. We show that Dirichlet process mixtures of block $g$ priors are consistent in various senses and, in particular, that they avoid the conditional Lindley ``paradox'' highlighted by Som et al.(2016). Further, we develop a Markov chain Monte Carlo algorithm for posterior inference that requires only minimal ad-hoc tuning. Finally, we investigate the empirical performance of the prior in various real and simulated datasets. In the presence of a small number of very large effects, Dirichlet process mixtures of block $g$ priors lead to higher power for detecting smaller but significant effects without only a minimal increase in the number of false discoveries.


Classification problem in liability insurance using machine learning models: a comparative study

arXiv.org Machine Learning

The insurance company uses different factors to classify the policyholders. In this study, we apply several machine learning models such as nearest neighbour and logistic regression to the Actuarial Challenge dataset used by Qazvini (2019) to classify liability insurance policies into two groups: 1 - policies with claims and 2 - policies without claims. The applications of Machine Learning (ML) models and Artificial Intelligence (AI) in areas such as medical diagnosis, economics, banking, fraud detection, agriculture, etc, have been known for quite a number of years. ML models have changed these industries remarkably. However, despite their high predictive power and their capability to identify nonlinear transformations and interactions between variables, they are slowly being introduced into the insurance industry and actuarial fields.


Building trust in AI: Transparent models for better decisions

AIHub

AI is becoming a part of our daily lives, from approving loans to diagnosing diseases. AI model outputs are used to make increasingly important decisions, based on smart algorithms and data. But if we can't understand these decisions, how can we trust them? One approach to making AI decisions more understandable is to use models that are inherently interpretable. These are models that are designed in such a way that consumers of the model outputs can infer the model's behaviour by reading the parameters of the model. Popular inherently interpretable models include Decision Trees and Linear Regression.


Explainable Artificial Intelligence for Dependent Features: Additive Effects of Collinearity

arXiv.org Machine Learning

Explainable Artificial Intelligence (XAI) emerged to reveal the internal mechanism of machine learning models and how the features affect the prediction outcome. Collinearity is one of the big issues that XAI methods face when identifying the most informative features in the model. Current XAI approaches assume the features in the models are independent and calculate the effect of each feature toward model prediction independently from the rest of the features. However, such assumption is not realistic in real life applications. We propose an Additive Effects of Collinearity (AEC) as a novel XAI method that aim to considers the collinearity issue when it models the effect of each feature in the model on the outcome. AEC is based on the idea of dividing multivariate models into several univariate models in order to examine their impact on each other and consequently on the outcome. The proposed method is implemented using simulated and real data to validate its efficiency comparing with the a state of arts XAI method. The results indicate that AEC is more robust and stable against the impact of collinearity when it explains AI models compared with the state of arts XAI method.


Assessing Concordance between RNA-Seq and NanoString Technologies in Ebola-Infected Nonhuman Primates Using Machine Learning

arXiv.org Artificial Intelligence

This study evaluates the concordance between RNA sequencing (RNA-Seq) and NanoString technologies for gene expression analysis in non-human primates (NHPs) infected with Ebola virus (EBOV). We performed a detailed comparison of both platforms, demonstrating a strong correlation between them, with Spearman coefficients for 56 out of 62 samples ranging from 0.78 to 0.88, with a mean of 0.83 and a median of 0.85. Bland-Altman analysis further confirmed high consistency, with most measurements falling within 95% confidence limits. A machine learning approach, using the Supervised Magnitude-Altitude Scoring (SMAS) method trained on NanoString data, identified OAS1 as a key marker for distinguishing RT-qPCR positive from negative samples. Remarkably, when applied to RNA-Seq data, OAS1 also achieved 100% accuracy in differentiating infected from uninfected samples using logistic regression, demonstrating its robustness across platforms. Further differential expression analysis identified 12 common genes including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL which demonstrated the highest levels of statistical significance and biological relevance across both platforms. Gene Ontology (GO) analysis confirmed that these genes are directly involved in key immune and viral infection pathways, reinforcing their importance in EBOV infection. In addition, RNA-Seq uniquely identified genes such as CASP5, USP18, and DDX60, which play key roles in immune regulation and antiviral defense. This finding highlights the broader detection capabilities of RNA-Seq and underscores the complementary strengths of both platforms in providing a comprehensive and accurate assessment of gene expression changes during Ebola virus infection.


A Study of Secure Algorithms for Vertical Federated Learning: Take Secure Logistic Regression as an Example

arXiv.org Artificial Intelligence

After entering the era of big data, more and more companies build services with machine learning techniques. However, it is costly for companies to collect data and extract helpful handcraft features on their own. Although it is a way to combine with other companies' data for boosting the model's performance, this approach may be prohibited by laws. In other words, finding the balance between sharing data with others and keeping data from privacy leakage is a crucial topic worthy of close attention. This paper focuses on distributed data and conducts secure model-training tasks on a vertical federated learning scheme. Here, secure implies that the whole process is executed in the encrypted domain; therefore, the privacy concern is released.


Progression: an extrapolation principle for regression

arXiv.org Machine Learning

The problem of regression extrapolation, or out-of-distribution generalization, arises when predictions are required at test points outside the range of the training data. In such cases, the non-parametric guarantees for regression methods from both statistics and machine learning typically fail. Based on the theory of tail dependence, we propose a novel statistical extrapolation principle. After a suitable, data-adaptive marginal transformation, it assumes a simple relationship between predictors and the response at the boundary of the training predictor samples. This assumption holds for a wide range of models, including non-parametric regression functions with additive noise. Our semi-parametric method, progression, leverages this extrapolation principle and offers guarantees on the approximation error beyond the training data range. We demonstrate how this principle can be effectively integrated with existing approaches, such as random forests and additive models, to improve extrapolation performance on out-of-distribution samples.


Very fast Bayesian Additive Regression Trees on GPU

arXiv.org Machine Learning

BART Bayesian Additive Regression Trees (BART) is a nonparametric Bayesian regression method, introduced by Chipman, George, and McCulloch (2006, 2010). It defines a prior distribution over the space of functions by representing them as a sum of binary decision trees, and then specifying a stochastic tree generation process. The posterior is then obtained with Metropolis-Gibbs sampling over the trees. See Hill, Linero, and Murray (2020) for a review, and Daniels, Linero, and Roy (2023, ch. 5) for a textbook treatment. BART's success BART has proven empirically effective, and is gaining popularity (consider, e.g., Tan and Roy 2019). The Atlantic Causal Inference Conference (ACIC) Data Challenge has confirmed BART as one of the best regression methods for causal inference (Dorie et al. 2019; Gruber et al. 2019; Hahn, Dorie, and Murray 2019; Thal and Finucane 2023). Many BART variants have been developed throughout the years, adding features such as variable selection (Linero 2018).


Assessing the Auditability of AI-integrating Systems: A Framework and Learning Analytics Case Study

arXiv.org Artificial Intelligence

Audits contribute to the trustworthiness of Learning Analytics (LA) systems that integrate Artificial Intelligence (AI) and may be legally required in the future. We argue that the efficacy of an audit depends on the auditability of the audited system. Therefore, systems need to be designed with auditability in mind. We present a framework for assessing the auditability of AI-integrating systems that consists of three parts: (1) Verifiable claims about the validity, utility and ethics of the system, (2) Evidence on subjects (data, models or the system) in different types (documentation, raw sources and logs) to back or refute claims, (3) Evidence must be accessible to auditors via technical means (APIs, monitoring tools, explainable AI, etc.). We apply the framework to assess the auditability of Moodle's dropout prediction system and a prototype AI-based LA. We find that Moodle's auditability is limited by incomplete documentation, insufficient monitoring capabilities and a lack of available test data. The framework supports assessing the auditability of AI-based LA systems in use and improves the design of auditable systems and thus of audits.