 Regression


Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target

arXiv.org Artificial Intelligence

As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \in \{0, \ldots, H-1\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-, y^+]$. Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call Probability-space Conformalized Quantile Regression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain marginal guarantees for the probability that the cumulative reward of an autonomous system will fall within an arbitrary user-specified target interval. Experiments on two domains confirm that these probabilities are well-calibrated.
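To make the split-conformal step mentioned in the abstract more concrete, here is a minimal sketch of generic split-conformal prediction using absolute residuals as the non-conformity score. It is not CQR or PCQR, and it does not perform the inversion the paper describes; the names `point_predictor`, `conformal_quantile`, and `split_conformal_interval`, as well as the calibration data, are illustrative assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predictor, calib_x, calib_y, x_new, alpha):
    """Build a prediction interval for x_new from held-out calibration data."""
    # Non-conformity score: absolute residual of the fitted predictor.
    scores = [abs(y - predictor(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predictor(x_new)
    return center - q, center + q


def point_predictor(x):
    # Hypothetical pre-trained model, assumed to be fitted on separate data.
    return 2.0 * x


# Made-up calibration set whose residuals are 0.5, 1.0, ..., 4.5.
calib_x = [float(i) for i in range(1, 10)]
calib_y = [point_predictor(x) + 0.5 * i for i, x in enumerate(calib_x, start=1)]

lo, hi = split_conformal_interval(point_predictor, calib_x, calib_y, 10.0, alpha=0.5)
print(lo, hi)
```

With nine calibration points and `alpha=0.5`, the fifth-smallest residual (2.5) sets the half-width of the interval around the prediction for `x_new`.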


Making Linear Predictions in PyTorch - MachineLearningMastery.com

#artificialintelligence

Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […]
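As a rough illustration of what "finding the slope and intercept" means, the sketch below fits a line by ordinary least squares in plain Python. It is not the PyTorch code from the linked tutorial; the function names `fit_line` and `predict` and the data points are made up for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(4.0, slope, intercept))
```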


October 2022: "Top 40" New CRAN Packages

#artificialintelligence

One hundred seventy-four new packages made it to CRAN in October. Here are my “Top 40” selections in sixteen categories: Astronomy, Biology, Business, Computational Methods, Data, Ecology, Finance, Genomics, Mathematics, Machine Learning, Medicine, Pharma, Statistics, Time Series, Utilities, Visualization. Astronomy skylight v1.1: Provides a function to calculate sky illuminance values (in lux) for both the sun and moon. The model is a verbatim translation of the code by Janiczek and DeYoung (1987). There are vignettes for Use and Advanced Use. Biology palaeoverse v1.0.0: Provides tools to support data preparation and exploration for palaeobiological analyses including functions for data cleaning, binning (time and space), summarisation and visualisation with the goals of improving code reproducibility and accessibility and establishing standards for the palaeobiological community. See Jones et al. for details, and the contribution guide to get involved. pirouette v1.6.5: Implements a method to create a Bayesian posterior from a phylogeny that depicts the true evolutionary relationships. See Richèl et al. (2020) for background. There are several vignettes including a Tutorial, a demo, and a guide showing how to use the package in a scientific experiment. Business bupaverse v0.1.0: Facilitates loading the packages comprising the bupaverse, an integrated suite of R packages for handling and analysing business process data, developed by the Business Informatics research group at Hasselt University, Belgium. See the Getting Started Guide. Computational Methods fastWavelets v1.0.1: Provides an Rcpp implementation of the Maximal Overlap Discrete Wavelet Transform (MODWT) and the À Trous Discrete Wavelet Transform. See Quilty & Adamowski (2018) for background and README for examples. gips v1.0.0: Employs the methods described in Graczyk et al. (2022) to find the permutation symmetry group under which the covariance matrix of the data is invariant. 
See the vignettes Optimizers, Theory, and gips. HomomorphicEncryption v0.1.0: Implements the Brakerski-Fan-Vercauteren (2012), Brakerski-Gentry-Vaikuntanathan (2014), and Cheon-Kim-Kim-Song (2016) schemes for fully homomorphic encryption. There are seven short vignettes including BFV, BGV, and CKKS. rxode2random v2.0.9: Implements parallel random number generation. See Wang et al. (2016) and Fidler et al. (2019) for background and README for an example. Data airnow v0.1.0: Provides functions to retrieve U.S. Government AirNow air quality data. See README to get started. amazonadsR v0.1.0: Provides functions to collect data on digital marketing campaigns using the Windsor.ai API. See the tutorial for an example and also look at the related new packages: bingadsR, facebookadsR, googleadsR, instagramadsR, linkedinadsR, pinterestadsR, redditadsR, snapchatadsR, ticktokadsR, twitteradsR. Pablo Sanchez was on a roll in October. congress v0.0.1: Provides functions to download and read data on United States congressional proceedings through the Congress.gov API of the Library of Congress. See README for an example. Ecology canaper v1.0.0: Provides functions to analyze the spatial distribution of biodiversity especially useful in the categorical analysis of neo- and paleo-endemism (CANAPE) as described in Mishler et al. (2014) and for statistical tests to determine the types of endemism that occur in a study area while accounting for the evolutionary relationships of species. There are vignettes on CANAPE, randomization, and parallel computing. EcoEnsemble v1.0.1: Provides functions to fit and sample from the ensemble model described in Spence et al. (2018). There is an Introduction and there are two additional vignettes: ExploringPriors and SyntheticData. rTRIPLEXCWFlux v0.2.0: Encodes the carbon uptake submodule and evapotranspiration submodule of the TRIPLEX-CW-Flux model to run the simulation of carbon-water coupling. See Zhou et al. 
(2008) and Monteith (1965) for background and the vignette for examples. stopdetection v0.1.1: Enables stop detection in time-stamped trajectories by implementing the Stay Point detection algorithm originally described in Ye (2009) that uses time and distance thresholds to characterize spatial regions as stops. See the vignette for examples. Finance highOrderPortfolios v0.1.0: Implements methods to select portfolios using high order moments to characterize return distributions. See Zhou & Palomar (2021) and Wang et al. (2022) for the theory and the vignette to get started. MSTest v0.1.0: Implements hypothesis testing procedures described in Hansen (1992), Carrasco, Hu, & Ploberger (2014) and Dufour & Luger (2017) that can be used to identify the number of regimes in Markov switching models. See README for an example. Genomics metevalue v0.1.13: Implements the e-value method to correct p-values in omics data association studies. See Hebestreit & Klein (2022) and Akalin et al. (2012) for background and the vignette for an example. SCpubr v1.0.4: Implements a system that provides a streamlined way of generating publication ready plots for known Single-Cell transcriptomics data. Look here for an online reference manual. Mathematics Boov v1.0.0: Provides functions to perform the Boolean operations union, difference and intersection on volumes. Computations are done by the C++ library CGAL. See README for some examples. Also, have a look at the package MinkowskiSum. fitode v0.1.1: Provides methods and functions for fitting ordinary differential equations that use sensitivity equations to compute gradients of ODE trajectories with respect to underlying parameters. See the vignette for details. manifold v0.1.1: Implements operations for Riemannian manifolds including geodesic distance, Riemannian metric, and exponential and logarithm maps, and also incorporates a random object generator on the manifolds. See Dai, Lin, and Müller (2021) for details. 
Machine Learning SoftBart v1.0.1: Implements the SoftBart model described by Linero and Yang (2018) with the optional use of a sparsity-inducing prior to allow for variable selection. The vignette contains theory and examples. tidyfit v0.5.1: Extends the tidy data environment with functions to fit and cross validate linear regression and classification algorithms on grouped data. There are several vignettes including Predicting Boston House Prices, Multinomial Classification, and Rolling Window Time Series Regression. Medicine cities v0.1.0: Provides functions to simulate clinical trials and summarize causal effects and treatment policy estimands in the presence of intercurrent events. Have a look at the demo. RCT2 v0.0.1: Implements various statistical methods for designing and analyzing two-stage randomized controlled trials using the methods developed by Imai, Jiang, and Malani (2021) and Imai, Jiang, and Malani (2022). There are vignettes on Interference and Causal Inference. Pharma DTSEA v0.0.3: Implements a novel tool to identify candidate drugs against a particular disease based on the drug target set enrichment analysis. It assumes the most effective drugs are those with a closer affinity in the protein-protein interaction network to the specified disease. See Gómez-Carballa et al. (2022) and Feng et al. (2022) for disease expression profiles, Wishart et al. (2018) and Gaulton et al. (2017) for drug target information, and Kanehisa et al. (2021) for the details of KEGG database. There is a vignette. nlmixr2lib v0.1.0: Provides tools to create model libraries for nlmixr2. Models include pharmacokinetic, pharmacodynamic, and disease models used in pharmacometrics. See the vignette Creating a model library. Statistics aIc v1.0: Implements a set of tests for compositional pathologies including for coherence of correlations as suggested by Erb et al. 
(2020), compositional dominance of distance, compositional perturbation invariance as suggested by Aitchison (1992), and singularity of the covariation matrix. See the vignette for details and examples. ktweedie v1.0.1: Uses Reproducing Kernel Hilbert Space methods to implement Tweedie compound Poisson gamma models with high-dimensional predictors for the analyses of zero-inflated response variables. See the vignette for examples. missoNet v1.0.0: Implements efficient procedures for fitting conditional graphical lasso models linking predictor variables to response variables or tasks, when the response data may contain missing values. See the vignette for examples. ShapleyOutlier v0.1.0: Provides methods to use Shapley values to detect, explain, and cellwise impute multivariate outliers. See Mayrhofer and Filzmoser (2022) for details and the vignette for examples. SpatialfdaR v1.0.0: Provides functions that apply finite element analysis methods to spatial functional data analysis. See Sangalli et al. (2013) and Bernardi et al. (2018) for background and the vignette for an example. Time Series dfms v0.1.3: Provides a user-friendly and computationally efficient approach to estimate linear Gaussian dynamic factor models using Kalman filter and EM algorithm methods. See Doz et al. (2011) and Banbura & Modugno (2014) for background and the vignette for examples. Utilities ExclusionTable v1.0.0: Provides functions for creating tables of excluded observations by reporting the number before and after each subset() call together with the number of observations that have been excluded. See the vignette. shiny.tailwind v0.2.2: Allows TailwindCSS to be used in Shiny apps with just-in-time compiling including custom CSS with @apply directive, and custom tailwind configurations. See README for examples. Visualization AlphaHull3D v1.1.0: Provides functions to compute the alpha hull of a set of points (informally: the shape formed by these points) in 3D space. 
See README for some visualizations, and also have a look at the related packages MeshesTools, and PolygonSoup. bangladesh v1.0.0: Provides sf objects, shape files, and functions to draw regional choropleth maps for Bangladesh. See the vignette. ggstats v0.1.0: Provides functions to create forest plots of regression model coefficients along with new statistics to compute proportions, weighted mean and cross-tabulation statistics, as well as new geometries to add alternative background color to a plot. There are vignettes on plotting coefficients and on computing cross-tabulation, custom proportions, and weighted means. jagshelper v0.1.11: Provides tools to streamline Bayesian analyses in JAGS, including functions for extracting output, streamlining assessment of convergence, and producing summary plots. See the vignette for examples. roughsf v1.0.0: Provides functions to draw maps, including “sketchy”, hand-drawn-like maps using the JavaScript library Roughjs. See README for examples.


Malign Overfitting: Interpolation Can Provably Preclude Invariance

arXiv.org Artificial Intelligence

Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e., interpolate) the training data. This suggests that the phenomenon of "benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.


Discretized Linear Regression and Multiclass Support Vector Based Air Pollution Forecasting Technique

arXiv.org Artificial Intelligence

Air pollution is a vital issue emerging from the uncontrolled utilization of traditional energy sources, particularly in developing countries. Hence, ingenious air pollution forecasting methods are indispensable to minimize the risk. To that end, this paper proposes an Internet of Things (IoT) enabled system for monitoring and controlling air pollution in the cloud computing environment. A method called Linear Regression and Multiclass Support Vector (LR-MSV) IoT-based Air Pollution Forecast is proposed to monitor the air quality data and the air quality index measurement to pave the way for effective control. Extensive experiments carried out on the air quality data in the India dataset have revealed the outstanding performance of the proposed LR-MSV method when benchmarked with well-established state-of-the-art methods. The results obtained by the LR-MSV method show a significant increase in air pollution forecasting accuracy, together with reduced forecasting time and error rate, compared with the results produced by the other state-of-the-art methods.


Surgical Scheduling via Optimization and Machine Learning with Long-Tailed Data

arXiv.org Artificial Intelligence

Using data from cardiovascular surgery patients with long and highly variable post-surgical lengths of stay (LOS), we develop a modeling framework to reduce recovery unit congestion. We estimate the LOS and its probability distribution using machine learning models, schedule procedures on a rolling basis using a variety of optimization models, and estimate performance with simulation. The machine learning models achieved only modest LOS prediction accuracy, despite access to a very rich set of patient characteristics. Compared to the current paper-based system used in the hospital, most optimization models failed to reduce congestion without increasing wait times for surgery. A conservative stochastic optimization with sufficient sampling to capture the long tail of the LOS distribution outperformed the current manual process and other stochastic and robust optimization approaches. These results highlight the perils of using oversimplified distributional models of LOS for scheduling procedures and the importance of using optimization methods well-suited to dealing with long-tailed behavior.


Latent SHAP: Toward Practical Human-Interpretable Explanations

arXiv.org Artificial Intelligence

Model agnostic feature attribution algorithms (such as SHAP and LIME) are ubiquitous techniques for explaining the decisions of complex classification models, such as deep neural networks. However, since complex classification models produce superior performance when trained on low-level (or encoded) features, in many cases, the explanations generated by these algorithms are neither interpretable nor usable by humans. Methods proposed in recent studies that support the generation of human-interpretable explanations are impractical, because they require a fully invertible transformation function that maps the model's input features to the human-interpretable features. In this work, we introduce Latent SHAP, a black-box feature attribution framework that provides human-interpretable explanations, without the requirement for a fully invertible transformation function. We demonstrate Latent SHAP's effectiveness using (1) a controlled experiment where invertible transformation functions are available, which enables robust quantitative evaluation of our method, and (2) celebrity attractiveness classification (using the CelebA dataset) where invertible transformation functions are not available, which enables thorough qualitative evaluation of our method.


An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression

arXiv.org Artificial Intelligence

The high-dimensional linear regression model is the most popular statistical model for high-dimensional data, but it is quite a challenging task to achieve a sparse set of regression coefficients. In this paper, we propose a simple heuristic algorithm to construct sparse high-dimensional linear regression models, which is adapted from the shortest-solution guided decimation algorithm and is referred to as ASSD. This algorithm constructs the support of regression coefficients under the guidance of the shortest least-squares solution of the recursively decimated linear models, and it applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, adaptive LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications. Detecting the relationship between a response and a set of predictors is a common problem encountered in different branches of scientific research. This problem is referred to as regression analysis in statistics.


Statistical Learning and Inverse Problems: A Stochastic Gradient Approach

arXiv.org Artificial Intelligence

Inverse Problems (IPs) might be described as the search for an unknown parameter (that could be a function) that satisfies a given, known equation. Considering the notation y = A[f] + noise, where f and y are elements of given Hilbert spaces, we would like to compute (or estimate) f given the data y for some level of noise. Typically, IPs are ill-posed in the sense that the solution does not depend continuously on the data. There are several very important and impressive examples of IPs in our daily lives. Medical imaging has been using IPs for decades, and they have shaped the area, as for instance in Computerized Tomography (CT) and Magnetic Resonance Imaging (MRI). For an introductory text, see Vogel (2002). A vast literature on IPs is devoted to deterministic problems where the noise term is also an element of a Hilbert space and commonly assumed small in norm, which is not usually verified in practice.
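For intuition about the notation y = A[f] + noise, the following sketch treats a finite-dimensional, noise-free version of the problem, where A is a small matrix and f a vector, and recovers f by stochastic gradient descent on the squared residual of one randomly chosen measurement at a time. This is a generic illustration and not the algorithm of the paper; the function name `sgd_least_squares`, the step size, and the data are assumptions.

```python
import random


def sgd_least_squares(A, y, lr=0.05, steps=5000, seed=0):
    """Estimate f in y = A f by SGD, using one random row of A per step."""
    rng = random.Random(seed)
    f = [0.0] * len(A[0])
    for _ in range(steps):
        i = rng.randrange(len(A))
        row = A[i]
        # Residual of the i-th measurement under the current estimate.
        residual = sum(a * fj for a, fj in zip(row, f)) - y[i]
        # Gradient of 0.5 * residual**2 with respect to f is residual * row.
        f = [fj - lr * residual * a for fj, a in zip(f, row)]
    return f


# Made-up forward operator and ground truth (noise-free for simplicity).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
f_true = [2.0, -1.0]
y = [sum(a * fj for a, fj in zip(row, f_true)) for row in A]

f_hat = sgd_least_squares(A, y)
print(f_hat)
```

In this well-conditioned toy case the iterates approach `f_true`; the ill-posed, noisy settings described in the abstract are exactly where such a naive scheme needs additional care.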


Logistic Regression in Python

#artificialintelligence

The logistic regression algorithm is a probabilistic machine learning algorithm used for classification tasks. It is usually the first algorithm you'll try on a classification task. Unlike many machine learning algorithms that seem to be a black box, the logistic regression algorithm is easily understood. In this tutorial, you'll learn everything you need to know about the logistic regression algorithm. You'll start by creating a custom logistic regression algorithm. This will help you understand everything happening under the hood and how to debug problems with your logistic regression models. Next, you'll learn how to train and optimize the Scikit-Learn implementation of the logistic regression algorithm. Finally, you'll learn how to handle multiclass classification tasks with this algorithm. This tutorial covers L1 and L2 regularization, hyperparameter tuning using grid search, automating machine learning workflow with pipeline, one vs rest classifier, object-oriented programming, modular programming, and documenting Python modules with docstring.
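As a hint of what a custom logistic regression might look like under the hood, here is a minimal sketch in plain Python that fits the weights by batch gradient descent on the log loss, with an optional L2 penalty. It is not the tutorial's code; the function names, hyperparameters, and toy data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Probability that the feature vector x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=5000, l2=0.0):
    """Fit weights w and bias b by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # derivative of the log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Average the gradient; the L2 term shrinks the weights but not the bias.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up one-feature data: small values are class 0, large values are class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba([0.0], w, b), predict_proba([5.0], w, b))
```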