Regression
An Upgraded Marketing Mix Modeling in Python
In my last article, I introduced you to the world of marketing mix modeling. If you have not read it yet, please do so before you proceed. There, we created a linear regression model that predicts sales from raw advertising spends in several advertising channels, such as TV, radio, and web banners. For me as a machine learning practitioner, such a model is nice already on its own. Even better, it also makes business people happy because the model lets us calculate ROIs, allowing us to judge how well each channel performed.
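As a rough illustration of the idea (not the article's actual code or data), the sketch below fits an ordinary least squares model on made-up weekly spends and then computes sales per unit of spend for each channel. The channel order, the numbers, and the names `spend`, `true_coef`, and `roi` are all assumptions for this example. In a plain linear model like this one, a channel's attributed sales divided by its spend reduces to its coefficient.

```python
import numpy as np

# Hypothetical weekly spends per channel (columns: TV, radio, banners).
spend = np.array([
    [100.0, 20.0, 5.0],
    [150.0, 10.0, 8.0],
    [80.0, 30.0, 2.0],
    [200.0, 25.0, 9.0],
    [120.0, 15.0, 4.0],
])

# Synthetic, noise-free sales built from known effects so the fit can be checked.
true_base = 50.0
true_coef = np.array([0.5, 1.2, 2.0])
sales = true_base + spend @ true_coef

# Ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones(len(spend)), spend])
params, *_ = np.linalg.lstsq(X, sales, rcond=None)
base, coef = params[0], params[1:]

# Sales attributed to each channel, and attributed sales per unit of spend.
contribution = (spend * coef).sum(axis=0)
roi = contribution / spend.sum(axis=0)
```

With real data the sales would be noisy, so the recovered coefficients would only approximate the underlying effects.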
Introduction to Polynomial Regression Analysis
Polynomial regression is one of the machine learning algorithms used for making predictions. For example, polynomial regression is widely applied to predict the spread rate of COVID-19 and other infectious diseases. If you would like to learn more about what polynomial regression analysis is, continue reading. Regression analysis is a helpful statistical tool for studying the relationship between two sets of events or, statistically speaking, variables: a dependent variable and one or more independent variables. For example, your weight loss (dependent variable) depends on the number of hours you spend in the gym (independent variable).
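A minimal sketch of polynomial regression, assuming NumPy is available and using synthetic data rather than anything from the article: `numpy.polyfit` fits a polynomial of a chosen degree by least squares, and `numpy.polyval` evaluates it at new points. The variable names and the quadratic used to generate `y` are illustrative assumptions.

```python
import numpy as np

# Independent variable and a synthetic dependent variable that follows
# a known quadratic, so the fitted coefficients can be checked.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x**2 - 3.0 * x + 1.0

# Fit a degree-2 polynomial; coefficients come back highest power first.
coeffs = np.polyfit(x, y, deg=2)

# Predict the dependent variable at a new point.
pred = np.polyval(coeffs, 6.0)
```

The degree is a modeling choice: a higher degree follows the training points more closely but may generalize worse.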
ML-Logistic Regression
There are optimization algorithms other than gradient descent. These algorithms automatically pick an appropriate learning rate alpha and are usually faster. One way to handle more than two classes is "one-vs-all" binary classification. For each class, we train a binary classifier that separates that class from all the other classes, and then we select the class whose classifier gives the largest hypothesis output. Since we have 3 classes here, we do the binary classification 3 times.
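The prediction step of one-vs-all can be sketched as follows. This is only an illustration: the three parameter vectors in `thetas` are hand-picked for the example rather than trained, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """Logistic hypothesis: sigmoid(theta[0] + theta[1:] . x)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def predict_one_vs_all(thetas, x):
    """Return the index of the class whose classifier scores x highest."""
    scores = [hypothesis(theta, x) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical parameters, one binary classifier per class
# (bias first, then one weight per feature). Not trained.
thetas = [
    [0.0, 1.0, 0.0],    # class 0: favors a large first feature
    [0.0, 0.0, 1.0],    # class 1: favors a large second feature
    [1.0, -1.0, -1.0],  # class 2: favors both features being small
]

label = predict_one_vs_all(thetas, [3.0, 0.0])
```

In practice each parameter vector would come from fitting a separate binary logistic regression in which the target is 1 for that class and 0 for every other class.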
A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification
Herde, Marek, Huseljic, Denis, Sick, Bernhard, Calma, Adrian
Pool-based active learning (AL) aims to optimize the annotation process (i.e., labeling) as the acquisition of annotations is often time-consuming and therefore expensive. For this purpose, an AL strategy queries annotations intelligently from annotators to train a high-performance classification model at a low annotation cost. Traditional AL strategies operate in an idealized framework. They assume a single, omniscient annotator who never gets tired and charges uniformly regardless of query difficulty. However, in real-world applications, we often face human annotators, e.g., crowd or in-house workers, who make annotation mistakes and can be reluctant to respond if tired or faced with complex queries. Recently, a wide range of novel AL strategies has been proposed to address these issues. They differ in at least one of the following three central aspects from traditional AL: (1) They explicitly consider (multiple) human annotators whose performances can be affected by various factors, such as missing expertise. (2) They generalize the interaction with human annotators by considering different query and annotation types, such as asking an annotator for feedback on an inferred classification rule. (3) They take more complex cost schemes regarding annotations and misclassifications into account. This survey provides an overview of these AL strategies and refers to them as real-world AL. Therefore, we introduce a general real-world AL strategy as part of a learning cycle and use its elements, e.g., the query and annotator selection algorithm, to categorize about 60 real-world AL strategies. Finally, we outline possible directions for future research in the field of AL.
High-dimensional regression with potential prior information on variable importance
Stokell, Benjamin G., Shah, Rajen D.
There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on GitHub.
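To make the idea of "a sequence of models indicated by the ordering" concrete, here is a deliberately naive sketch. It is not the authors' algorithm: it refits ridge regression from scratch for each nested model instead of using the efficient scheme the abstract describes, and it uses a single hold-out split in place of cross-validation. The data, the penalty `lam=1.0`, and all names are assumptions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Synthetic data: predictors with different scales, only two of them relevant.
rng = np.random.default_rng(0)
n, p = 200, 6
scales = np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.2])  # hypothetical
X = rng.normal(size=(n, p)) * scales
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Prior ordering: predictors sorted by decreasing empirical variance.
order = np.argsort(-X.var(axis=0))

# Fit nested models on the first k predictors and score each on held-out rows.
train, valid = slice(0, 150), slice(150, n)
errors = []
for k in range(1, p + 1):
    cols = order[:k]
    coef = ridge_fit(X[train][:, cols], y[train], lam=1.0)
    resid = y[valid] - X[valid][:, cols] @ coef
    errors.append(float(np.mean(resid**2)))

# Keep the model size with the smallest hold-out error.
best_k = int(np.argmin(errors)) + 1
```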
Amazon SageMaker tutorial and model
This code pattern describes a way to gain insights by using Watson OpenScale and a SageMaker machine learning model. It explains how to create a logistic regression model using Amazon SageMaker with data from the UC Irvine machine learning database. The pattern uses Watson OpenScale to bind the machine learning model deployed in the AWS cloud, create a subscription, and perform payload and feedback logging. With Watson OpenScale, you can monitor model quality and log payloads, regardless of where the model is hosted. This code pattern uses the example of an Amazon Web Services (AWS) SageMaker model, which demonstrates the independent and open nature of Watson OpenScale.
Facilitating human-wildlife cohabitation through conflict prediction
Ghosh, Susobhan, Varakantham, Pradeep, Bhatkhande, Aniket, Ahmad, Tamanna, Andheria, Anish, Li, Wenjun, Taneja, Aparna, Thakkar, Divy, Tambe, Milind
With increasing world population and expanded use of forests as cohabited regions, interactions and conflicts with wildlife are increasing, leading to large-scale loss of lives (animal and human) and livelihoods (economic). While community knowledge is valuable, forest officials and conservation organisations can greatly benefit from predictive analysis of human-wildlife conflict, leading to targeted interventions that can potentially help save lives and livelihoods. However, the problem of prediction is a complex socio-technical problem in the context of limited data in low-resource regions. Identifying the "right" features to make accurate predictions of conflicts at the required spatial granularity using a sparse conflict training dataset is the key challenge that we address in this paper. Specifically, we do an illustrative case study on human-wildlife conflicts in the Bramhapuri Forest Division in Chandrapur, Maharashtra, India. Most existing work has considered human-wildlife conflicts in protected areas and to the best of our knowledge, this is the first effort at prediction of human-wildlife conflicts in unprotected areas and using those predictions for deploying interventions on the ground.
Supervised Learning algorithms cheat-sheet
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used later for mapping new examples. The most popular supervised learning tasks are regression and classification. The result of solving the regression task is a model that can make numerical predictions. The result of solving the classification task is a model that can make class predictions.
Simple Linear Regression
Linear regression is an algorithm used to predict or visualise a relationship between two different features/variables. In linear regression tasks, there are two kinds of variables being examined: the dependent variable and the independent variable. Let us build our first simple linear regression model and learn along the way. This particular model is called simple because it has only one independent variable. Here we use data containing people's salaries and working experience to predict someone's salary based on their experience.
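A minimal sketch of such a model, using the closed-form least squares solution for one independent variable. The experience and salary figures are made up for illustration and are not the article's dataset; the function name is likewise an assumption.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience and the corresponding salary.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35000.0, 40000.0, 45000.0, 50000.0, 55000.0]

slope, intercept = fit_simple_linear_regression(years, salary)

# Predict the salary for someone with 6 years of experience.
predicted = intercept + slope * 6.0
```

Here the slope is the estimated salary increase per additional year of experience, and the intercept is the estimated salary at zero years.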
Personalized Online Machine Learning
Malenica, Ivana, Phillips, Rachael V., Pirracchio, Romain, Chambaz, Antoine, Hubbard, Alan, van der Laan, Mark J.
In this work, we introduce the Personalized Online Super Learner (POSL) -- an online ensembling algorithm for streaming data whose optimization procedure accommodates varying degrees of personalization. Namely, POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized (i.e., optimization with respect to baseline covariate subject ID) to many individuals (i.e., optimization with respect to common baseline covariates). As an online algorithm, POSL learns in real-time. POSL can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed algorithms that are never updated during the procedure, pooled algorithms that learn from many individuals' time-series, and individualized algorithms that learn from within a single time-series. POSL's ensembling of this hybrid of base learning strategies depends on the amount of data collected, the stationarity of the time-series, and the mutual characteristics of a group of time-series. In essence, POSL decides whether to learn across samples, through time, or both, based on the underlying (unknown) structure in the data. For a wide range of simulations that reflect realistic forecasting scenarios, and in a medical data application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for time-series data and adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time-series enter/exit dynamically over chronological time.