 Regression


Business analytics meets artificial intelligence: Assessing the demand effects of discounts on Swiss train tickets

arXiv.org Machine Learning

We assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways, the so-called 'supersaver tickets', based on machine learning, a subfield of artificial intelligence. Considering a survey-based sample of buyers of supersaver tickets, we investigate which customer- or trip-related characteristics (including the discount rate) predict buying behavior, namely: booking a trip otherwise not realized by train, buying a first- rather than second-class ticket, or rescheduling a trip (e.g. away from rush hours) when being offered a supersaver ticket. Predictive machine learning suggests that a customer's age, demand-related information for a specific connection (like departure time and utilization), and the discount level permit forecasting buying behavior to a certain extent. Furthermore, we use causal machine learning to assess the impact of the discount rate on rescheduling a trip, which seems relevant in the light of capacity constraints at rush hours. Assuming that (i) the discount rate is quasi-random conditional on our rich set of characteristics and (ii) the buying decision increases weakly monotonically in the discount rate, we identify the discount rate's effect among 'always buyers', who would have traveled even without a discount, based on our survey that asks about customer behavior in the absence of discounts. We find that on average, increasing the discount rate by one percentage point increases the share of rescheduled trips by 0.16 percentage points among always buyers. Investigating effect heterogeneity across observables suggests that the effects are higher for leisure travelers and during peak hours when controlling for several other characteristics.


Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

arXiv.org Machine Learning

Risk modeling with EHR data is challenging due to a lack of direct observations on the disease outcome and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate assisted semi-supervised-learning (SAS) approach to risk modeling with high dimensional predictors, leveraging a large unlabeled dataset on candidate predictors and surrogates of outcome, as well as a small labeled dataset with annotated outcomes. The SAS procedure borrows information from surrogates along with candidate predictors to impute the unobserved outcomes via a sparse working imputation model with moment conditions to achieve robustness against mis-specification in the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach compared to existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.


Data Science Quiz

#artificialintelligence

Interviews are the most challenging part of getting any job, especially for Data Scientist and Machine Learning Engineer roles, where you are tested on Machine Learning and Deep Learning concepts. Given below is a short quiz of 25 questions, comprising MCQs (one or more correct answers), True-False, and Integer Type questions, to check your knowledge. Explanation: The derivative of the Leaky ReLU activation function h(z) is 1 only for z > 0, while for z < 0 it has a very small value. Explanation: Residuals are vertical offsets, and the sum of residuals is zero for a least-squares fit that includes an intercept. Explanation: For deciding class w1, the conditional risk for w1 is smaller than that for w2.
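The Leaky ReLU explanation above can be checked with a minimal Python sketch. The function names and the slope value `alpha = 0.01` are assumptions chosen for illustration; the quiz itself does not specify them.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: returns z for z > 0 and alpha * z otherwise.

    alpha is a hypothetical small slope; 0.01 is a common choice.
    """
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for z > 0, alpha for z < 0.

    The derivative is undefined at z == 0; this sketch returns alpha there
    purely by convention.
    """
    return 1.0 if z > 0 else alpha


for z in (-2.0, 0.5, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

The gradient is exactly 1 on the positive side and equals the small constant `alpha` on the negative side, which is the property the quiz explanation describes.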


Directional FDR Control for Sub-Gaussian Sparse GLMs

arXiv.org Machine Learning

High-dimensional sparse generalized linear models (GLMs) have emerged in settings where both the number of samples and the dimension of variables are large, and the dimension of variables may even grow faster than the number of samples. False discovery rate (FDR) control aims to identify a small number of statistically significant nonzero results after obtaining the sparse penalized estimation of GLMs. Using the CLIME method for precision matrix estimation, we construct the debiased-Lasso estimator and prove its asymptotic normality by minimax-rate oracle inequalities for sparse GLMs. In practice, it is often necessary to judge accurately whether each regression coefficient is positive or negative, which determines whether the predictor variable is positively or negatively related to the response variable conditionally on the remaining variables. Using the debiased estimator, we establish multiple testing procedures. Under mild conditions, we show that the proposed debiased statistics can asymptotically control the directional (sign) FDR and directional false discovery variables (FDV) at a pre-specified significance level. Moreover, it can be shown that our multiple testing procedure can approximately achieve a statistical power of 1. We also extend our methods to two-sample problems and propose two-sample test statistics. Under suitable conditions, we can asymptotically achieve directional FDR control and directional FDV control at the specified significance level for two-sample problems. Numerical simulations verify the FDR control of our proposed testing procedures, which sometimes outperform the classical knockoff method.



A Beginner's Guide to Regression Analysis in Machine Learning

#artificialintelligence

To understand the motivation behind regression, let's consider the following simple example. The scatter plot below shows the number of college graduates in the US from the year 2001 to 2012. Now, based on the available data, what if someone asks you how many college graduates with master's degrees there will be in the year 2018? It can be seen that the number of college graduates with master's degrees increases almost linearly with the year. So by simple visual analysis, we can get a rough estimate of that number: between 2.0 and 2.1 million.


2021 Python for Linear Regression in Machine Learning

#artificialintelligence

This course teaches you an in-depth analysis of Linear Regression. We cover the theory and coding part together for better understanding. You will lea


Linear Regression in Machine Learning

#artificialintelligence

Regression is a method for predicting a continuous target (dependent) variable from independent features. It is a supervised learning technique. It is a statistical tool used to find the relationship between the outcome (dependent) variable and one or more independent variables. Linear regression finds the linear relationship between the target and one or more predictors. Simple linear regression finds the relationship between one dependent variable (Y) and one independent variable (X) by fitting the best-fit line that minimizes the errors. A fitness function says how good your model is; alternatively, you can define a cost function that measures how bad it is.
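As a minimal sketch of the idea above, the following Python code fits a best-fit line by ordinary least squares and evaluates a mean-squared-error cost. The data points and function names are made up for illustration, and mean squared error is one common choice of cost function, not necessarily the one the article uses.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Cost function: mean squared error of the line on the data."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print("slope:", slope, "intercept:", intercept)
print("cost of fitted line:", mse(xs, ys, slope, intercept))
print("cost of the line y = 2x:", mse(xs, ys, 2.0, 0.0))
```

The fitted line has a lower cost than the hand-picked line y = 2x, and because the model includes an intercept, its residuals sum to zero up to floating-point error.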


Machine Learning Project on Sales Prediction or Sale Forecast - Projects Based Learning

#artificialintelligence

It is easier for established companies to predict future sales based on years of past business data. Newly founded companies have to base their forecasts on less-verified information, such as market research and competitive intelligence, to forecast their future business. Sales forecasting gives insight into how a company should manage its workforce, cash flow, and resources. In addition to helping a company allocate its internal resources effectively, predictive sales data is important for businesses when looking to acquire investment capital. Sales forecasting allows companies to: predict achievable sales revenue; efficiently allocate resources; plan for future growth. In this project, we look at the sales of various stores around the world and are tasked with predicting their daily sales in advance.


Probabilistic water demand forecasting using quantile regression algorithms

#artificialintelligence

Machine and statistical learning algorithms can be reliably automated and applied at scale. Therefore, they can constitute a considerable asset for designing practical forecasting systems, such as those related to urban water demand. Quantile regression algorithms are statistical and machine learning algorithms that can provide probabilistic forecasts in a straightforward way, and have not been applied so far for urban water demand forecasting. In this work, we aim to fill this gap by automating and extensively comparing several quantile-regression-based practical systems for probabilistic one-day ahead urban water demand forecasting. For designing the practical systems, we use five individual algorithms (i.e., the quantile regression, linear boosting, generalized random forest, gradient boosting machine and quantile regression neural network algorithms), their mean combiner and their median combiner.
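To make the quantile-forecast idea concrete, here is a minimal Python sketch of the pinball (quantile) loss, the standard score for a single quantile forecast, together with mean and median combiners. The function names and the five forecast values are hypothetical and are not taken from the paper, whose exact evaluation setup is not described in this abstract.

```python
from statistics import mean, median


def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for one quantile forecast at level tau in (0, 1).

    Under-prediction is weighted by tau, over-prediction by (1 - tau).
    """
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1) * diff


def combine(forecasts, how="mean"):
    """Combine quantile forecasts from several algorithms into one value."""
    if how == "mean":
        return mean(forecasts)
    if how == "median":
        return median(forecasts)
    raise ValueError("how must be 'mean' or 'median'")


# Hypothetical 0.9-quantile forecasts of next-day demand from five algorithms.
forecasts = [100.0, 104.0, 98.0, 120.0, 103.0]
mean_forecast = combine(forecasts, "mean")
median_forecast = combine(forecasts, "median")

observed = 110.0  # hypothetical observed demand
print("mean combiner loss:", pinball_loss(observed, mean_forecast, 0.9))
print("median combiner loss:", pinball_loss(observed, median_forecast, 0.9))
```

The median combiner is less sensitive to the single outlying forecast (120.0) than the mean combiner, which is one reason both are often compared.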