 Regression


Data Augmentation for Mental Health Classification on Social Media

arXiv.org Artificial Intelligence

Mental disorders of online users can be identified from their social media posts. The major challenge in this domain is obtaining ethical clearance to use user-generated text from social media platforms. Academic researchers have identified the problem of insufficient and unlabeled data for mental health classification. To handle this issue, we study the effect of data augmentation techniques on domain-specific user-generated text for mental health classification. Among the well-established data augmentation techniques, we identify Easy Data Augmentation (EDA), conditional BERT, and Back Translation (BT) as potential techniques for generating additional text to improve classifier performance. Further, three classifiers, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), are employed to analyze the impact of data augmentation on two publicly available social media datasets. The experimental results show significant improvements in classifier performance when trained on the augmented data.
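
As a rough illustration of what EDA-style augmentation does to a post, the sketch below implements two of the EDA operations, random swap and random deletion, on whitespace-separated tokens. The function names, the default parameters, and the alternating schedule in `augment` are assumptions made for this example; they are not taken from the paper, and the synonym-based EDA operations are omitted because they need an external thesaurus.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with two random positions swapped, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, n_aug=4, p_delete=0.1, n_swaps=1, seed=0):
    """Generate n_aug variants, alternating swap and deletion (hypothetical schedule)."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_words = random_swap(words, n_swaps, rng)
        else:
            new_words = random_deletion(words, p_delete, rng)
        variants.append(" ".join(new_words))
    return variants
```

Calling `augment("i have not slept well for weeks")` returns four perturbed copies of the sentence that could be added to the training set under the original label.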


All the Statistical Tests You Must Do for a Good Linear Regression

#artificialintelligence

The idea of this post is to show the many statistical tests that surround a Linear Regression. I know that it may sound repetitive ("Yet another post about Linear Regression"), but the information I am about to write about is not as widely known as we may think. Don't worry, I will leave the entire code at the end, where you will be able to see what I have imported for each test. As a dataset, I will be using a "toy dataset" from sklearn about wines. For modeling and testing, I will use statsmodels, as it has all of the tests needed in the library.


Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

arXiv.org Machine Learning

We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to the OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from the 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to the state-of-the-art OPE algorithms on a number of benchmark problems.
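
For orientation, the standard doubly robust estimator that the abstract builds on can be sketched as follows: for each logged sample it adds the direct-method prediction under the target policy to an importance-weighted correction of the reward model's error. This is the textbook DR estimator, not the DR-IC estimator proposed in the paper; the log format and all function names are assumptions made for this example.

```python
def dr_estimate(logs, target_policy, reward_model, actions):
    """Standard doubly robust (DR) value estimate for a contextual bandit.

    logs          : iterable of (context, action, reward, logging_prob) tuples,
                    where logging_prob is the logging policy's probability of
                    the logged action (assumed known and positive)
    target_policy : function (context, action) -> probability under the target
    reward_model  : function (context, action) -> predicted reward (the DM part)
    actions       : list of all possible actions
    """
    total = 0.0
    n = 0
    for x, a, r, mu in logs:
        # Direct method term: expected model reward under the target policy.
        dm = sum(target_policy(x, b) * reward_model(x, b) for b in actions)
        # Inverse propensity weight for the action that was actually logged.
        w = target_policy(x, a) / mu
        # Correction term: importance-weighted residual of the reward model.
        total += dm + w * (r - reward_model(x, a))
        n += 1
    return total / n
```

With a reward model that always predicts zero, this reduces to the plain IPS estimator; with a perfect reward model, the correction term vanishes and it reduces to the DM estimator.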


Explainable Deep Reinforcement Learning for Portfolio Management: An Empirical Approach

arXiv.org Artificial Intelligence

Deep reinforcement learning (DRL) has been widely studied in the portfolio management task. However, it is challenging to understand a DRL-based trading strategy because of the black-box nature of deep neural networks. In this paper, we propose an empirical approach to explain the strategies of DRL agents for the portfolio management task. First, we use a linear model in hindsight as the reference model, which finds the best portfolio weights by assuming that the actual stock returns are known in advance. In particular, we use the coefficients of a linear model in hindsight as the reference feature weights. Secondly, for DRL agents, we use integrated gradients to define the feature weights, which are the coefficients between reward and features under a linear regression model. Thirdly, we study the prediction power in two cases, single-step prediction and multi-step prediction. In particular, we quantify the prediction power by calculating the linear correlations between the feature weights of a DRL agent and the reference feature weights, and similarly for machine learning methods. Finally, we evaluate a portfolio management task on Dow Jones 30 constituent stocks from 01/01/2009 to 09/01/2021. Our approach empirically reveals that a DRL agent exhibits a stronger multi-step prediction power than machine learning methods.


TensorFlow - Hands-on Machine Learning with TensorFlow

#artificialintelligence

Learn how to build Machine Learning projects in this TensorFlow Course created by The Click Reader. In this course, you will be learning about Scalar as well as Tensors and how to create them using TensorFlow. You will also be learning how to perform various kinds of Tensor operations for manipulating and changing tensor values.


Trend Following with Logistic Regression

#artificialintelligence

In this post, we'll cover a pragmatic logistic regression classifier to mimic a trend following strategy for the S&P 500 ETF, SPY. The pipeline takes in daily prices for SPY along with several SPDR sector ETFs and macro ETFs for gold, Yen, Swiss Franc, etc. Once all Open, High, Low, Close, and Volume data has been received from yfinance, a feature space (the set of columns, if thinking in a spreadsheet world) is built using select indicators included in TA-lib. The features are then reduced to 4 principal components (n_components) with Principal Component Analysis; the model is trained on these components, using ground truth labels generated by a brute force optimized dual moving average crossover. Initially, I opted to use the default boundary of 0.5 for the binary classification. On visual inspection, there is a gap in this logic, as the classifier appears exceedingly optimistic (subjective).
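
To make the labeling step concrete, here is a minimal sketch of how a dual moving average crossover could turn a price series into binary trend labels, plus a helper that applies a decision boundary to a predicted probability. The window lengths, function names, and the use of simple (rather than exponential) averages are assumptions for illustration; the post's brute force search over window pairs is not reproduced here.

```python
def sma(prices, window):
    """Simple moving average; None until a full window of prices is available."""
    out = []
    running = 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]  # drop the price that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def crossover_labels(prices, fast, slow):
    """Label 1 when the fast SMA is above the slow SMA, else 0 (None in warm-up)."""
    if not 0 < fast < slow:
        raise ValueError("expected 0 < fast < slow")
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    return [None if s is None else int(f > s) for f, s in zip(fast_ma, slow_ma)]


def classify(prob_up, threshold=0.5):
    """Turn a predicted probability of an uptrend into a 0/1 decision."""
    return int(prob_up >= threshold)
```

Raising `threshold` above 0.5 makes the classifier call an uptrend less often, which is one way to temper a model that looks too optimistic at the default boundary.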


Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data

arXiv.org Machine Learning

Data imbalance is common in production data, where controlled production settings require data to fall within a narrow range of variation and data are collected with quality assessment in mind, rather than data analytic insights. This imbalance negatively impacts the predictive performance of models on underrepresented observations. We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data. We investigate the use of three sampling approaches to adjust for imbalance. The goal is to downsample the covariates in the training data and subsequently fit a regression model. We investigate how the predictive power of the model changes when using either the sampled or the original data for training. We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production and find that fitting a model using the sampled data gives a small reduction in the overall predictive performance, but yields a systematically better performance on underrepresented observations. In addition, the results emphasize the need for alternative, fair, and balanced model evaluations.


Consumer adoption of telemedicine in 2021

#artificialintelligence

Thank you to the Stanford Center of Digital Health for their continued collaboration on this work, with special gratitude to Natasha Din, MD, Clark Seninger, MBA, Sravya Rallapalli, Ashish Sarraju, MD, James Tooley, MD, Krishna Pundi, MD, Mario Funes-Hernandez, MD, and Mintu Turakhia, MD. Nearly two years into the COVID-19 pandemic, more consumers have used telemedicine than ever before. Venture investment in telemedicine is up, and big and small players are making land grabs for their share of the market, with many rolling out virtual-first care offerings. So with these accelerants--balanced with the full return of in-person care--what's the state of telemedicine? To answer this question and many more, we have surveyed U.S. adults every year since 2015 to check in with consumers and their relationship to digital health.


Variable Selection and Regularization via Arbitrary Rectangle-range Generalized Elastic Net

arXiv.org Machine Learning

We introduce the arbitrary rectangle-range generalized elastic net penalty method, abbreviated to ARGEN, for performing constrained variable selection and regularization in high-dimensional sparse linear models. As a natural extension of the nonnegative elastic net penalty method, ARGEN is proved to have variable selection consistency and estimation consistency under some conditions. The asymptotic behavior in distribution of the ARGEN estimators has been studied. We also propose an algorithm called MU-QP-RR-W-$l_1$ to efficiently solve ARGEN. By conducting a simulation study, we show that ARGEN outperforms the elastic net in a number of settings. Finally, an application of S&P 500 index tracking with constraints on the stock allocations is performed to provide general guidance for adapting ARGEN to solve real-world problems.


Triangulation candidates for Bayesian optimization

arXiv.org Machine Learning

Bayesian optimization is a form of sequential design: idealize input-output relationships with a suitably flexible nonlinear regression model; fit to data from an initial experimental campaign; devise and optimize a criterion for selecting the next experimental condition(s) under the fitted model (e.g., via predictive equations) to target outcomes of interest (say minima); repeat after acquiring output under those conditions and updating the fit. In many situations this "inner optimization" over the new-data acquisition criterion is cumbersome because it is non-convex/highly multi-modal, may be non-differentiable, or may otherwise thwart numerical optimizers, especially when inference requires Monte Carlo. In such cases it is not uncommon to replace continuous search with a discrete one over random candidates. Here we propose using candidates based on a Delaunay triangulation of the existing input design. In addition to detailing construction of these "tricands", based on a simple wrapper around a conventional convex hull library, we promote several advantages based on properties of the geometric criterion involved. We then demonstrate empirically how tricands can lead to better Bayesian optimization performance compared to both numerically optimized acquisitions and random candidate-based alternatives on benchmark problems.