Regression
Price Suggestion for Online Second-hand Items with Texts and Images
Han, Liang, Yin, Zhaozheng, Xia, Zhurong, Tang, Mingqian, Jin, Rong
This paper presents an intelligent price suggestion system for online second-hand listings based on their uploaded images and text descriptions. The goal of price prediction is to help sellers set effective and reasonable prices for their second-hand items with the images and text descriptions uploaded to the online platforms. Specifically, we design a multi-modal price suggestion system which takes as input the extracted visual and textual features along with some statistical item features collected from the second-hand item shopping platform to determine whether the image and text of an uploaded second-hand item are qualified for reasonable price suggestion with a binary classification model, and provide price suggestions for second-hand items with qualified images and text descriptions with a regression model. To satisfy different demands, two different constraints are added into the joint training of the classification model and the regression model. Moreover, a customized loss function is designed for optimizing the regression model to provide price suggestions for second-hand items, which can not only maximize the gain of the sellers but also facilitate the online transaction. We also derive a set of metrics to better evaluate the proposed price suggestion system. Extensive experiments on a large real-world dataset demonstrate the effectiveness of the proposed multi-modal price suggestion system.
Machine Learning With R: Logistic Regression
Our little journey to machine learning with R continues! Today's topic is logistic regression, as an introduction to machine learning classification tasks. We'll cover data preparation, modeling, and evaluation of the well-known Titanic dataset. That's it for the introduction section; we have many things to cover, so let's jump right to it. Logistic regression is a great introductory algorithm for binary classification (two class values) borrowed from the field of statistics.
Consistent regression of biophysical parameters with kernel methods
Díaz, Emiliano, Pérez-Suay, Adrián, Laparra, Valero, Camps-Valls, Gustau
This paper introduces a novel statistical regression framework that allows the incorporation of consistency constraints. Linear and nonlinear (kernel-based) formulations are introduced, and both have closed-form analytical solutions. The models exploit all the information from a set of drivers while being maximally independent of a set of auxiliary, protected variables. We successfully illustrate the performance in the estimation of chlorophyll content.
Optimal Survival Trees
Bertsimas, Dimitris, Dunn, Jack, Gibson, Emma, Orfanoudaki, Agni
Survival analysis methods are required for censored data in which the outcome of interest is generally the time until an event (onset of disease, death, etc.), but the exact time of the event is unknown (censored) for some individuals. When a lower bound for these missing values is known (for example, a patient is known to be alive until at least time t) the data is said to be right-censored. A common survival analysis technique is Cox proportional hazards regression (Cox, 1972) which models the hazard rate for an event as a linear combination of covariate effects. Although this model is widely used and easily interpreted, its parametric nature makes it unable to identify nonlinear effects or interactions between covariates (Bou-Hamad et al., 2011). Recursive partitioning techniques (also referred to as trees) are a popular alternative to parametric models. When applied to survival data, survival tree algorithms partition the covariate space into smaller and smaller regions (nodes) containing observations with homogeneous survival outcomes.
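As a reading aid, the Cox proportional hazards model mentioned above is usually written as follows. The symbols are assumptions chosen for illustration and do not come from the abstract: h_0(t) is a baseline hazard, x is the covariate vector, and \beta is the coefficient vector. In this form the linear combination of covariate effects enters through the log hazard ratio.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\log \frac{h(t \mid x)}{h_0(t)} = \beta^{\top} x
```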
Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately
The presence of spurious features interferes with the goal of obtaining robust models that perform well across many groups within the population. A natural remedy is to remove spurious features from the model. However, in this work we show that removal of spurious features can decrease accuracy due to the inductive biases of overparameterized models. We completely characterize how the removal of spurious features affects accuracy across different groups (more generally, test distributions) in noiseless overparameterized linear regression. In addition, we show that removal of spurious features can decrease accuracy even in balanced datasets, where each target co-occurs equally with each spurious feature, and that it can inadvertently make the model more susceptible to other spurious features. Finally, we show that robust self-training can remove spurious features without affecting the overall accuracy. Experiments on the Toxic-Comment-Detection and CelebA datasets show that our results hold in non-linear models.
A PAC-Bayesian Perspective on Structured Prediction with Implicit Loss Embeddings
Cantelobre, Théophile, Guedj, Benjamin, Pérez-Ortiz, María, Shawe-Taylor, John
Many practical machine learning tasks can be framed as structured prediction problems, where several output variables are predicted and considered interdependent. Recent theoretical advances in structured prediction have focused on obtaining fast-rate convergence guarantees, especially in the Implicit Loss Embedding (ILE) framework. PAC-Bayes has gained interest recently for its capacity to produce tight risk bounds for predictor distributions. This work proposes a novel PAC-Bayes perspective on the ILE structured prediction framework. We present two generalization bounds, on the risk and excess risk, which yield insights into the behavior of ILE predictors. Two learning algorithms are derived from these bounds.
Explainable Artificial Intelligence: How Subsets of the Training Data Affect a Prediction
Brandsæter, Andreas, Glad, Ingrid K.
There is an increasing interest in and demand for interpretations and explanations of machine learning models and predictions in various application areas. In this paper, we consider data-driven models which are already developed, implemented and trained. Our goal is to interpret the models and explain and understand their predictions. Since the predictions made by data-driven models rely heavily on the data used for training, we believe explanations should convey information about how the training data affects the predictions. To do this, we propose a novel methodology which we call Shapley values for training data subset importance. The Shapley value concept originates from coalitional game theory, developed to fairly distribute the payout among a set of cooperating players. We extend this to subset importance, where a prediction is explained by treating the subsets of the training data as players in a game where the predictions are the payouts. We describe and illustrate how the proposed method can be useful and demonstrate its capabilities on several examples. We show how the proposed explanations can be used to reveal bias in models and erroneous training data. Furthermore, we demonstrate that when predictions are accurately explained in a known situation, then explanations of predictions by simple models correspond to the intuitive explanations. We argue that the explanations enable us to perceive more of the inner workings of the algorithms, and illustrate how models producing similar predictions can be based on very different parts of the training data. Finally, we show how we can use Shapley values for subset importance to enhance our training data acquisition, thereby reducing prediction error.
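The idea of treating training-data subsets as players can be sketched with an exact Shapley computation over a handful of subsets. This is a minimal illustration, not the authors' implementation: the "model" is a hypothetical mean predictor, the empty coalition is assumed to predict 0.0, and the function names are invented for this example.

```python
from itertools import combinations
from math import factorial


def predict_mean(train_targets):
    """Toy 'model' (assumption): predict the mean of the training targets.

    An empty training set predicts 0.0; this baseline is an arbitrary choice.
    """
    return sum(train_targets) / len(train_targets) if train_targets else 0.0


def subset_shapley(subsets, predict=predict_mean):
    """Exact Shapley value of each training subset for a single prediction.

    Players are the subsets; the payout of a coalition is the prediction of
    a model trained on the union of the coalition's data.
    """
    n = len(subsets)
    players = range(n)

    def value(coalition):
        data = [y for i in coalition for y in subsets[i]]
        return predict(data)

    phi = []
    for i in players:
        others = [j for j in players if j != i]
        total = 0.0
        for k in range(n):  # k = size of the coalition that player i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                total += weight * (value(coalition + (i,)) - value(coalition))
        phi.append(total)
    return phi


# Three hypothetical training subsets (target values only).
subsets = [[1.0, 2.0], [3.0], [10.0, 12.0]]
contributions = subset_shapley(subsets)
print(contributions)
```

The contributions sum to the full-data prediction minus the empty-coalition baseline. The exact computation enumerates every coalition, so it is only practical for a small number of subsets.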
LOWESS Regression in Python: How to Discover Clear Patterns in Your Data?
Machine Learning is making huge leaps forward, with an increasing number of algorithms enabling us to solve complex real-world problems. This story is part of a deep dive series explaining the mechanics of Machine Learning algorithms. In addition to giving you an understanding of how ML algorithms work, it also provides you with Python examples to build your own ML models. Locally Weighted Scatterplot Smoothing sits within the family of regression algorithms under the umbrella of Supervised Learning. This means that you need a set of labeled data with a numerical target variable to train your model.
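To make the idea of locally weighted smoothing concrete, here is a minimal pure-Python sketch. It assumes tricube weights and a local straight-line fit at each point, and it omits the robustness iterations that full LOWESS implementations usually add. The function name and the `frac` parameter are chosen for this illustration and are not taken from the article.

```python
import math


def lowess_sketch(x, y, frac=0.5):
    """Locally weighted linear smoothing (simplified, non-robust).

    For each x0, fit a weighted least-squares line using tricube weights
    over the nearest ceil(frac * n) points, then evaluate the line at x0.
    """
    n = len(x)
    k = max(2, math.ceil(frac * n))  # neighbourhood size
    fitted = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th neighbour
        if h == 0:
            # All k nearest points share x0: weight only those points.
            w = [1.0 if d == 0 else 0.0 for d in dists]
        else:
            w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dists]

        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            # Degenerate neighbourhood: fall back to the weighted mean.
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


# Noisy-looking toy data (made up for illustration).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.0, 2.7, 5.2, 6.8, 9.1, 11.3, 12.8, 15.2, 16.9, 19.1]
print(lowess_sketch(xs, ys, frac=0.5))
```

A smaller `frac` follows the data more closely, while a larger one gives a smoother curve.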
Urban Crowdsensing using Social Media: An Empirical Study on Transformer and Recurrent Neural Networks
Heng, Jerome, Liu, Junhua, Lim, Kwan Hui
An important aspect of urban planning is understanding crowd levels at various locations, which typically requires the use of physical sensors. Such sensors are potentially costly and time-consuming to implement on a large scale. To address this issue, we utilize publicly available social media datasets and use them as the basis for two urban sensing problems, namely event detection and crowd level prediction. One main contribution of this work is our collected dataset from Twitter and Flickr, alongside ground truth events. We demonstrate the usefulness of this dataset with two preliminary supervised learning approaches: first, a series of neural network models to determine whether a social media post is related to an event, and second, a regression model using social media post counts to predict actual crowd levels. We discuss preliminary results from these tasks and highlight some challenges.