Regression
Twin support vector quantile regression
Ye, Yafen, Xu, Zhihu, Zhang, Jinhua, Chen, Weijie, Shao, Yuanhai
We propose a twin support vector quantile regression (TSVQR) to capture the heterogeneous and asymmetric information in modern data. Using a quantile parameter, TSVQR effectively depicts the heterogeneous distribution information with respect to all portions of data points. Correspondingly, TSVQR constructs two smaller sized quadratic programming problems (QPPs) to generate two nonparallel planes to measure the distributional asymmetry between the lower and upper bounds at each quantile level. The QPPs in TSVQR are smaller and easier to solve than those in previous quantile regression methods. Moreover, the dual coordinate descent algorithm for TSVQR also accelerates the training speed. Experimental results on six artiffcial data sets, ffve benchmark data sets, two large scale data sets, two time-series data sets, and two imbalanced data sets indicate that the TSVQR outperforms previous quantile regression methods in terms of the effectiveness of completely capturing the heterogeneous and asymmetric information and the efffciency of the learning process.
Carbon Price Forecasting with Quantile Regression and Feature Selection
Pang, Tianqi, Tan, Kehui, Fan, Chenyou
Carbon futures has recently emerged as a novel financial asset in the trading markets such as the European Union and China. Monitoring the trend of the carbon price has become critical for both national policy-making as well as industrial manufacturing planning. However, various geopolitical, social, and economic factors can impose substantial influence on the carbon price. Due to its volatility and non-linearity, predicting accurate carbon prices is generally a difficult task. In this study, we propose to improve carbon price forecasting with several novel practices. First, we collect various influencing factors, including commodity prices, export volumes such as oil and natural gas, and prosperity indices. Then we select the most significant factors and disclose their optimal grouping for explainability. Finally, we use the Sparse Quantile Group Lasso and Adaptive Sparse Quantile Group Lasso for robust price predictions. We demonstrate through extensive experimental studies that our proposed methods outperform existing ones. Also, our quantile predictions provide a complete profile of future prices at different levels, which better describes the distributions of the carbon market.
Exploring the impact of weather on Metro demand forecasting using machine learning method
Hu, Yiming, Huang, Yangchuan, Liu, Shuying, Qi, Yuanyang, Bai, Danhui
Urban rail transit provides significant comprehensive benefits such as large traffic volume and high speed, serving as one of the most important components of urban traffic construction management and congestion solution. Using real passenger flow data of an Asian subway system from April to June of 2018, this work analyzes the space-time distribution of the passenger flow using short-term traffic flow prediction. Stations are divided into four types for passenger flow forecasting, and meteorological records are collected for the same period. Then, machine learning methods with different inputs are applied and multivariate regression is performed to evaluate the improvement effect of each weather element on passenger flow forecasting of representative metro stations on hourly basis. Our results show that by inputting weather variables the precision of prediction on weekends enhanced while the performance on weekdays only improved marginally, while the contribution of different elements of weather differ. Also, different categories of stations are affected differently by weather. This study provides a possible method to further improve other prediction models, and attests to the promise of data-driven analytics for optimization of short-term scheduling in transit management.
Judge Me in Context: A Telematics-Based Driving Risk Prediction Framework in Presence of Weak Risk Labels
Moosavi, Sobhan, Ramnath, Rajiv
Driving risk prediction has been a topic of much research over the past few decades to minimize driving risk and increase safety. The use of demographic information in risk prediction is a traditional solution with applications in insurance planning, however, it is difficult to capture true driving behavior via such coarse-grained factors. Therefor, the use of telematics data has gained a widespread popularity over the past decade. While most of the existing studies leverage demographic information in addition to telematics data, our objective is to maximize the use of telematics as well as contextual information (e.g., road-type) to build a risk prediction framework with real-world applications. We contextualize telematics data in a variety of forms, and then use it to develop a risk classifier, assuming that there are some weak risk labels available (e.g., past traffic citation records). Before building a risk classifier though, we employ a novel data-driven process to augment weak risk labels. Extensive analysis and results based on real-world data from multiple major cities in the United States demonstrate usefulness of the proposed framework.
Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey
Chen, Tao, Lv, Liang, Wang, Di, Zhang, Jing, Yang, Yue, Zhao, Zeyang, Wang, Chen, Guo, Xiaowei, Chen, Hao, Wang, Qingye, Xu, Yufei, Zhang, Qiming, Du, Bo, Zhang, Liangpei, Tao, Dacheng
With the world population rapidly increasing, transforming our agrifood systems to be more productive, efficient, safe, and sustainable is crucial to mitigate potential food shortages. Recently, artificial intelligence (AI) techniques such as deep learning (DL) have demonstrated their strong abilities in various areas, including language, vision, remote sensing (RS), and agrifood systems applications. However, the overall impact of AI on agrifood systems remains unclear. In this paper, we thoroughly review how AI techniques can transform agrifood systems and contribute to the modern agrifood industry. Firstly, we summarize the data acquisition methods in agrifood systems, including acquisition, storage, and processing techniques. Secondly, we present a progress review of AI methods in agrifood systems, specifically in agriculture, animal husbandry, and fishery, covering topics such as agrifood classification, growth monitoring, yield prediction, and quality assessment. Furthermore, we highlight potential challenges and promising research opportunities for transforming modern agrifood systems with AI. We hope this survey could offer an overall picture to newcomers in the field and serve as a starting point for their further research.
Explore the difficulty of words and its influential attributes based on the Wordle game
Liu, Beibei, Zhang, Yuanfang, Zhang, Shiyu
We adopt the distribution and expectation of guessing times in game Wordle as metrics to predict the difficulty of words and explore their influence factors. In order to predictthe difficulty distribution, we use Monte Carlo to simulate the guessing process of players and then narrow the gap between raw and actual distribution of guessing times for each word with Markov which generates the associativity of words. Afterwards, we take advantage of lasso regression to predict the deviation of guessing times expectation and quadratic programming to obtain the correction of the original distribution.To predict the difficulty levels, we first use hierarchical clustering to classify the difficulty levels based on the expectation of guessing times. Afterwards we downscale the variables of lexical attributes based on factor analysis. Significant factors include the number of neighboring words, letter similarity, sub-string similarity, and word frequency. Finally, we build the relationship between lexical attributes and difficulty levels through ordered logistic regression.
Semisupervised regression in latent structure networks on unknown manifolds
Acharyya, Aranyak, Agterberg, Joshua, Trosset, Michael W., Park, Youngser, Priebe, Carey E.
Random graphs are increasingly becoming objects of interest for modeling networks in a wide range of applications. Latent position random graph models posit that each node is associated with a latent position vector, and that these vectors follow some geometric structure in the latent space. In this paper, we consider random dot product graphs, in which an edge is formed between two nodes with probability given by the inner product of their respective latent positions. We assume that the latent position vectors lie on an unknown one-dimensional curve and are coupled with a response covariate via a regression model. Using the geometry of the underlying latent position vectors, we propose a manifold learning and graph embedding technique to predict the response variable on out-of-sample nodes, and we establish convergence guarantees for these responses. Our theoretical results are supported by simulations and an application to Drosophila brain data.
3D Laser-and-tissue Agnostic Data-driven Method for Robotic Laser Surgical Planning
Ma, Guangshen, Prakash, Ravi, Mann, Brian, Ross, Weston, Codd, Patrick
In robotic laser surgery, shape prediction of an one-shot ablation cavity is an important problem for minimizing errant overcutting of healthy tissue during the course of pathological tissue resection and precise tumor removal. Since it is difficult to physically model the laser-tissue interaction due to the variety of optical tissue properties, complicated process of heat transfer, and uncertainty about the chemical reaction, we propose a 3D cavity prediction model based on an entirely data-driven method without any assumptions of laser settings and tissue properties. Based on the cavity prediction model, we formulate a novel robotic laser planning problem to determine the optimal laser incident configuration, which aims to create a cavity that aligns with the surface target (e.g. tumor, pathological tissue). To solve the one-shot ablation cavity prediction problem, we model the 3D geometric relation between the tissue surface and the laser energy profile as a non-linear regression problem that can be represented by a single-layer perceptron (SLP) network. The SLP network is encoded in a novel kinematic model to predict the shape of the post-ablation cavity with an arbitrary laser input. To estimate the SLP network parameters, we formulate a dataset of one-shot laser-phantom cavities reconstructed by the optical coherence tomography (OCT) B-scan images for the data-driven modelling. To verify the method. The learned cavity prediction model is applied to solve a simplified robotic laser planning problem modelled as a surface alignment error minimization problem. The initial results report (91.1 +- 3.0)% 3D-cavity-Intersection-over-Union (3D-cavity-IoU) for the 3D cavity prediction and an average of 97.9% success rate for the simulated surface alignment experiments.
Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria
Liao, Yiqiao, Naghizadeh, Parinaz
Although many fairness criteria have been proposed to ensure that machine learning algorithms do not exhibit or amplify our existing social biases, these algorithms are trained on datasets that can themselves be statistically biased. In this paper, we investigate the robustness of a number of existing (demographic) fairness criteria when the algorithm is trained on biased data. We consider two forms of dataset bias: errors by prior decision makers in the labeling process, and errors in measurement of the features of disadvantaged individuals. We analytically show that some constraints (such as Demographic Parity) can remain robust when facing certain statistical biases, while others (such as Equalized Odds) are significantly violated if trained on biased data. We also analyze the sensitivity of these criteria and the decision maker's utility to biases. We provide numerical experiments based on three real-world datasets (the FICO, Adult, and German credit score datasets) supporting our analytical findings. Our findings present an additional guideline for choosing among existing fairness criteria, or for proposing new criteria, when available datasets may be biased.
Estimating the Density Ratio between Distributions with High Discrepancy using Multinomial Logistic Regression
Srivastava, Akash, Han, Seungwook, Xu, Kai, Rhodes, Benjamin, Gutmann, Michael U.
Functions of the ratio of the densities $p/q$ are widely used in machine learning to quantify the discrepancy between the two distributions $p$ and $q$. For high-dimensional distributions, binary classification-based density ratio estimators have shown great promise. However, when densities are well separated, estimating the density ratio with a binary classifier is challenging. In this work, we show that the state-of-the-art density ratio estimators perform poorly on well-separated cases and demonstrate that this is due to distribution shifts between training and evaluation time. We present an alternative method that leverages multi-class classification for density ratio estimation and does not suffer from distribution shift issues. The method uses a set of auxiliary densities $\{m_k\}_{k=1}^K$ and trains a multi-class logistic regression to classify the samples from $p, q$, and $\{m_k\}_{k=1}^K$ into $K+2$ classes. We show that if these auxiliary densities are constructed such that they overlap with $p$ and $q$, then a multi-class logistic regression allows for estimating $\log p/q$ on the domain of any of the $K+2$ distributions and resolves the distribution shift problems of the current state-of-the-art methods. We compare our method to state-of-the-art density ratio estimators on both synthetic and real datasets and demonstrate its superior performance on the tasks of density ratio estimation, mutual information estimation, and representation learning. Code: https://www.blackswhan.com/mdre/