Regression
Machine learning-based characterization of hydrochar from biomass: Implications for sustainable energy and material production
Shafizadeh, Alireza, Shahbeik, Hossein, Rafiee, Shahin, Moradi, Aysooda, Shahbaz, Mohammadreza, Madadi, Meysam, Li, Cheng, Peng, Wanxi, Tabatabaei, Meisam, Aghbashlo, Mortaza
Hydrothermal carbonization (HTC) is a process that converts biomass into versatile hydrochar without the need for prior drying. The physicochemical properties of hydrochar are influenced by biomass properties and processing parameters, making it challenging to optimize for specific applications through trial-and-error experiments. To save time and money, machine learning can be used to develop a model that characterizes hydrochar produced from different biomass sources under varying reaction processing parameters. Thus, this study aims to develop an inclusive model to characterize hydrochar using a database covering a range of biomass types and reaction processing parameters. The quality and quantity of hydrochar are predicted using two models (decision tree regression and support vector regression). The decision tree regression model outperforms the support vector regression model in terms of forecast accuracy (R2 > 0.88, RMSE < 6.848, and MAE < 4.718). Using an evolutionary algorithm, optimum inputs are identified based on cost functions provided by the selected model to optimize hydrochar for energy production, soil amendment, and pollutant adsorption, resulting in hydrochar yields of 84.31%, 84.91%, and 80.40%, respectively. The feature importance analysis reveals that biomass ash/carbon content and operating temperature are the primary factors affecting hydrochar production in the HTC process.
Using evolutionary machine learning to characterize and optimize co-pyrolysis of biomass feedstocks and polymeric wastes
Shahbeik, Hossein, Shafizadeh, Alireza, Nadian, Mohammad Hossein, Jeddi, Dorsa, Mirjalili, Seyedali, Yang, Yadong, Lam, Su Shiung, Pan, Junting, Tabatabaei, Meisam, Aghbashlo, Mortaza
Co-pyrolysis of biomass feedstocks with polymeric wastes is a promising strategy for improving the quantity and quality parameters of the resulting liquid fuel. Numerous experimental measurements are typically conducted to find the optimal operating conditions. However, performing co-pyrolysis experiments is highly challenging due to the need for costly and lengthy procedures. Machine learning (ML) provides capabilities to cope with such issues by leveraging on existing data. This work aims to introduce an evolutionary ML approach to quantify the (by)products of the biomass-polymer co-pyrolysis process. A comprehensive dataset covering various biomass-polymer mixtures under a broad range of process conditions is compiled from the qualified literature. The database was subjected to statistical analysis and mechanistic discussion. The input features are constructed using an innovative approach to reflect the physics of the process. The constructed features are subjected to principal component analysis to reduce their dimensionality. The obtained scores are introduced into six ML models. Gaussian process regression model tuned by particle swarm optimization algorithm presents better prediction performance (R2 > 0.9, MAE < 0.03, and RMSE < 0.06) than other developed models. The multi-objective particle swarm optimization algorithm successfully finds optimal independent parameters.
Towards human-compatible autonomous car: A study of non-verbal Turing test in automated driving with affective transition modelling
Li, Zhaoning, Jiang, Qiaoli, Wu, Zhengming, Liu, Anqi, Wu, Haiyan, Huang, Miner, Huang, Kai, Ku, Yixuan
Autonomous cars are indispensable when humans go further down the hands-free route. Although existing literature highlights that the acceptance of the autonomous car will increase if it drives in a human-like manner, sparse research offers the naturalistic experience from a passenger's seat perspective to examine the humanness of current autonomous cars. The present study tested whether the AI driver could create a human-like ride experience for passengers based on 69 participants' feedback in a real-road scenario. We designed a ride experience-based version of the non-verbal Turing test for automated driving. Participants rode in autonomous cars (driven by either human or AI drivers) as a passenger and judged whether the driver was human or AI. The AI driver failed to pass our test because passengers detected the AI driver above chance. In contrast, when the human driver drove the car, the passengers' judgement was around chance. We further investigated how human passengers ascribe humanness in our test. Based on Lewin's field theory, we advanced a computational model combining signal detection theory with pre-trained language models to predict passengers' humanness rating behaviour. We employed affective transition between pre-study baseline emotions and corresponding post-stage emotions as the signal strength of our model. Results showed that the passengers' ascription of humanness would increase with the greater affective transition. Our study suggested an important role of affective transition in passengers' ascription of humanness, which might become a future direction for autonomous driving.
A Robust Classifier Under Missing-Not-At-Random Sample Selection Bias
Mai, Huy, Huang, Wen, Du, Wei, Wu, Xintao
The shift between the training and testing distributions is commonly due to sample selection bias, a type of bias caused by non-random sampling of examples to be included in the training set. Although there are many approaches proposed to learn a classifier under sample selection bias, few address the case where a subset of labels in the training set are missing-not-at-random (MNAR) as a result of the selection process. In statistics, Greene's method formulates this type of sample selection with logistic regression as the prediction model. However, we find that simply integrating this method into a robust classification framework is not effective for this bias setting. In this paper, we propose BiasCorr, an algorithm that improves on Greene's method by modifying the original training set in order for a classifier to learn under MNAR sample selection bias. We provide theoretical guarantee for the improvement of BiasCorr over Greene's method by analyzing its bias. Experimental results on real-world datasets demonstrate that BiasCorr produces robust classifiers and can be extended to outperform state-of-the-art classifiers that have been proposed to train under sample selection bias.
On Consistency of Signatures Using Lasso
Guo, Xin, Zhang, Ruixun, Zhao, Chaoyi
Signature transforms are iterated path integrals of continuous and discrete-time time series data, and their universal nonlinearity linearizes the problem of feature selection. This paper revisits the consistency issue of Lasso regression for the signature transform, both theoretically and numerically. Our study shows that, for processes and time series that are closer to Brownian motion or random walk with weaker inter-dimensional correlations, the Lasso regression is more consistent for their signatures defined by It\^o integrals; for mean reverting processes and time series, their signatures defined by Stratonovich integrals have more consistency in the Lasso regression. Our findings highlight the importance of choosing appropriate definitions of signatures and stochastic models in statistical inference and machine learning.
Covariate balancing using the integral probability metric for causal inference
Kong, Insung, Park, Yuha, Jung, Joonhyuk, Lee, Kwonsang, Kim, Yongdai
Weighting methods in causal inference have been widely used to achieve a desirable level of covariate balancing. However, the existing weighting methods have desirable theoretical properties only when a certain model, either the propensity score or outcome regression model, is correctly specified. In addition, the corresponding estimators do not behave well for finite samples due to large variance even when the model is correctly specified. In this paper, we consider to use the integral probability metric (IPM), which is a metric between two probability measures, for covariate balancing. Optimal weights are determined so that weighted empirical distributions for the treated and control groups have the smallest IPM value for a given set of discriminators. We prove that the corresponding estimator can be consistent without correctly specifying any model (neither the propensity score nor the outcome regression model). In addition, we empirically show that our proposed method outperforms existing weighting methods with large margins for finite samples.
On the robust learning mixtures of linear regressions
In this note, we consider the problem of robust learning mixtures of linear regressions. We connect mixtures of linear regressions and mixtures of Gaussians with a simple thresholding, so that a quasi-polynomial time algorithm can be obtained under some mild separation condition. This algorithm has significantly better robustness than the previous result.
A parametric distribution for exact post-selection inference with data carving
Post-selection inference (PoSI) is a statistical technique for obtaining valid confidence intervals and p-values when hypothesis generation and testing use the same source of data. PoSI can be used on a range of popular algorithms including the Lasso. Data carving is a variant of PoSI in which a portion of held out data is combined with the hypothesis generating data at inference time. While data carving has attractive theoretical and empirical properties, existing approaches rely on computationally expensive MCMC methods to carry out inference. This paper's key contribution is to show that pivotal quantities can be constructed for the data carving procedure based on a known parametric distribution. Specifically, when the selection event is characterized by a set of polyhedral constraints on a Gaussian response, data carving will follow the sum of a normal and a truncated normal (SNTN), which is a variant of the truncated bivariate normal distribution. The main impact of this insight is that obtaining exact inference for data carving can be made computationally trivial, since the CDF of the SNTN distribution can be found using the CDF of a standard bivariate normal. A python package sntn has been released to further facilitate the adoption of data carving with PoSI.
Contrastive inverse regression for dimension reduction
Hawke, Sam, Luo, Hengrui, Li, Didong
Supervised dimension reduction (SDR) has been a topic of growing interest in data science, as it enables the reduction of high-dimensional covariates while preserving the functional relation with certain response variables of interest. However, existing SDR methods are not suitable for analyzing datasets collected from case-control studies. In this setting, the goal is to learn and exploit the low-dimensional structure unique to or enriched by the case group, also known as the foreground group. While some unsupervised techniques such as the contrastive latent variable model and its variants have been developed for this purpose, they fail to preserve the functional relationship between the dimension-reduced covariates and the response variable. In this paper, we propose a supervised dimension reduction method called contrastive inverse regression (CIR) specifically designed for the contrastive setting. CIR introduces an optimization problem defined on the Stiefel manifold with a non-standard loss function. We prove the convergence of CIR to a local optimum using a gradient descent-based algorithm, and our numerical study empirically demonstrates the improved performance over competing methods for high-dimensional data.
A Novel Framework for Improving the Breakdown Point of Robust Regression Algorithms
Fan, Zheyi, Ng, Szu Hui, Hu, Qingpei
We present an effective framework for improving the breakdown point of robust regression algorithms. Robust regression has attracted widespread attention due to the ubiquity of outliers, which significantly affect the estimation results. However, many existing robust least-squares regression algorithms suffer from a low breakdown point, as they become stuck around local optima when facing severe attacks. By expanding on the previous work, we propose a novel framework that enhances the breakdown point of these algorithms by inserting a prior distribution in each iteration step, and adjusting the prior distribution according to historical information. We apply this framework to a specific algorithm and derive the consistent robust regression algorithm with iterative local search (CORALS). The relationship between CORALS and momentum gradient descent is described, and a detailed proof of the theoretical convergence of CORALS is presented. Finally, we demonstrate that the breakdown point of CORALS is indeed higher than that of the algorithm from which it is derived. We apply the proposed framework to other robust algorithms, and show that the improved algorithms achieve better results than the original algorithms, indicating the effectiveness of the proposed framework.