Regression
Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews
Boluki, Ali, Sharami, Javad Pourmostafa Roshan, Shterionov, Dimitar
Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them based on their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews based on the few existing votes can cause helpful reviews to go unnoticed because of the limited attention span of readers. The problem of review helpfulness prediction is even more important for higher review volumes, and newly written reviews or launched products. In this work we compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews. The contributions of our work in relation to literature include extensively investigating the efficacy of state-of-the-art language models -- both monolingual and multilingual -- against a robust baseline, taking ranking metrics into account when assessing these approaches, and assessing multilingual models for the first time. We employ the Amazon review dataset for our experiments. According to our study on several product categories, multilingual and monolingual pre-trained language models outperform the baseline that utilizes random forest with handcrafted features as much as 23% in RMSE. Pre-trained language models reduce the need for complex text feature engineering. However, our results suggest that pre-trained multilingual models may not be used for fine-tuning only one language. We assess the performance of language models with and without additional features. Our results show that including additional features like product rating by the reviewer can further help the predictive methods.
Noise-Augmented $\ell_0$ Regularization of Tensor Regression with Tucker Decomposition
Yan, Tian, Li, Yinan, Liu, Fang
Tensor data are multi-dimension arrays. Low-rank decomposition-based regression methods with tensor predictors exploit the structural information in tensor predictors while significantly reducing the number of parameters in tensor regression. We propose a method named NA$_0$CT$^2$ (Noise Augmentation for $\ell_0$ regularization on Core Tensor in Tucker decomposition) to regularize the parameters in tensor regression (TR), coupled with Tucker decomposition. We establish theoretically that NA$_0$CT$^2$ achieves exact $\ell_0$ regularization in linear TR and generalized linear TR on the core tensor from the Tucker decomposition. To our knowledge, NA$_0$CT$^2$ is the first Tucker decomposition-based regularization method in TR to achieve $\ell_0$ in core tensor. NA$_0$CT$^2$ is implemented through an iterative procedure and involves two simple steps in each iteration -- generating noisy data based on the core tensor from the Tucker decomposition of the updated parameter estimate and running a regular GLM on noise-augmented data on vectorized predictors. We demonstrate the implementation of NA$_0$CT$^2$ and its $\ell_0$ regularization effect in both simulation studies and real data applications. The results suggest that NA$_0$CT$^2$ improves predictions compared to other decomposition-based TR approaches, with or without regularization and it also helps to identify important predictors though not designed for that purpose.
A Genetic Algorithm-based Framework for Learning Statistical Power Manifold
Umrawal, Abhishek K., Lane, Sean P., Hennes, Erin P.
Statistical power is a measure of the replicability of a categorical hypothesis test. Formally, it is the probability of detecting an effect, if there is a true effect present in the population. Hence, optimizing statistical power as a function of some parameters of a hypothesis test is desirable. However, for most hypothesis tests, the explicit functional form of statistical power for individual model parameters is unknown; but calculating power for a given set of values of those parameters is possible using simulated experiments. These simulated experiments are usually computationally expensive. Hence, developing the entire statistical power manifold using simulations can be very time-consuming. We propose a novel genetic algorithm-based framework for learning statistical power manifolds. For a multiple linear regression $F$-test, we show that the proposed algorithm/framework learns the statistical power manifold much faster as compared to a brute-force approach as the number of queries to the power oracle is significantly reduced. We also show that the quality of learning the manifold improves as the number of iterations increases for the genetic algorithm. Such tools are useful for evaluating statistical power trade-offs when researchers have little information regarding a priori best guesses of primary effect sizes of interest or how sampling variability in non-primary effects impacts power for primary ones.
Data Augmentation for Imbalanced Regression
Stocksieker, Samuel, Pommeret, Denys, Charpentier, Arthur
In this work, we consider the problem of imbalanced data in a regression framework when the imbalanced phenomenon concerns continuous or discrete covariates. Such a situation can lead to biases in the estimates. In this case, we propose a data augmentation algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In a first step, the DA procedure permits exploring a wider support than the initial one. In a second step, the WR method drives the exogenous distribution to a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.
Dual Graph Multitask Framework for Imbalanced Delivery Time Estimation
Zhang, Lei, Wang, Mingliang, Zhou, Xin, Wu, Xingyu, Cao, Yiming, Xu, Yonghui, Cui, Lizhen, Shen, Zhiqi
Delivery Time Estimation (DTE) is a crucial component of the e-commerce supply chain that predicts delivery time based on merchant information, sending address, receiving address, and payment time. Accurate DTE can boost platform revenue and reduce customer complaints and refunds. However, the imbalanced nature of industrial data impedes previous models from reaching satisfactory prediction performance. Although imbalanced regression methods can be applied to the DTE task, we experimentally find that they improve the prediction performance of low-shot data samples at the sacrifice of overall performance. To address the issue, we propose a novel Dual Graph Multitask framework for imbalanced Delivery Time Estimation (DGM-DTE). Our framework first classifies package delivery time as head and tail data. Then, a dual graph-based model is utilized to learn representations of the two categories of data. In particular, DGM-DTE re-weights the embedding of tail data by estimating its kernel density. We fuse two graph-based representations to capture both high- and low-shot data representations. Experiments on real-world Taobao logistics datasets demonstrate the superior performance of DGM-DTE compared to baselines.
Bayesian Federated Inference for Statistical Models
Jonker, Marianne A., Pazira, Hassan, Coolen, Anthony CC
Identifying predictive factors via multivariable statistical analysis is for rare diseases often impossible because the data sets available are too small. Combining data from different medical centers into a single (larger) database would alleviate this problem, but is in practice challenging due to regulatory and logistic problems. Federated Learning (FL) is a machine learning approach that aims to construct from local inferences in separate data centers what would have been inferred had the data sets been merged. It seeks to harvest the statistical power of larger data sets without actually creating them. The FL strategy is not always feasible for small data sets. Therefore, in this paper we refine and implement an alternative Bayesian Federated Inference (BFI) framework for multi center data with the same aim as FL. The BFI framework is designed to cope with small data sets by inferring locally not only the optimal parameter values, but also additional features of the posterior parameter distribution, capturing information beyond that is used in FL. BFI has the additional benefit that a single inference cycle across the centers is sufficient, whereas FL needs multiple cycles. We quantify the performance of the proposed methodology on simulated and real life data.
AI/ML Algorithms and Applications in VLSI Design and Technology
Amuru, Deepthi, Vudumula, Harsha V., Cherupally, Pavan K., Gurram, Sushanth R., Ahmad, Amir, Zahra, Andleeb, Abbas, Zia
An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies employed for such tasks are largely manual; thus, time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort for understanding and processing the data within and across different abstraction levels via automated learning algorithms. It, in turn, improves the IC yield and reduces the manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced in the past towards VLSI design and manufacturing. Moreover, we discuss the scope of AI/ML applications in the future at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.
Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions
Johnson, Daniel D., Hanchi, Ayoub El, Maddison, Chris J.
Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.
Joint Probability Trees
Nyga, Daniel, Picklum, Mareike, Schierenbeck, Tom, Beetz, Michael
Joint probability distributions offer a wide range of highpotential applications in engineering, science, and technology (Chater et al., 2006; Griffiths et al., 2008; Knill & Pouget, 2004). Besides families of continuous distributions, introducing strong independence assumptions that must be probabilistic graphical models (PGMs), such as Bayesian known prior to learning and may turn out to be too great networks and Markov random fields (Koller & Friedman, simplifications of a model to be of practical use (Besag, 2009), are the de-facto standard in probabilistic knowledge 1975; Jain, 2012). As a simple example, consider a probability representation. They provide graph-based languages to space X, Y, C of two numeric variables, X and Y, model dependencies and independencies of variables, and and one symbolic variable C, dom(C) = {Red, Blue} as local joint or conditional distributions that quantify the statistical illustrated in Figures 1a and 1b. Let the symbolic values Red dependencies. However, the practical applicability of and Blue demaracate two clusters that are approximately PGMs suffers from the representational and computational normally distributed.
A model-free feature selection technique of feature screening and random forest based recursive feature elimination
Due to the development of data technology, feature selection plays an important role in both statistics and machine learning. High dimensional and ultra-high dimensional datasets are widely used in many fields, such as finance, image recognition, text classification, etc. Although more detailed information is provided with the increase of dimensions, the existence of a large number of redundant features weakens the generalization ability of models and increases the difficulty of data analysis (Jain et al., 2000). Thus, the efficiency of feature selection is crucial, as it focuses on choosing a small subset of informative features that contain the information of data and determine the source of the specific concerns derived from the study. In many data analyses, feature selection is a significant and frequently used dimensionality reduction technique and is often considered a key preprocessing step in data analysis for model benefits, such as interpretability, accuracy, lower computational costs, and less prone to overfitting (Sheikhpour et al., 2017; Khaire and Dhanalakshmi, 2022).