Regression
FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear Bandits
Chakraborty, Sunrit, Roy, Saptarshi, Basu, Debabrota
High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, \textbf{F}orgetfu\textbf{L} \textbf{I}terative \textbf{P}rivate \textbf{HA}rd \textbf{T}hresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret up to logarithmic factors. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of parallel interest.
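As a rough illustration of the kind of sparse regression oracle mentioned above, the following is a minimal sketch of a generic noisy iterative hard thresholding loop. It is not the FLIPHAT authors' N-IHT variant: the function names, the step-size rule, and the `noise_std` parameter are assumptions made for illustration, and the Gaussian noise here is not calibrated to any privacy budget, so the sketch carries no differential privacy guarantee.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argpartition(np.abs(v), -s)[-s:]
        out[keep] = v[keep]
    return out

def noisy_iht(X, y, s, n_iter=200, step=None, noise_std=0.0, seed=0):
    """Generic noisy IHT for least squares (illustrative, hypothetical helper).

    Each iteration takes a gradient step, adds Gaussian noise of scale
    noise_std (an assumed, uncalibrated parameter), and then projects onto
    s-sparse vectors by hard thresholding.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if step is None:
        # Assumed default: inverse of the largest eigenvalue of X^T X / n.
        step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        noise = noise_std * rng.standard_normal(d)
        theta = hard_threshold(theta - step * grad + noise, s)
    return theta

# Synthetic, noiseless example: 3 relevant features out of 50.
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 50))
theta_true = np.zeros(50)
theta_true[[3, 17, 41]] = [2.0, -2.0, 2.0]
y = X @ theta_true

theta_hat = noisy_iht(X, y, s=3)                       # no injected noise
theta_noisy = noisy_iht(X, y, s=3, noise_std=0.05)     # with injected noise
```

With `noise_std=0.0` the loop reduces to plain iterative hard thresholding; raising it perturbs every iterate, which is the mechanism a private variant would have to calibrate against a privacy budget.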
Parallel Algorithm for Optimal Threshold Labeling of Ordinal Regression Methods
Yamasaki, Ryoya, Tanaka, Toshiyuki
Ordinal regression (OR) is classification of ordinal data in which the underlying categorical target variable has a natural ordinal relation for the underlying explanatory variable. For $K$-class OR tasks, threshold methods learn a one-dimensional transformation (1DT) of the explanatory variable so that 1DT values for observations of the explanatory variable preserve the order of label values $1,\ldots,K$ for corresponding observations of the target variable well, and then assign a label prediction to the learned 1DT through threshold labeling, namely, according to the rank of an interval to which the 1DT belongs among intervals on the real line separated by $(K-1)$ threshold parameters. In this study, we propose a parallelizable algorithm to find the optimal threshold labeling, which was developed in previous research, and derive sufficient conditions for that algorithm to successfully output the optimal threshold labeling. In a numerical experiment we performed, the computation time taken for the whole learning process of a threshold method with the optimal threshold labeling could be reduced to approximately 60\,\% by using the proposed algorithm with parallel processing compared to using an existing algorithm based on dynamic programming.
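The threshold labeling step described above can be sketched as follows. This only shows how a label is read off from a 1DT value once the $(K-1)$ thresholds are given; it does not implement the paper's parallel algorithm for finding the optimal thresholds. The function name and the tie-breaking convention (a value equal to a threshold falls in the upper interval) are assumptions.

```python
import numpy as np

def threshold_labeling(values, thresholds):
    """Map 1DT values to labels 1..K given K-1 sorted thresholds."""
    thresholds = np.asarray(thresholds, dtype=float)
    if np.any(np.diff(thresholds) < 0):
        raise ValueError("thresholds must be sorted in non-decreasing order")
    values = np.asarray(values, dtype=float)
    # side="right" counts the thresholds that are <= each value, which is the
    # zero-based rank of the interval containing it; add 1 for labels 1..K.
    return np.searchsorted(thresholds, values, side="right") + 1

# K = 4 classes, so 3 thresholds split the real line into 4 intervals.
labels = threshold_labeling([-2.0, 0.1, 0.9, 3.5], thresholds=[-1.0, 0.5, 2.0])
```

Here the four values fall into the first, second, third, and fourth intervals respectively, so they receive the labels 1 through 4.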
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Makelov, Aleksandar, Lange, George, Nanda, Neel
Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.
Movie Revenue Prediction using Machine Learning Models
Udandarao, Vikranth, Gupta, Pratyush
In the contemporary film industry, accurately predicting a movie's earnings is paramount for maximizing profitability. This project aims to develop a machine learning model for predicting movie earnings based on input features like the movie name, the MPAA rating of the movie, the genre of the movie, the year of release of the movie, the IMDb Rating, the votes by the watchers, the director, the writer and the leading cast, the country of production of the movie, the budget of the movie, the production company and the runtime of the movie. Through a structured methodology involving data collection, preprocessing, analysis, model selection, evaluation, and improvement, a robust predictive model is constructed. Linear Regression, Decision Trees, Random Forest Regression, Bagging, XGBoosting and Gradient Boosting have been trained and tested. Model improvement strategies include hyperparameter tuning and cross-validation. The resulting model offers promising accuracy and generalization, facilitating informed decision-making in the film industry to maximize profits.
Analyze Additive and Interaction Effects via Collaborative Trees
We present Collaborative Trees, a novel tree model designed for regression prediction, along with its bagging version, which aims to analyze complex statistical associations between features and uncover potential patterns inherent in the data. We decompose the mean decrease in impurity from the proposed tree model to analyze the additive and interaction effects of features on the response variable. Additionally, we introduce network diagrams to visually depict how each feature contributes additively to the response and how pairs of features contribute interaction effects. Through a detailed demonstration using an embryo growth dataset, we illustrate how the new statistical tools aid data analysis, both visually and numerically. Moreover, we delve into critical aspects of tree modeling, such as prediction performance, inference stability, and bias in feature importance measures, leveraging real datasets and simulation experiments for comprehensive discussions. On the theory side, we show that Collaborative Trees, built upon a ``sum of trees'' approach with our own innovative tree model regularization, exhibit characteristics akin to matching pursuit, under the assumption of high-dimensional independent binary input features (or one-hot feature groups). This newfound link sheds light on the superior capability of our tree model in estimating additive effects of features, a crucial factor for accurate interaction effect estimation.
Excess Delay from GDP: Measurement and Causal Analysis
Ground Delay Programs (GDPs) have been widely used to resolve excessive demand-capacity imbalances at arrival airports by shifting foreseen airborne delay to pre-departure ground delay. While offering clear safety and efficiency benefits, GDPs may also create additional delay because of imperfect execution and uncertainty in predicting arrival airport capacity. This paper presents a methodology for measuring excess delay resulting from individual GDPs and investigates factors that influence excess delay using regularized regression models. We measured excess delay for 1210 GDPs from 33 U.S. airports in 2019. On a per-restricted-flight basis, the mean excess delay is 35.4 min, with a standard deviation of 20.6 min. In our regression analysis of the variation in excess delay, ridge regression is found to perform best. The factors affecting excess delay include time variations during gate out and taxi out for flights subject to the GDP, program rate setting and revisions, and GDP time duration.
Causal Customer Churn Analysis with Low-rank Tensor Block Hazard Model
Gao, Chenyin, Zhang, Zhiming, Yang, Shu
This study introduces an innovative method for analyzing the impact of various interventions on customer churn, using the potential outcomes framework. We present a new causal model, the tensorized latent factor block hazard model, which incorporates tensor completion methods for a principled causal analysis of customer churn. A crucial element of our approach is the formulation of a 1-bit tensor completion for the parameter tensor. This captures hidden customer characteristics and temporal elements from churn records, effectively addressing the binary nature of churn data and its time-monotonic trends. Our model also uniquely categorizes interventions by their similar impacts, enhancing the precision and practicality of implementing customer retention strategies. For computational efficiency, we apply a projected gradient descent algorithm combined with spectral clustering. We lay down the theoretical groundwork for our model, including its non-asymptotic properties. The efficacy and superiority of our model are further validated through comprehensive experiments on both simulated and real-world applications.
Machine learning-based optimization workflow of the homogeneity of spunbond nonwovens with human validation
Victor, Viny Saajan, Schmeißer, Andre, Leitte, Heike, Gramsch, Simone
In the last ten years, the average annual growth rate of nonwoven production was 4%. In 2020 and 2021, nonwoven production increased even further due to the huge demand for nonwoven products needed for protective clothing such as FFP2 masks to combat the COVID-19 pandemic. Optimizing the production process is still a challenge due to its high nonlinearity. In this paper, we present a machine learning-based optimization workflow aimed at improving the homogeneity of spunbond nonwovens. The optimization workflow is based on a mathematical model that simulates the microstructures of nonwovens. Based on training data coming from this simulator, different machine learning algorithms are trained in order to find a surrogate model for the time-consuming simulator. Human validation is employed to verify the outputs of machine learning algorithms by assessing the aesthetics of the nonwovens. We include scientific and expert knowledge into the training data to reduce the computational costs involved in the optimization process. We demonstrate the necessity and effectiveness of our workflow in optimizing the homogeneity of nonwovens.
Auditing the Fairness of COVID-19 Forecast Hub Case Prediction Models
Abrar, Saad Mohammad, Awasthi, Naman, Smolyak, Daniel, Frias-Martinez, Vanessa
The COVID-19 Forecast Hub was founded in 2020 and serves as a "central repository of COVID-19 forecasts from over 50 independent research groups" [1]. Participant research groups submit county, state and national US COVID-19 forecasts in a standardized format, and the Forecast Hub provides an interactive visualization tool to help decision makers and the general public analyze weekly predictions for COVID-19 hospitalizations, cases and deaths. The standardized predictions collected from all research groups, as well as the predictions for an ensemble model that brings all individual predictions together, are also shared with the Centers for Disease Control and Prevention (CDC), which uses these results for its official COVID-19 communications [2]. The COVID-19 Forecast Hub has been, and continues to be, a critical centralized resource to promote transparent decision making. Nevertheless, by focusing exclusively on prediction accuracy at different spatial granularities (e.g., county or state), the Forecast Hub fails to evaluate whether the proposed models are fair, i.e., share similar prediction performance across social determinants that have been known to play a role in COVID-19, including race, ethnicity and rurality [3, 4]. Diverse prediction performance across social determinants - for example, higher prediction errors for a given minority race or ethnicity - could negatively impact resource allocation and intervention decisions, e.g., hospital beds or stay-at-home orders, given that the CDC appears to be using the Forecast Hub predictions for official communications that subsequently inform policy decisions [2]. In other words, allocation or intervention harms might occur if models from the Forecast Hub are used to inform decision making across communities without taking into account fairness metrics [5]. There are many reasons why the COVID-19 prediction performance can be different across social determinants such as race, ethnicity or urbanization levels.
The Forecast Hub's COVID-19 prediction models are trained on datasets containing COVID-19
Sharpness-Aware Minimization in Genetic Programming
Bakurov, Illya, Haut, Nathan, Banzhaf, Wolfgang
The automatic discovery of mathematical expressions to describe phenomena captured in data is an extremely valuable tool for accelerating scientific discovery since the mathematical expressions can be used to make predictions about the systems that generated the data and the expressions can be directly studied to extract new insights into the system. There are many approaches for finding equations that fit data: linear regression, polynomial regression, SINDy [7], neural-symbolic regression [6], symbolic regression [19], etc. Genetic programming (GP) is a popular method for finding equations that fit data since it allows greater flexibility for the discovery of non-linear behaviors in data while also being effective in small data scenarios, unlike deep learning (DL) approaches which generally require large training data sets. This ability of GP to be effective in small data scenarios is likely in some part due to evolution's bias for simple solutions, and naturally simple solutions are less likely to overfit [5]. Even so, in small data scenarios, the models are naturally underconstrained in the interstitial spaces between the training data points, which means that surprising and unexpected behavior can occur when interpolating. Ideally, we would want the models to be at least stable (smooth) when interpolating, otherwise trust in the models can be severely diminished. Some GP methods have been proposed to help lock down the behavior of models in these interstitial spaces to improve the robustness against overfitting in small data scenarios such as order of non-linearity [33], model curvature [30], random sampling technique (RST) [14], RelaxGP [8], and overfit repulsors [31]. Order of non-linearity and model curvature are approaches that attempt to take properties of the model to predict if they are overfitting [30, 33]. Random sampling attempts to reduce the risk of overfitting by ensuring that no model sees the whole data set in a single generation [14].