Regression
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Dubois, Yann, Galambosi, Balázs, Liang, Percy, Hashimoto, Tatsunori B.
LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?". To achieve this, we first fit a generalized linear model (GLM) to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM on a zero difference in lengths. Length control not only improves the robustness of the metric to manipulations in model verbosity, but also increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at https://tatsu-lab.github.io/alpaca_eval/ .
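The two-step recipe in the abstract above (fit a GLM on the length difference, then predict with that difference set to zero) can be illustrated with a minimal sketch. This is not the released AlpacaEval code: it keeps only an intercept and a single length term, omits the "other relevant features" the abstract mentions, and the tanh scaling of the length difference, the function names, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Fit a logistic-regression GLM with Newton's method; returns the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + ridge * w
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        w -= np.linalg.solve(hess, grad)
    return w

def length_controlled_win_rate(prefs, len_diff):
    """Predict preferences with the length term set to zero and average them."""
    # Assumption: squash the standardized length difference with tanh.
    scaled = np.tanh(len_diff / np.std(len_diff))
    X = np.column_stack([np.ones_like(scaled), scaled])
    w = fit_logistic(X, prefs)
    X_zero = X.copy()
    X_zero[:, 1] = 0.0  # counterfactual: model and baseline have equal length
    return float(np.mean(sigmoid(X_zero @ w)))

def simulate_preferences(n=2000, seed=0):
    """Hypothetical data: equal quality, but the annotator favors longer outputs."""
    rng = np.random.default_rng(seed)
    len_diff = rng.normal(200.0, 300.0, size=n)  # model tends to be longer
    p_win = sigmoid(1.5 * np.tanh(len_diff / 300.0))
    prefs = (rng.random(n) < p_win).astype(float)
    return prefs, len_diff

prefs, len_diff = simulate_preferences()
print("raw win rate:", prefs.mean())
print("length-controlled win rate:", length_controlled_win_rate(prefs, len_diff))
```

In the simulated data the model has no quality advantage, so the raw win rate is inflated purely by verbosity, while the length-controlled estimate should sit close to 0.5.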
Bayesian Additive Regression Networks
We apply Bayesian Additive Regression Tree (BART) principles to training an ensemble of small neural networks for regression tasks. Using Markov Chain Monte Carlo, we sample from the posterior distribution of neural networks that have a single hidden layer. To create an ensemble of these, we apply Gibbs sampling to update each network against the residual target value (i.e. subtracting the effect of the other networks). We demonstrate the effectiveness of this technique on several benchmark regression problems, comparing it to equivalent shallow neural networks, BART, and ordinary least squares. Our Bayesian Additive Regression Networks (BARN) provide more consistent and often more accurate results. On test data benchmarks, BARN averaged between 5 and 20 percent lower root mean square error. This error performance does come at a cost, however, of greater computation time. BARN sometimes takes on the order of a minute where competing methods take a second or less. However, BARN without cross-validated hyperparameter tuning takes about the same amount of computation time as other methods with tuning, and it is still typically more accurate.
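The residual-target idea in the abstract above (each network is updated against the target minus the contribution of the other networks) can be sketched as follows. This is a deterministic backfitting analogue, not the MCMC/Gibbs sampler the abstract describes: the hidden weights are fixed at random values and only the output layer is refit by ridge least squares. The class and function names, and all hyperparameters, are hypothetical.

```python
import numpy as np

class TinyNet:
    """Single-hidden-layer network; hidden weights are random and fixed,
    only the output weights are fit (ridge least squares)."""

    def __init__(self, n_in, n_hidden, rng, ridge=1e-3):
        self.W = rng.normal(size=(n_in, n_hidden))
        self.b = rng.normal(size=n_hidden)
        self.beta = np.zeros(n_hidden)
        self.ridge = ridge

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, target):
        H = self._hidden(X)
        A = H.T @ H + self.ridge * np.eye(H.shape[1])
        self.beta = np.linalg.solve(A, H.T @ target)

    def predict(self, X):
        return self._hidden(X) @ self.beta

def backfit_ensemble(X, y, n_nets=5, n_hidden=4, n_sweeps=20, seed=0):
    """Cycle through the networks, refitting each one to its residual target."""
    rng = np.random.default_rng(seed)
    nets = [TinyNet(X.shape[1], n_hidden, rng) for _ in range(n_nets)]
    preds = np.zeros((n_nets, len(y)))
    for _ in range(n_sweeps):
        for k, net in enumerate(nets):
            # Residual target: y minus the effect of all the other networks.
            residual = y - (preds.sum(axis=0) - preds[k])
            net.fit(X, residual)
            preds[k] = net.predict(X)
    return nets

def ensemble_predict(nets, X):
    """The ensemble prediction is the sum of the individual networks."""
    return sum(net.predict(X) for net in nets)

# Toy usage on a hypothetical 1-D regression problem.
X_demo = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0])
nets_demo = backfit_ensemble(X_demo, y_demo)
print("train MSE:", np.mean((y_demo - ensemble_predict(nets_demo, X_demo)) ** 2))
```

A sampler in the spirit of the abstract would replace the deterministic `fit` step with a draw from a posterior over networks given the same residual target.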
Wasserstein F-tests for Fréchet regression on Bures-Wasserstein manifolds
This paper considers the problem of regression analysis with a random covariance matrix as the outcome and Euclidean covariates in the framework of Fréchet regression on the Bures-Wasserstein manifold. Such regression problems have many applications in single cell genomics and neuroscience, where covariance matrices are measured over a large set of samples. Fréchet regression on the Bures-Wasserstein manifold is formulated as estimating the conditional Fréchet mean given covariates $x$. A non-asymptotic $\sqrt{n}$-rate of convergence (up to $\log n$ factors) is obtained for our estimator $\hat{Q}_n(x)$ uniformly for $\left\|x\right\| \lesssim \sqrt{\log n}$, which is crucial for deriving the asymptotic null distribution and power of our proposed statistical test for the null hypothesis of no association. In addition, a central limit theorem for the point estimate $\hat{Q}_n(x)$ is obtained, giving insight into a test for covariate effects. The null distribution of the test statistic is shown to converge to a weighted sum of independent chi-squares, which implies that the proposed test has the desired significance level asymptotically. The power of the test is also demonstrated against a sequence of contiguous alternatives. Simulation results show the accuracy of the asymptotic distributions. The proposed methods are applied to a single cell gene expression data set that shows how the gene co-expression network changes as people age.
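For readers unfamiliar with the geometry, the Bures-Wasserstein distance between two positive semidefinite matrices $A$ and $B$ is commonly written as $d^2(A,B) = \mathrm{tr}(A) + \mathrm{tr}(B) - 2\,\mathrm{tr}\big[(A^{1/2} B A^{1/2})^{1/2}\big]$, and the conditional Fréchet mean minimizes the expected squared distance given $x$. The sketch below only evaluates this distance; it is not the paper's estimator or test, and the function names are hypothetical.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def bures_wasserstein_distance(A, B):
    """Bures-Wasserstein distance between two PSD matrices."""
    root_A = psd_sqrt(A)
    cross = psd_sqrt(root_A @ B @ root_A)
    d2 = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return float(np.sqrt(max(d2, 0.0)))  # clamp round-off below zero

A_demo = np.diag([1.0, 4.0])
B_demo = np.diag([4.0, 9.0])
print(bures_wasserstein_distance(A_demo, B_demo))
```

For commuting (for example, diagonal) matrices the distance reduces to the Euclidean distance between the square roots of the eigenvalues, which gives a quick sanity check.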
Personality-affected Emotion Generation in Dialog Systems
Wen, Zhiyuan, Cao, Jiannong, Shen, Jiaxing, Yang, Ruosong, Liu, Shuaiqi, Sun, Maosong
Generating appropriate emotions for responses is essential for dialog systems to provide human-like interaction in various application scenarios. Most previous dialog systems tried to achieve this goal by learning empathetic manners from anonymous conversational data. However, emotional responses generated by those methods may be inconsistent, which will decrease user engagement and service quality. Psychological findings suggest that the emotional expressions of humans are rooted in personality traits. Therefore, we propose a new task, Personality-affected Emotion Generation, to generate emotion based on the personality given to the dialog system and further investigate a solution through the personality-affected mood transition. Specifically, we first construct a daily dialog dataset, Personality EmotionLines Dataset (PELD), with emotion and personality annotations. Subsequently, we analyze the challenges in this task, i.e., (1) heterogeneously integrating personality and emotional factors and (2) extracting multi-granularity emotional information in the dialog context. Finally, we propose to model the personality as the transition weight by simulating the mood transition process in the dialog system and solve the challenges above. We conduct extensive experiments on PELD for evaluation. Results suggest that by adopting our method, the emotion generation performance is improved by 13% in macro-F1 and 5% in weighted-F1 from the BERT-base model.
Analyzing Economic Convergence Across the Americas: A Survival Analysis Approach to GDP per Capita Trajectories
Abstract: By integrating survival analysis, machine learning algorithms, and economic interpretation, this research examines the temporal dynamics associated with attaining a 5 percent rise in purchasing power parity-adjusted GDP per capita over a period of 120 months (2013-2022). A comparative investigation reveals that DeepSurv is proficient at capturing non-linear interactions, although standard models exhibit comparable performance under certain circumstances. The weight matrix evaluates the economic ramifications of vulnerabilities, risks, and capacities. In order to meet the GDPpc objective, the findings emphasize the need for a balanced approach to risk-taking, strategic vulnerability reduction, and investment in governmental capacities and social cohesiveness. Policy guidelines promote individualized approaches that take into account the complex dynamics at play while making decisions.
JEL: 04, C8, C5, O1
1. Introduction
In contemporary economic research, the exploration of temporal dynamics in a nation's journey to achieve a specific level of GDP per capita is of paramount importance. This empirical investigation, conducted across 33 American countries, uses a comprehensive dataset that includes countries with right-censored data (9 countries) and those reaching a 5% increase in GDP per capita at purchasing power parity (PIBpcPPP) within 120 months (24 countries). In addressing the central query, this research aims to unravel the intricate relationship of variables and risks influencing the time required for a country to achieve the specified 5% increase in GDP per capita. Leveraging advanced statistical techniques, particularly survival analysis, the study incorporates key variables such as Vul_Inherent, Vul_Fragility_Democracy, and Vul_Human Rights, offering a robust understanding of multifaceted vulnerabilities. This academic pursuit emphasizes rigorous methodologies, empirical analyses, and data-driven insights.
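To illustrate how right-censored countries (those that did not reach the 5% increase within the 120-month window) enter a time-to-event analysis, here is a minimal Kaplan-Meier sketch. The text above does not state which standard estimators were used, so this choice is an assumption made purely for illustration, and the toy durations below are hypothetical, not the study's data.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the probability of NOT yet having reached the target.

    durations: months until the event or until censoring.
    observed:  True if the event occurred, False if right-censored.
    Returns the distinct event times and the survival estimate after each.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                # still under observation at t
        events = np.sum((durations == t) & observed)    # reached the target at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical toy data: six countries, three of them censored.
months = [10, 20, 20, 30, 120, 120]
reached = [True, True, False, True, False, False]
times, surv = kaplan_meier(months, reached)
for t, s in zip(times, surv):
    print(f"month {t:.0f}: estimated share not yet at target = {s:.3f}")
```

Censored observations contribute to the at-risk counts up to their censoring time but never count as events, which is what distinguishes this estimator from simply averaging the observed durations.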
Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot
Henry, Catherine, Kennington, Casey
Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world. That is, rather than starting with a top-down symbol (e.g., a word referring to an object) and providing meaning through the application of predetermined samples, the robot autonomously and gradually breaks up its exploration space into a series of increasingly specific unlabeled categories at which point an external expert may optionally provide a symbol association. We extend prior work by using a robot that can observe the visual world, introducing a higher dimensional sensory space, and using a more generalizable method of category building. Our experiments show that the robot learns categories based on actions and what it visually observes, and that those categories can be symbolically grounded into.
A Bayesian Regression Approach for Estimating the Impact of COVID-19 on Consumer Behavior in the Restaurant Industry
The COVID-19 pandemic has had a long-term impact on industries worldwide, with the hospitality and food industry facing significant challenges, leading to the permanent closure of many restaurants and the loss of jobs. In this study, we developed an innovative analytical framework using Hamiltonian Monte Carlo for predictive modeling with Bayesian regression, aiming to estimate the change point in consumer behavior towards different types of restaurants due to COVID-19. Our approach emphasizes a novel method in computational analysis, providing insights into customer behavior changes before and after the pandemic. This research contributes to understanding the effects of COVID-19 on the restaurant industry and is valuable for restaurant owners and policymakers.
What is to be gained by ensemble models in analysis of spectroscopic data?
Vibrational spectroscopic techniques, including near-infrared (NIR), mid-infrared (MIR), and Raman, use the effect of light to provide information about the constituents of a sample. These low-cost, rapid, and noninvasive techniques are widely and routinely used in many application domains. Prediction in spectroscopic data is a topic of major interest in the chemometric literature; see, for example, Frizzarin et al. (2021c,b) and Singh and Domijan (2019). Numerous advances in statistical machine learning methodology in the past few decades offer the potential to improve prediction performance over the well-established partial least squares (PLS) approach. Comparative analyses of algorithm prediction ability for spectroscopic data have shown that PLS variants perform strongly (Frizzarin et al., 2021b; Singh and Domijan, 2019), but that no single model outperforms the others in all settings.
Detecting Gender Bias in Course Evaluations
Lindau, Sarah, Nilsson, Linnea
We use different methods to examine and explore the data and find differences in what students write about courses depending on gender of the examiner. Data from English and Swedish courses are evaluated and compared, in order to capture more nuance in the gender bias that might be found. Here we present the results from the work so far, but this is an ongoing project and there is more work to do.
Explainable AI Integrated Feature Engineering for Wildfire Prediction
Fan, Di, Biswas, Ayan, Ahrens, James Paul
Wildfires present intricate challenges for prediction, necessitating the use of sophisticated machine learning techniques for effective modeling\cite{jain2020review}. In our research, we conducted a thorough assessment of various machine learning algorithms for both classification and regression tasks relevant to predicting wildfires. We found that for classifying different types or stages of wildfires, the XGBoost model outperformed others in terms of accuracy and robustness. Meanwhile, the Random Forest regression model showed superior results in predicting the extent of wildfire-affected areas, excelling in both prediction error and explained variance. Additionally, we developed a hybrid neural network model that integrates numerical data and image information for simultaneous classification and regression. To gain deeper insights into the decision-making processes of these models and identify key contributing features, we utilized eXplainable Artificial Intelligence (XAI) techniques, including TreeSHAP, LIME, Partial Dependence Plots (PDP), and Gradient-weighted Class Activation Mapping (Grad-CAM). These interpretability tools shed light on the significance and interplay of various features, highlighting the complex factors influencing wildfire predictions. Our study not only demonstrates the effectiveness of specific machine learning models in wildfire-related tasks but also underscores the critical role of model transparency and interpretability in environmental science applications.