Regression
Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness
Tian, Ye, Gu, Yuqi, Feng, Yang
Representation multi-task learning (MTL) and transfer learning (TL) have achieved tremendous success in practice. However, the theoretical understanding of these methods is still lacking. Most existing theoretical works focus on cases where all tasks share the same representation, and claim that MTL and TL almost always improve performance. However, as the number of tasks grows, assuming all tasks share the same representation is unrealistic. Also, this does not always match empirical findings, which suggest that a shared representation may not necessarily improve single-task or target-only learning performance. In this paper, we aim to understand how to learn from tasks with \textit{similar but not exactly the same} linear representations, while dealing with outlier tasks. With a known intrinsic dimension, we propose two algorithms that are \textit{adaptive} to the similarity structure and \textit{robust} to outlier tasks under both MTL and TL settings. Our algorithms outperform single-task or target-only learning when representations across tasks are sufficiently similar and the fraction of outlier tasks is small. Furthermore, they always perform no worse than single-task learning or target-only learning, even when the representations are dissimilar. We provide information-theoretic lower bounds to show that our algorithms are nearly \textit{minimax} optimal in a large regime. We also propose an algorithm to adapt to the unknown intrinsic dimension. We conduct two simulation studies to verify our theoretical results.
Machine Learning Meets Mental Training -- A Proof of Concept Applied to Memory Sports
"Mens sana in corpore sano" (Juvenal, 100-127 AD) Mental training has long been part of human culture, appearing in several different forms ranging from meditation to particular games or cognitive exercises aimed at various purposes. The past decades, however, have seen it losing its cardinal role in the well-roundedness of an individual and becoming more of a side hustle, confined to particular hobbies or to specific techniques needed for mental-health purposes. By contrast, recent years have seen an exponential investment in and development of artificial intelligence and machine learning technologies, which seem to be successfully tackling increasingly difficult tasks and problems. This work, then, aims to combine the two fields together by presenting a practical implementation of machine learning to the particular form of mental training that is the art of memory, taken in its competitive version called "Memory Sports". Such a fusion, on the one hand, strives to raise awareness about both realms, while on the other it seeks to encourage research in this mixed field as a way to, ultimately, drive forward the development of this seemingly underestimated sport. After first introducing the topic of mental training and its particular branch of Memory Sports, in the first chapter, the machine learning involved in the project is explained in the second chapter. The third chapter, then, presents two practical implementations of machine learning in Memory Sports, the results of which are discussed in the final chapter, together with several potential directions for future research. Ultimately, as well as stimulating interest and inspiration regarding the two fields involved in this research and exploring their points of contact, the aim here is also to investigate potential developments of human-machine collaborations, which are likely to be the focus of the next advances in technology and society overall. 
Starting to think in this view can help better prepare for the abrupt changes that might come and even be part of them, so as to drive their aim and scope toward a more responsible, and thus better, outcome.
Sparsified Simultaneous Confidence Intervals for High-Dimensional Linear Models
Zhu, Xiaorui, Qin, Yichen, Wang, Peng
High-dimensional data analysis plays an important role in modern scientific discoveries. There has been extensive work on high-dimensional variable selection and estimation using penalized regressions, such as Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang et al., 2010), and selection by partitioning solution paths (Liu and Wang, 2018). In recent years, inference for the true regression coefficients and the true model has begun to attract attention. A major challenge of high-dimensional inference is how to quantify the uncertainty of the coefficient estimate, because such uncertainty depends on two components: the uncertainty in parameter estimation given the selected model and the uncertainty in selecting the model, both of which are difficult to estimate and are actively studied. For inference of the regression coefficients, Scheffé (1953) introduces the notion of simultaneous confidence intervals, which are a sequence of intervals containing the true coefficients at a given probability. For the high-dimensional linear models, Dezeure et al. (2017) and Zhang and Cheng (2017) construct the simultaneous confidence intervals using the debiased Lasso approach (van de Geer et al., 2014; Zhang and Zhang, 2014).
CAMP: A Context-Aware Cricket Players Performance Metric
Ayub, Muhammad Sohaib, Ullah, Naimat, Ali, Sarwan, Khan, Imdad Ullah, Awais, Mian Muhammad, Khan, Muhammad Asad, Faizullah, Safiullah
Cricket is the second most popular sport after soccer in terms of viewership. However, the assessment of individual player performance, a fundamental task in team sports, is currently primarily based on aggregate performance statistics, including average runs and wickets taken. We propose Context-Aware Metric of player Performance, CAMP, to quantify individual players' contributions toward a cricket match outcome. CAMP employs data mining methods and enables effective data-driven decision-making for selection and drafting, coaching and training, team line-ups, and strategy development. CAMP incorporates the exact context of performance, such as opponents' strengths and specific circumstances of games, such as pressure situations. We empirically evaluate CAMP on data of limited-over cricket matches between 2001 and 2019. In every match, a committee of experts declares one player as the best player, called Man of the Match (MoM). The top two rated players by CAMP match with MoM in 83\% of the 961 games. Thus, the CAMP rating of the best player closely matches that of the domain experts. By this measure, CAMP significantly outperforms the current best-known players' contribution measure based on the Duckworth-Lewis-Stern (DLS) method.
SALC: Skeleton-Assisted Learning-Based Clustering for Time-Varying Indoor Localization
Hsiao, An-Hung, Shen, Li-Hsiang, Chang, Chen-Yi, Chiu, Chun-Jie, Feng, Kai-Ten
Wireless indoor localization has attracted a significant amount of attention in recent years. Using received signal strength (RSS) obtained from WiFi access points (APs) to establish a fingerprinting database is a widely utilized method in indoor localization. However, the time-variant problem for indoor positioning systems is not well investigated in the existing literature. Compared to conventional static fingerprinting, the dynamically reconstructed database can adapt to a highly changing environment, which achieves sustainability of localization accuracy. To deal with the time-varying issue, we propose a skeleton-assisted learning-based clustering localization (SALC) system, including RSS-oriented map-assisted clustering (ROMAC), cluster-based online database establishment (CODE), and cluster-scaled location estimation (CsLE). The SALC scheme jointly considers similarities from the skeleton-based shortest path (SSP) and the time-varying RSS measurements across the reference points (RPs). ROMAC clusters RPs into different feature sets and therefore selects suitable monitor points (MPs) for enhancing location estimation. Moreover, the CODE algorithm aims to establish an adaptive fingerprint database to alleviate the time-varying problem. Finally, CsLE is adopted to acquire the target position by leveraging the benefits of clustering information and estimated signal variations in order to rescale the weights from the weighted k-nearest neighbors (WkNN) method. Both simulation and experimental results demonstrate that the proposed SALC system can effectively reconstruct the fingerprint database with an enhanced location estimation accuracy, which outperforms the other existing schemes in the open literature.
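For orientation, the baseline WkNN step that the abstract mentions can be sketched as follows. This is a generic inverse-distance variant, not the SALC/CsLE weight rescaling itself; the function name `wknn_locate`, the `eps` parameter, and the toy fingerprint database are all hypothetical and chosen only for illustration.

```python
import math

def wknn_locate(fingerprints, rss, k=3, eps=1e-9):
    """Estimate (x, y) from an RSS vector using inverse-distance WkNN.

    fingerprints: list of ((x, y), [rss_ap1, rss_ap2, ...]) reference points.
    rss: the online RSS measurement, one value per access point.
    """
    # Rank reference points by Euclidean distance in signal space.
    ranked = sorted((math.dist(ref_rss, rss), pos) for pos, ref_rss in fingerprints)
    nearest = ranked[:k]
    # Closer fingerprints get larger weights; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    x = sum(w * pos[0] for w, (_, pos) in zip(weights, nearest)) / total
    y = sum(w * pos[1] for w, (_, pos) in zip(weights, nearest)) / total
    return x, y

# Hypothetical database: three reference points, two access points (dBm).
database = [
    ((0.0, 0.0), [-40.0, -70.0]),
    ((10.0, 0.0), [-70.0, -40.0]),
    ((0.0, 10.0), [-90.0, -90.0]),
]
# The query is equally far (in signal space) from the first two points,
# so with k=2 the estimate lies midway between them.
estimate = wknn_locate(database, [-55.0, -55.0], k=2)
```

A static database like this one is exactly what degrades when the environment changes over time, which is the problem the abstract's online database reconstruction is meant to address.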
Othering and low prestige framing of immigrant cuisines in US restaurant reviews and large language models
Luo, Yiwei, Gligorić, Kristina, Jurafsky, Dan
Identifying and understanding implicit attitudes toward food can help efforts to mitigate social prejudice due to food's pervasive role as a marker of cultural and ethnic identity. Stereotypes about food are a form of microaggression that contribute to harmful public discourse that may in turn perpetuate prejudice toward ethnic groups and negatively impact economic outcomes for restaurants. Through careful linguistic analyses, we evaluate social theories about attitudes toward immigrant cuisine in a large-scale study of framing differences in 2.1M English language Yelp reviews of restaurants in 14 US states. Controlling for factors such as restaurant price and neighborhood racial diversity, we find that immigrant cuisines are more likely to be framed in objectifying and othering terms of authenticity (e.g., authentic, traditional), exoticism (e.g., exotic, different), and prototypicality (e.g., typical, usual), but that non-Western immigrant cuisines (e.g., Indian, Mexican) receive more othering than European cuisines (e.g., French, Italian). We further find that non-Western immigrant cuisines are framed less positively and as lower status, being evaluated in terms of affordability and hygiene. Finally, we show that reviews generated by large language models (LLMs) reproduce many of the same framing tendencies. Our results empirically corroborate social theories of taste and gastronomic stereotyping, and reveal linguistic processes by which such attitudes are reified.
Towards Generalizable Detection of Urgency of Discussion Forum Posts
Švábenský, Valdemar, Baker, Ryan S., Zambrano, Andrés, Zou, Yishan, Slater, Stefan
Students who take an online course, such as a MOOC, use the course's discussion forum to ask questions or reach out to instructors when encountering an issue. However, reading and responding to students' questions is difficult to scale because of the time needed to consider each message. As a result, critical issues may be left unresolved, and students may lose the motivation to continue in the course. To help address this problem, we build predictive models that automatically determine the urgency of each forum post, so that these posts can be brought to instructors' attention. This paper goes beyond previous work by predicting not just a binary decision cut-off but a post's level of urgency on a 7-point scale. First, we train and cross-validate several models on an original data set of 3,503 posts from MOOCs at the University of Pennsylvania. Second, to determine the generalizability of our models, we test their performance on a separate, previously published data set of 29,604 posts from MOOCs at Stanford University. While the previous work on post urgency used only one data set, we evaluated the prediction across different data sets and courses. The best-performing model was a support vector regressor trained on the Universal Sentence Encoder embeddings of the posts, achieving an RMSE of 1.1 on the training set and 1.4 on the test set. Understanding the urgency of forum posts enables instructors to focus their time more effectively and, as a result, better support student learning.
Using Linear Regression for Iteratively Training Neural Networks
We present a simple linear regression based approach for learning the weights and biases of a neural network, as an alternative to standard gradient based backpropagation. The present work is exploratory in nature, and we restrict the description and experiments to (i) simple feedforward neural networks, (ii) scalar (single output) regression problems, and (iii) invertible activation functions. However, the approach is intended to be extensible to larger, more complex architectures. The key idea is the observation that the input to every neuron in a neural network is a linear combination of the activations of neurons in the previous layer, as well as the parameters (weights and biases) of the layer. If we are able to compute the ideal total input values to every neuron by working backwards from the output, we can formulate the learning problem as a linear least squares problem which iterates between updating the parameters and the activation values. We present an explicit algorithm that implements this idea, and we show that (at least for small problems) the approach is more stable and faster than gradient-based methods.
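The key idea can be illustrated on the smallest possible case: a single neuron with an invertible activation. The sketch below is not the paper's algorithm (which iterates across layers); it only assumes a tanh activation, a scalar input, and a hypothetical helper `fit_line`, and shows how inverting the activation turns weight learning into ordinary least squares.

```python
import math

def fit_line(xs, zs):
    """Closed-form least squares for z = w * x + b with scalar x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    w = sxz / sxx
    b = mean_z - w * mean_x
    return w, b

# Toy data generated by one neuron: y = tanh(2.0 * x + 0.5).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
targets = [math.tanh(2.0 * x + 0.5) for x in xs]

# Work backwards from the output: invert the activation to recover the
# "ideal" total input to the neuron for each example.
ideal_inputs = [math.atanh(t) for t in targets]

# The total input is linear in the parameters, so plain linear regression
# recovers the weight and bias without any gradient steps.
w, b = fit_line(xs, ideal_inputs)
```

On this noise-free toy problem the fit recovers the generating weight and bias up to floating-point error; with hidden layers, the ideal inputs of earlier neurons are not observed directly, which is where an alternating update such as the one the abstract describes becomes necessary.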
balance -- a Python package for balancing biased data samples
Sarig, Tal, Galili, Tal, Eilat, Roee
Surveys are an important research tool, providing unique measurements on subjective experiences such as sentiment and opinions that cannot be measured by other means. However, because survey data is collected from a self-selected group of participants, directly inferring insights from it to a population of interest, or training ML models on such data, can lead to erroneous estimates or under-performing models. In this paper we present balance, an open-source Python package by Meta, offering a simple workflow for analyzing and adjusting biased data samples with respect to a population of interest. The balance workflow includes three steps: understanding the initial bias in the data relative to a target we would like to infer, adjusting the data to correct for the bias by producing weights for each unit in the sample based on propensity scores, and evaluating the final biases and the variance inflation after applying the fitted weights. The package provides a simple API that can be used by researchers and data scientists from a wide range of fields on a variety of data. The paper provides the relevant context, methodological background, and presents the package's API.
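The three steps can be illustrated without the package. The sketch below deliberately does not use the balance API; it assumes a single categorical covariate, for which a weight proportional to the inverse propensity reduces to the ratio of target share to sample share, and it uses Kish's design effect as one common measure of variance inflation. All function names and data are hypothetical.

```python
from collections import Counter

def cell_weights(sample_groups, target_shares):
    """Weight each unit by (target share) / (sample share) of its group."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [target_shares[g] / (counts[g] / n) for g in sample_groups]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2; 1.0 means no inflation."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

# Step 1 (understand the bias): the sample over-represents "young" units.
groups = ["young"] * 8 + ["old"] * 2
outcome = [1.0] * 8 + [0.0] * 2          # hypothetical survey response
target = {"young": 0.5, "old": 0.5}      # assumed population shares

# Step 2 (adjust): produce one weight per sampled unit.
weights = cell_weights(groups, target)

# Step 3 (evaluate): compare estimates and check the variance inflation.
naive_estimate = sum(outcome) / len(outcome)
adjusted_estimate = weighted_mean(outcome, weights)
design_effect = kish_design_effect(weights)
```

Here the unweighted mean is 0.8, the weighted mean is 0.5, and the design effect is about 1.56, showing the usual trade-off: correcting the bias costs effective sample size.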
Efficient Strongly Polynomial Algorithms for Quantile Regression
Shetiya, Suraj, Hasan, Shohedul, Asudeh, Abolfazl, Das, Gautam
Linear Regression is a seminal technique in statistics and machine learning, where the objective is to build linear predictive models between a response (i.e., dependent) variable and one or more predictor (i.e., independent) variables from a given dataset of n instances, where each instance is a set of values of the independent variables and the corresponding value of the dependent variable. One of the classical and widely used approaches is Ordinary Least Squares Regression (OLS), where the objective is to minimize the average squared error between the predicted and actual value of the dependent variable. Another classical approach is Quantile Regression (QR), where the objective is to minimize the average weighted absolute error between the predicted and actual value of the dependent variable. QR (also known as "Median Regression" for the special case of the middle quantile) is less affected by outliers and thus statistically a more robust alternative to OLS [15, 18]. However, while there exist efficient algorithms for OLS, the state-of-the-art algorithms for QR require solving large linear programs with many variables and constraints. They can be solved using interior point methods [24], which are weakly polynomial (i.e., in the arithmetic computation model the running time is polynomial in the number of bits required to represent the rational numbers in the input), or using Simplex-based exterior point methods, which can have exponential time complexity in the worst case [10]. The main focus of our paper is an investigation of the computational complexity of Quantile Regression, and in particular, the design of efficient strongly polynomial algorithms (i.e., in the arithmetic computation model the running time is polynomial in the number of rational numbers in the input) for various special cases of the problem.
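The contrast between the two objectives can be seen in a minimal sketch restricted to constant (intercept-only) models. It is not one of the paper's algorithms: it uses a brute-force search over the data points, relying on the fact that the piecewise-linear QR objective attains its minimum at one of them. The names `pinball_loss` and `best_constant` and the toy data are illustrative assumptions.

```python
def pinball_loss(residual, tau):
    """Weighted absolute error: tau * r for r >= 0, (1 - tau) * |r| otherwise."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

def average_pinball_loss(y, predictions, tau):
    return sum(pinball_loss(yi - pi, tau) for yi, pi in zip(y, predictions)) / len(y)

def best_constant(y, tau):
    """Brute-force QR fit of a constant model: try every data point."""
    return min(sorted(y), key=lambda c: average_pinball_loss(y, [c] * len(y), tau))

# One large outlier pulls the least-squares fit (the mean) far from the bulk
# of the data, while the median regression fit stays put.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
ols_fit = sum(y) / len(y)            # minimizer of average squared error
median_fit = best_constant(y, 0.5)   # minimizer of average absolute error
lower_fit = best_constant(y, 0.25)   # a lower quantile of the same data
```

With predictor variables the same objective is typically minimized via a linear program, which is the setting whose complexity the abstract discusses.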