Regression
High-dimensional analysis of double descent for linear regression with random projections
Over-parameterized models estimated with some form of gradient descent come in various forms, such as linear regression with potentially non-linear features, neural networks, or kernel methods. The double descent phenomenon can be seen empirically in several of these models [6, 15]: given a fixed prediction problem, as the number of parameters of the model increases from zero to the number of observations, the generalization error traditionally goes down and then up, due to overfitting. Once the number of parameters exceeds the number of observations, the generalization error decreases again, as illustrated in Figure 1. The phenomenon has been theoretically analyzed in several settings, such as random features based on neural networks [27], random Fourier features [24], or linear regression [7, 17]. While the analysis of [27, 24] for random features corresponds to a single prediction problem with a sequence of increasingly larger prediction models, most of the analysis of [17] for linear regression does not consider a single problem but varying problems, which does not actually lead to a double descent curve. Random subsampling on a single prediction problem was analyzed with a simpler model with isotropic covariance matrices in [7] and [17, Section 5.2], but without a proper double descent, as the model is too simple to account for a U-shaped curve in the under-parameterized regime. In work related to ours, principal component regression was analyzed by [37] with a double descent curve, but with less general assumptions regarding the spectrum of the covariance matrix and the optimal predictor.
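For orientation, one common way to write the random-projection setting is sketched below. The notation ($X$, $y$, $S$, $n$, $d$, $m$) is an assumption introduced here for illustration and is not taken from the text above.

```latex
% Assumed notation: X \in R^{n x d} inputs, y \in R^{n} responses,
% S \in R^{d x m} a random projection matrix, m = number of projected features.
\hat{\eta} = (XS)^{+}\, y, \qquad \hat{\theta} = S\,\hat{\eta},
% where (XS)^{+} is the Moore-Penrose pseudo-inverse, i.e. \hat{\eta} is the
% minimum-norm solution of \min_{\eta} \| y - XS\eta \|_2^2 .
```

Under this reading, $m$ plays the role of the number of parameters, and for generic data the model can interpolate the $n$ observations once $m \geq n$.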
Human heuristics for AI-generated language are flawed
Jakesch, Maurice, Hancock, Jeffrey, Naaman, Mor
Human communication is increasingly intermixed with language generated by AI. Across chat, email, and social media, AI systems suggest words, complete sentences, or produce entire conversations. AI-generated language is often not identified as such but presented as language written by humans, raising concerns about novel forms of deception and manipulation. Here, we study how humans discern whether verbal self-presentations, one of the most personal and consequential forms of language, were generated by AI. In six experiments, participants (N = 4,600) were unable to detect self-presentations generated by state-of-the-art AI language models in professional, hospitality, and dating contexts. A computational analysis of language features shows that human judgments of AI-generated language are hindered by intuitive but flawed heuristics such as associating first-person pronouns, use of contractions, or family topics with human-written language. We experimentally demonstrate that these heuristics make human judgment of AI-generated language predictable and manipulable, allowing AI systems to produce text perceived as "more human than human." We discuss solutions, such as AI accents, to reduce the deceptive potential of language generated by AI, limiting the subversion of human intuition.
Neural Network Compression for Noisy Storage Devices
Isik, Berivan, Choi, Kristy, Zheng, Xin, Weissman, Tsachy, Ermon, Stefano, Wong, H. -S. Philip, Alaghi, Armin
Compression and efficient storage of neural network (NN) parameters is critical for applications that run on resource-constrained devices. Despite the significant progress in NN model compression, there has been considerably less investigation into the actual physical storage of NN parameters. Conventionally, model compression and physical storage are decoupled, as digital storage media with error-correcting codes (ECCs) provide robust error-free storage. However, this decoupled approach is inefficient as it ignores the overparameterization present in most NNs and forces the memory device to allocate the same amount of resources to every bit of information regardless of its importance. In this work, we investigate analog memory devices as an alternative to digital media, one that, unlike its digital counterpart, naturally provides a way to add more protection for significant bits, but is noisy and may compromise the stored model's performance if used naively. We develop a variety of robust coding strategies for NN weight storage on analog devices, and propose an approach to jointly optimize model compression and memory resource allocation. We then demonstrate the efficacy of our approach on models trained on MNIST, CIFAR-10 and ImageNet datasets for existing compression techniques. Compared to conventional error-free digital storage, our method reduces the memory footprint by up to one order of magnitude, without significantly compromising the stored model's accuracy.
TM-vector: A Novel Forecasting Approach for Market stock movement with a Rich Representation of Twitter and Market data
Sasani, Faraz, Mousa, Ramin, Karkehabadi, Ali, Dehbashi, Samin, Mohammadi, Ali
Stock market forecasting has been a challenging problem for many analysts and researchers. Trend analysis, statistical techniques, and movement indicators have traditionally been used to predict stock price movements, but text extraction has emerged as a promising method in recent years. The use of neural networks, especially recurrent neural networks, is abundant in the literature. In most studies, the impact of different users was considered equal or ignored, whereas users can have different effects. In the current study, we introduce TM-vector and then use this vector to train an IndRNN and ultimately model the market users' behaviour. In the proposed model, TM-vector is simultaneously trained with both the extracted Twitter features and market information. To predict market direction more accurately, the proposed forecasting approach uses various factors, including the characteristics of each individual user, their impact on each other, and their impact on the market. The Dow Jones 30 index is used in the current work. Across various models, the accuracy obtained for predicting daily stock changes of Apple is close to or over 95%, and the accuracy for the other stocks is also significant. Our results indicate the effectiveness of TM-vector in predicting stock market direction.
A new methodology to predict the oncotype scores based on clinico-pathological data with similar tumor profiles
Masry, Zeina Al, Pic, Romain, Dombry, Clément, Devalland, Christine
Introduction: The Oncotype DX (ODX) test is a commercially available molecular assay for breast cancer that provides prognostic and predictive breast cancer recurrence information for hormone-positive, HER2-negative patients. The aim of this study is to propose a novel methodology to assist physicians in their decision-making. Methods: A retrospective study between 2012 and 2020 with 333 cases that underwent an ODX assay from three hospitals in Bourgogne Franche-Comté was conducted. Clinical and pathological reports were used to collect the data. A methodology based on distributional random forest was developed using 9 clinico-pathological characteristics. This methodology can be used in particular to identify the patients of the training cohort that share similarities with the new patient and to predict an estimate of the distribution of the ODX score. Results: The mean age of participants is 56.9 years. We have correctly classified 92% of patients in low risk and 40.2% of patients in high risk. The overall accuracy is 79.3%. The proportion of correct low-risk predictions (PPV) is 82%. The proportion of correct high-risk predictions (NPV) is approximately 62.3%. The F1-score and the Area Under Curve (AUC) are 0.87 and 0.759, respectively. Conclusion: The proposed methodology makes it possible to predict the distribution of the ODX score for a patient and provides an explanation of the predicted score. Used together with the pathologist's expertise on the different histological and immunohistochemical characteristics, the methodology has a clinical impact by helping oncologists in decision-making regarding breast cancer therapy.
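The metrics reported in the abstract above (accuracy, PPV, NPV, F1) all derive from a binary confusion matrix. The following is a minimal sketch of those definitions, assuming that low risk is treated as the positive class, as the abstract's PPV/NPV labelling suggests. The function name and the counts in the usage lines are hypothetical and are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    Assumption: the positive class is "low risk".
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of actual positives found
    specificity = tn / (tn + fp)   # share of actual negatives found
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, tn=5, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the sketch does not guard against empty classes (a zero denominator); real evaluation code would need to handle that case.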
Accurate Prediction of Global Mean Temperature through Data Transformation Techniques
Niyogi, Debdarsan, Srinivasan, J.
It is important to predict how the Global Mean Temperature (GMT) will evolve in the next few decades. The ability to predict historical data is a necessary first step toward the actual goal of making long-range forecasts. This paper examines the advantage of statistical and simpler Machine Learning (ML) methods over directly using complex ML algorithms and Deep Learning Neural Networks (DNN). Data transformation methods, which are often neglected, were applied before the different algorithms as a means of improving predictive accuracy. The GMT time series is treated both as a univariate time series and as a regression problem. Some data transformation steps were found to be effective. Various simple ML methods did as well as or better than the more well-known ones, showing merit in trying a large bouquet of algorithms as a first step. Fifty-six algorithms were subjected to Box-Cox, Yeo-Johnson, and first-order differencing transformations and compared with results obtained without them. Predictions for the annual GMT testing data were better than those published so far, with the lowest RMSE value of 0.02 °C. RMSE for five-year mean GMT values for the test data ranged from 0.00002 to 0.00036 °C.
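The three transformations named in the abstract above have standard textbook definitions, sketched here in plain Python. This is an illustration of the transforms only, not the authors' pipeline. The parameter `lam` and the sample series are hypothetical, and in practice `lam` is usually estimated from the data.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform; defined only for strictly positive x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform; also defined for zero and negative x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def first_difference(series):
    """First-order differencing: d[t] = y[t+1] - y[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def undo_first_difference(first_value, diffs):
    """Invert differencing by cumulative summation from the first value."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out


# Hypothetical anomaly-like series; it contains non-positive values, so
# Yeo-Johnson applies directly while Box-Cox would need a positive shift.
series = [-0.25, 0.0, 0.25, 0.5]
print([yeo_johnson(v, 0.5) for v in series])
print(first_difference(series))
```

Because temperature anomalies can be zero or negative, Box-Cox needs a shift to positive values first, which is one reason Yeo-Johnson is often tried alongside it.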
Predicting Hurricane Evacuation Decisions with Interpretable Machine Learning Models
Sun, Yuran, Huang, Shih-Kai, Zhao, Xilei
The aggravating effects of climate change and the growing population in hurricane-prone areas escalate the challenges in large-scale hurricane evacuations. While hurricane preparedness and response strategies rely heavily on the accuracy and timeliness of predicted household evacuation decisions, current studies featuring psychologically driven linear models have significant limitations in practice. Hence, the present study proposes a new methodology for predicting households' evacuation decisions that is built on easily accessible demographic and resource-related predictors, in contrast to current models that rely heavily on psychological factors. Meanwhile, an enhanced logistic regression (ELR) model that automatically accounts for nonlinearities (i.e., univariate and bivariate threshold effects) through an interpretable machine learning approach is developed to secure the accuracy of the results. Specifically, low-depth decision trees are selected for nonlinearity detection to identify the critical thresholds, build a transparent model structure, and improve robustness. Then, an empirical dataset collected after Hurricanes Katrina and Rita is used to examine the practicability of the new methodology. The results indicate that the ELR model has the most convincing performance in explaining the variation in households' evacuation decisions, in both model fit and prediction capability, compared to previous linear models. This suggests that the proposed methodology could provide a new tool and framework for emergency management authorities to improve the estimation of evacuation traffic demands in a timely and accurate manner.
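The idea of using a low-depth tree to find a threshold and then feeding it to a logistic regression can be sketched as follows. This is a simplified illustration with a depth-1 stump and Gini impurity, not the ELR procedure itself, whose details are not given in the abstract. All function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Depth-1 decision stump: the split minimizing weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [label for _, label in pairs]
    n = len(xs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no valid split between identical values
        t = (xs[i - 1] + xs[i]) / 2.0
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t


def threshold_feature(values, t):
    """Indicator column (1 if value > t) to add to a logistic regression."""
    return [1 if v > t else 0 for v in values]


# Toy data: a predictor with a sharp threshold effect on a binary decision.
predictor = [1, 2, 3, 10, 11, 12]
evacuated = [0, 0, 0, 1, 1, 1]
t = best_threshold(predictor, evacuated)
print(t, threshold_feature(predictor, t))
```

The resulting indicator column stays interpretable: its coefficient in the logistic regression is the change in log-odds when the predictor crosses the detected threshold.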
NFL Career Success as Predicted by NFL Scouting Combine
Szekely, Brian, Sinnott, Christian, Halow, Savannah, Ryan, Gregory
The National Football League (NFL) Scouting Combine serves as a tool to evaluate the skills of prospective players and assess their readiness to play in the NFL. The development of machine learning brings new opportunities in assessing the utility of the Scouting Combine. Using machine and statistical learning, it may be possible to predict future success of prospective athletes, as well as predict which Scouting Combine tests are the most important. Results from statistical learning research have been contradictory as to whether the Scouting Combine is a useful metric for player success. In this study, we investigate whether machine learning can be used to determine matriculation and future success in the NFL. Using Scouting Combine data, we evaluate six different algorithms' ability to predict whether a potential draft pick will play a single NFL snap (matriculation). If a player is drafted, we predict how many snaps they go on to play (success). We are able to predict matriculation with 83% accuracy; however, we are unable to predict later success. Our best performing algorithm returns large error and low explained variance (RMSE = 1,210 snaps; $R^2$ = 0.17). These findings indicate that while the Scouting Combine can predict NFL matriculation, it may not be a reliable predictor of long-term player success.
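The two regression metrics quoted in the abstract above, RMSE and $R^2$, can be computed as in the following minimal sketch. The function names and the snap counts are hypothetical and are not the study's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical career snap counts and model predictions.
actual = [100.0, 900.0, 2500.0, 400.0]
predicted = [300.0, 700.0, 1800.0, 900.0]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

An $R^2$ near zero means the model explains little more variance than always predicting the mean, which is the sense in which the abstract reports low explained variance.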
A Survey on Event-based News Narrative Extraction
Norambuena, Brian Keith, Mitra, Tanushree, North, Chris
Narratives are fundamental to our understanding of the world, providing us with a natural structure for knowledge representation over time. Computational narrative extraction is a subfield of artificial intelligence that makes heavy use of information retrieval and natural language processing techniques. Despite the importance of computational narrative extraction, relatively little scholarly work exists on synthesizing previous research and strategizing future research in the area. In particular, this article focuses on extracting news narratives from an event-centric perspective. Extracting narratives from news data has multiple applications in understanding the evolving information landscape. This survey presents an extensive study of research in the area of event-based news narrative extraction. In particular, we screened over 900 articles that yielded 54 relevant articles. These articles are synthesized and organized by representation model, extraction criteria, and evaluation approaches. Based on the reviewed studies, we identify recent trends, open challenges, and potential research lines.
Generalizable machine learning for stress monitoring from wearable devices: A systematic literature review
Vos, Gideon, Trinh, Kelly, Sarnyai, Zoltan, Azghadi, Mostafa Rahimi
Introduction. The stress response has both subjective, psychological and objectively measurable, biological components. Both of them can be expressed differently from person to person, complicating the development of a generic stress measurement model. This is further compounded by the lack of large, labeled datasets that can be utilized to build machine learning models for accurately detecting periods and levels of stress. The aim of this review is to provide an overview of the current state of stress detection and monitoring using wearable devices and, where applicable, the machine learning techniques utilized. Methods. This study reviewed published works contributing and/or using datasets designed for detecting stress and their associated machine learning methods, with a systematic review and meta-analysis of those that utilized wearable sensor data as stress biomarkers. The electronic databases of Google Scholar, Crossref, DOAJ and PubMed were searched for relevant articles, and a total of 24 articles were identified and included in the final analysis. The reviewed works were synthesized into three categories: publicly available stress datasets, machine learning, and future research directions. Results. A wide variety of study-specific test and measurement protocols were noted in the literature. A number of public datasets were identified that are labeled for stress detection. In addition, we discuss how previous works show shortcomings in areas such as their labeling protocols, lack of statistical power, validity of stress biomarkers, and generalization ability. Conclusion. Generalization of existing machine learning models still requires further study, and research in this area will continue to provide improvements as newer and more substantial datasets become available for study.