Regression
Neural networks for geospatial data
Geostatistics, the analysis of geocoded data, is traditionally based on stochastic process models, which offer a coherent way to model data at any finite collection of locations while ensuring that inference generalizes to the entire region. Gaussian processes (GPs), with a mean function capturing the effects of covariates and a covariance function encoding the spatial dependence, are a staple of geostatistical analysis, offering theoretical guarantees and practical benefits. GPs are flexible enough to model any smooth spatial surface, and can be specified parsimoniously with covariance functions using a very small set of parameters. The spatial covariance parameters offer insights into the smoothness and spatial properties of the response process (Stein, 1999). The finite-dimensional realizations of a GP are multivariate Gaussian, so the mean and covariance parameters can be estimated by maximizing the Gaussian likelihood, and predictions at new locations follow from conditional Gaussian distributions (see, e.g., Banerjee et al., 2014; Cressie and Wikle, 2015, for a detailed exposition of GP models for spatial and spatio-temporal data). Computational roadblocks to using GPs for large spatial data have also been greatly mitigated by recent advances (see Heaton et al., 2019, for a recent review of scalable GP approaches). The mean function of a Gaussian process is often modeled as a linear regression on the covariates. The growing popularity and accessibility of machine learning algorithms such as neural networks, random forests, and gradient boosted trees, which are capable of modeling complex non-linear relationships, has heralded a paradigm shift. Practitioners are increasingly shunning models with parametric assumptions like linearity in favor of machine learning approaches that can capture non-linearity and high-order interactions in a data-driven manner. The field of spatial statistics has not been insulated from this machine learning revolution.
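The classical setup described above, a GP with a linear mean and a parametric covariance, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration: the exponential covariance, the fixed values of `sigma2`, `phi`, and the nugget `tau2`, and the function names `exp_cov` and `gp_fit_predict`. In practice the covariance parameters would be estimated by maximizing the Gaussian likelihood rather than fixed.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=3.0):
    """Exponential covariance sigma2 * exp(-phi * distance) between two sets of locations."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def gp_fit_predict(S, X, y, S_new, X_new, sigma2=1.0, phi=3.0, tau2=0.1):
    """Estimate the linear mean by generalized least squares, then krige.

    S, S_new: (n, 2) and (m, 2) location arrays; X, X_new: covariate matrices.
    Covariance parameters are treated as known (an assumption of this sketch).
    """
    K = exp_cov(S, S, sigma2, phi) + tau2 * np.eye(len(S))
    L = np.linalg.cholesky(K)
    # Whitening by the Cholesky factor turns GLS into ordinary least squares.
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    # Conditional Gaussian mean at new locations: mean function plus kriged residual.
    resid = y - X @ beta
    k_new = exp_cov(S_new, S, sigma2, phi)
    pred = X_new @ beta + k_new @ np.linalg.solve(K, resid)
    return beta, pred

# Small synthetic demo with simulated locations and one covariate.
rng = np.random.default_rng(0)
n, m = 100, 10
S = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
w = np.linalg.cholesky(exp_cov(S, S) + 1e-8 * np.eye(n)) @ rng.standard_normal(n)
y = X @ np.array([1.0, 2.0]) + w + np.sqrt(0.1) * rng.standard_normal(n)
S_new = rng.uniform(size=(m, 2))
X_new = np.column_stack([np.ones(m), rng.standard_normal(m)])
beta_hat, y_pred = gp_fit_predict(S, X, y, S_new, X_new)
print("estimated coefficients:", beta_hat)
```

Replacing the linear term `X @ beta` with a neural network or tree ensemble is the kind of extension the passage alludes to; the sketch covers only the linear baseline.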
Machine Learning Research Trends in Africa: A 30 Years Overview with Bibliometric Analysis Review
Ezugwu, Absalom E., Oyelade, Olaide N., Ikotun, Abiodun M., Agushaka, Jeffery O., Ho, Yuh-Shan
The machine learning (ML) paradigm has gained much popularity today. Its algorithmic models are employed in fields such as natural language processing, pattern recognition, object detection, image recognition, earth observation and many other research areas. Machine learning technologies now feature in the technological transformation agendas being promoted by many nations, and the benefits already realized are outstanding. From a regional perspective, several studies have shown that machine learning technology can help address some of Africa's most pervasive problems, such as poverty alleviation, improving education, delivering quality healthcare services, and addressing sustainability challenges like food security and climate change. In this paper, a critical bibliometric analysis is conducted, coupled with an extensive literature survey of recent developments and associated applications in machine learning research with a perspective on Africa. The bibliometric analysis covers 2761 machine learning-related documents, of which 89% were articles with at least 482 citations published in 903 journals during the past three decades. The collated documents were retrieved from the Science Citation Index EXPANDED and comprise research publications from 54 African countries between 1993 and 2021. The bibliometric study visualizes the current landscape and future trends in machine learning research and its applications, to facilitate future collaborative research and knowledge exchange among authors from research institutions across the African continent.
Machine Learning Applications in Studying Mental Health Among Immigrants and Racial and Ethnic Minorities: A Systematic Review
Park, Khushbu Khatri, Ahmed, Abdulaziz, Al-Garadi, Mohammed Ali
Background: The use of machine learning (ML) in mental health (MH) research is increasing, especially as new, more complex data types become available to analyze. By systematically examining the published literature, this review aims to uncover potential gaps in the current use of ML to study MH in vulnerable populations of immigrants, refugees, migrants, and racial and ethnic minorities. Methods: In this systematic review, we queried Google Scholar for ML-related terms, MH-related terms, and a population-of-focus search term, strung together with Boolean operators. Backward reference searching was also conducted. Included peer-reviewed studies reported using a method or application of ML in an MH context and focused on the populations of interest. We did not apply date cutoffs. Publications were excluded if they were narrative or did not exclusively focus on a minority population from the respective country. Data including study context, the focus of mental healthcare, sample, data type, type of ML algorithm used, and algorithm performance were extracted from each study. Results: Our search strategies resulted in 67,410 listed articles from Google Scholar. Ultimately, 12 were included. All the articles were published within the last 6 years, and half of them studied populations within the US. Most reviewed studies used supervised learning to explain or predict MH outcomes. Some publications used up to 16 models to determine the best predictive power. Almost half of the included publications did not discuss their cross-validation method. Conclusions: The included studies, few as they may be, provide proof of concept for the potential use of ML algorithms to address MH concerns in these special populations. Our systematic review finds that the clinical application of these models for classifying and predicting MH disorders is still under development.
Charting the Topography of the Neural Network Landscape with Thermal-Like Noise
Jules, Theo, Brener, Gal, Kachman, Tal, Levi, Noam, Bar-Sinai, Yohai
The training of neural networks is a complex, high-dimensional, non-convex and noisy optimization problem whose theoretical understanding is interesting both from an applicative perspective and for fundamental reasons. A core challenge is to understand the geometry and topography of the landscape that guides the optimization. In this work, we employ standard Statistical Mechanics methods, namely, phase-space exploration using Langevin dynamics, to study this landscape for an over-parameterized fully connected network performing a classification task on random data. Analyzing the fluctuation statistics, in analogy to thermal dynamics at a constant temperature, we infer a clear geometric description of the low-loss region. We find that it is a low-dimensional manifold whose dimension can be readily obtained from the fluctuations. Furthermore, this dimension is controlled by the number of data points that reside near the classification decision boundary. Importantly, we find that a quadratic approximation of the loss near the minimum is fundamentally inadequate due to the exponential nature of the decision boundary and the flatness of the low-loss region. This causes the dynamics to sample regions with higher curvature at higher temperatures, while producing quadratic-like statistics at any given temperature. We explain this behavior by a simplified loss model which is analytically tractable and reproduces the observed fluctuation statistics.
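The idea of reading landscape geometry off thermal fluctuations can be illustrated with a minimal sketch of overdamped Langevin dynamics. The setup is an assumption chosen for simplicity and is not the network or loss studied in the abstract: a two-parameter quadratic toy loss with curvatures `curv`, a fixed step size, and many independent chains. For such a loss, the stationary variance along each direction is approximately `temperature / curvature`, so measured fluctuations recover the curvature. The abstract argues that this quadratic picture breaks down for the actual network loss; the sketch only shows the mechanics of the sampling method.

```python
import numpy as np

def langevin_sample(grad, theta0, step, temperature, n_steps, rng):
    """Overdamped Langevin updates: theta <- theta - step * grad + sqrt(2 * step * T) * noise."""
    theta = np.array(theta0, dtype=float)
    noise_scale = np.sqrt(2.0 * step * temperature)
    for _ in range(n_steps):
        theta = theta - step * grad(theta) + noise_scale * rng.standard_normal(theta.shape)
    return theta

# Toy quadratic loss L(theta) = 0.5 * sum(curv * theta**2); its gradient is curv * theta.
curv = np.array([1.0, 4.0])      # hypothetical curvatures along two directions
temperature = 0.5

def grad(theta):
    return curv * theta

rng = np.random.default_rng(0)
# 4000 independent chains, each run long enough to forget the initial condition.
final = langevin_sample(grad, np.zeros((4000, 2)), step=0.01,
                        temperature=temperature, n_steps=2000, rng=rng)
var = final.var(axis=0)
print("measured variance:      ", var)
print("temperature / curvature:", temperature / curv)
```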
Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions
Deliu, Nina, Williams, Joseph Jay, Chakraborty, Bibhas
In recent years, reinforcement learning (RL) has acquired a prominent position in the space of health-related sequential decision-making, becoming an increasingly popular tool for delivering adaptive interventions (AIs). However, despite potential benefits, its real-life application is still limited, partly due to a poor synergy between the methodological and the applied communities. In this work, we provide the first unified survey on RL methods for learning AIs, using the common methodological umbrella of RL to bridge the two AI areas of dynamic treatment regimes and just-in-time adaptive interventions in mobile health. We outline similarities and differences between these two AI domains and discuss their implications for using RL. Finally, we leverage our experience in designing case studies in both areas to illustrate the tremendous collaboration opportunities between statistical, RL, and healthcare researchers in the space of AIs.
Minimax Signal Detection in Sparse Additive Models
In the interest of interpretability, computation, and circumventing the statistical curse of dimensionality plaguing high dimensional regression, structure is often assumed on the true regression function. Indeed, it might plausibly be argued that sparse linear regression is the distinguishing export of modern statistics. Despite its popularity, circumstances may call for more flexibility to capture nonlinear effects of the covariates. Striking a balance between flexibility and structure, Hastie and Tibshirani [19] proposed generalized additive models (GAMs) as a natural extension to the vaunted linear model. In a GAM, the regression function admits an additive decomposition of univariate (nonlinear) component functions. However, as in the linear model, the sample size must outpace the dimension for consistent estimation. Following modern statistical instinct, a sparse additive model is compelling [28, 32, 34, 37, 38, 47]. The regression function admits an additive decomposition of univariate functions for which only a small subset are nonzero; it is the combination of a GAM and sparsity.
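For concreteness, the sparse additive model described above can be written as follows. The notation is assumed for illustration rather than taken from the source: $n$ observations, $p$ covariates, component functions $f_j$, noise $\varepsilon_i$, and an active set $S$ of size $s$.

```latex
Y_i = \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i, \qquad i = 1, \dots, n,
\qquad \text{with } f_j \equiv 0 \text{ for all } j \notin S, \quad |S| = s \ll p .
```

Taking each $f_j$ to be linear recovers sparse linear regression, and taking $S = \{1, \dots, p\}$ recovers the ordinary additive model.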
Cross or Wait? Predicting Pedestrian Interaction Outcomes at Unsignalized Crossings
Zhang, Chi, Kalantari, Amir Hossein, Yang, Yue, Ni, Zhongjun, Markkula, Gustav, Merat, Natasha, Berger, Christian
Predicting pedestrian behavior when interacting with vehicles is one of the most critical challenges in the field of automated driving. Pedestrian crossing behavior is influenced by various interaction factors, including time to arrival, pedestrian waiting time, the presence of a zebra crossing, and the properties and personality traits of both pedestrians and drivers. However, these factors have not been fully explored for use in predicting interaction outcomes. In this paper, we use machine learning to predict pedestrian crossing behavior, including the crossing decision, crossing initiation time (CIT), and crossing duration (CD), when interacting with vehicles at unsignalized crossings. Distributed simulator data are used for predicting and analyzing the interaction factors. Compared with the logistic regression baseline model, our proposed neural network model improves the prediction accuracy and F1 score by 4.46% and 3.23%, respectively. Our model also reduces the root mean squared error (RMSE) for CIT and CD by 21.56% and 30.14% compared with the linear regression model. Additionally, we analyze the importance of the interaction factors and present the results of models using fewer factors. This provides information for model selection in scenarios with limited input features.
Analysis of Interpolating Regression Models and the Double Descent Phenomenon
A regression model with more parameters than data points in the training data is overparametrized and has the capability to interpolate the training data. Based on the classical bias-variance tradeoff expressions, it is commonly assumed that models which interpolate noisy training data generalize poorly. In some cases, this is not true: the best models obtained are overparametrized, and the testing error exhibits double descent behavior as the model order increases. In this contribution, we provide some analysis to explain the double descent phenomenon, first reported in the machine learning literature. We focus on interpolating models derived from the minimum norm solution to the classical least-squares problem and also briefly discuss model fitting using ridge regression. We derive a result based on the behavior of the smallest singular value of the regression matrix that explains the peak location and the double descent shape of the testing error as a function of model order.
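A minimal sketch of the minimum norm least-squares fit may help make the setup concrete. The data model is an assumption for illustration (Gaussian features, a random true weight vector, additive noise), as are the function names `min_norm_fit` and `double_descent_curve`. The model order is the number of leading features used; once it exceeds the number of training points, the pseudoinverse solution interpolates the training data, and the test error typically peaks near the point where the two are equal.

```python
import numpy as np

def min_norm_fit(A, y):
    """Minimum norm least-squares solution via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(A) @ y

def double_descent_curve(n_train=40, n_test=500, p_max=120, noise=0.5, seed=0):
    """Test mean squared error as a function of model order p (number of features used)."""
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(p_max) / np.sqrt(p_max)
    A_tr = rng.standard_normal((n_train, p_max))
    A_te = rng.standard_normal((n_test, p_max))
    y_tr = A_tr @ w_true + noise * rng.standard_normal(n_train)
    y_te = A_te @ w_true
    errs = {}
    for p in range(5, p_max + 1, 5):
        w = min_norm_fit(A_tr[:, :p], y_tr)   # fit using only the first p features
        errs[p] = float(np.mean((A_te[:, :p] @ w - y_te) ** 2))
    return errs

errs = double_descent_curve()
for p, e in errs.items():
    print(f"model order {p:3d}: test MSE {e:.3f}")
```

The smallest singular value of `A_tr[:, :p]` is usually smallest when `p` is close to `n_train`, which is the quantity the abstract ties to the location of the peak.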
Secure PAC Bayesian Regression via Real Shamir Secret Sharing
Gundersen, Jaron Skovsted, Kuskonmaz, Bulut, Wisniewski, Rafael
A common approach in system identification and machine learning is to use training data to build a model that predicts test data instances as accurately as possible. Concerns about data privacy are increasingly raised, but not always addressed. We present a secure protocol for learning a linear model, relying on a recently described technique called real number secret sharing. We take the PAC Bayesian bounds as our starting point and deduce a closed form for the model parameters, which depends on the data and on the prior from the PAC Bayesian bounds. To obtain the model parameters, one needs to solve a linear system. However, we consider the situation where several parties hold different data instances and are not willing to give up the privacy of their data. Hence, we suggest using real number secret sharing and multiparty computation to share the data and solve the linear regression securely, without violating the privacy of the data. We suggest two methods, a secure inverse method and a secure Gaussian elimination method, and compare them at the end. The benefit of using secret sharing directly on real numbers is reflected in the simplicity of the protocols and the number of rounds needed. The drawback is that a share might leak a small amount of information, but our analysis argues that this leakage is small.
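The basic mechanism of Shamir-style secret sharing over the real numbers can be sketched as follows. This is an illustrative assumption, not the protocol from the abstract: the Gaussian coefficient distribution, the `coef_scale` value, the evaluation points 1..n, and the function names `share` and `reconstruct` are all hypothetical choices. A secret is hidden as the constant term of a random polynomial of degree `threshold`, each party receives one evaluation, and any `threshold + 1` shares recover the secret by Lagrange interpolation at zero. Because sharing is linear, parties can add their shares locally to obtain shares of a sum, which is the building block for aggregating data without revealing it.

```python
import numpy as np

def share(secret, n_parties, threshold, rng, coef_scale=10.0):
    """Split a real secret into n_parties shares of a random degree-`threshold` polynomial."""
    coefs = rng.normal(0.0, coef_scale, size=threshold)      # random non-constant coefficients
    xs = np.arange(1, n_parties + 1, dtype=float)            # public evaluation points
    powers = xs[:, None] ** np.arange(1, threshold + 1)      # x, x^2, ..., x^threshold
    return xs, secret + powers @ coefs

def reconstruct(xs, shares):
    """Recover the secret by Lagrange interpolation at x = 0 (needs threshold + 1 shares)."""
    total = 0.0
    for i, (xi, si) in enumerate(zip(xs, shares)):
        others = np.delete(xs, i)
        total += si * np.prod(others / (others - xi))
    return total

rng = np.random.default_rng(0)
xs, shares_a = share(3.5, n_parties=5, threshold=2, rng=rng)
_, shares_b = share(-1.25, n_parties=5, threshold=2, rng=rng)

# Any 3 of the 5 shares suffice for threshold 2.
recovered_a = reconstruct(xs[:3], shares_a[:3])
# Adding shares pointwise gives shares of the sum of the two secrets.
recovered_sum = reconstruct(xs[2:], (shares_a + shares_b)[2:])
print(recovered_a, recovered_sum)
```

Unlike sharing over a finite field, real-valued shares are not perfectly hiding and reconstruction is subject to floating-point error, which corresponds to the leakage trade-off the abstract mentions.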
Ontology for Healthcare Artificial Intelligence Privacy in Brazil
Vaz, Tiago Andres, Dora, José Miguel Silva, Lamb, Luís da Cunha, Camey, Suzi Alves
Using the terminology defined by current legislation, the article outlines a systematic approach to handling hospital data anonymously in preparation for its use in Artificial Intelligence (AI) applications in healthcare. The development process consisted of 7 pragmatic steps, including defining scope, selecting knowledge, reviewing important terms, constructing classes that describe designs used in epidemiological studies, machine learning paradigms, types of data and attributes, risks that anonymized data may be exposed to, privacy attacks, techniques to mitigate re-identification, privacy models, and metrics for measuring the effects of anonymization. The article concludes by demonstrating the practical implementation of this ontology in hospital settings for the development and validation of AI.