Regression
On the Saturation Effects of Spectral Algorithms in Large Dimensions
Lu, Weihao, Zhang, Haobo, Li, Yicheng, Lin, Qian
The saturation effects, which originally refer to the fact that kernel ridge regression (KRR) fails to achieve the information-theoretical lower bound when the regression function is over-smooth, have been observed for almost 20 years and were rigorously proved recently for kernel ridge regression and some other spectral algorithms over a fixed dimensional domain. The main focus of this paper is to explore the saturation effects for a large class of spectral algorithms (including KRR, gradient descent, etc.) in large dimensional settings where $n \asymp d^{\gamma}$. More precisely, we first propose an improved minimax lower bound for the kernel regression problem in large dimensional settings and show that gradient flow with an early stopping strategy results in an estimator achieving this lower bound (up to a logarithmic factor). Similar to the results for KRR, we can further determine the exact convergence rates (both upper and lower bounds) of a large class of (optimally tuned) spectral algorithms with different qualifications $\tau$. In particular, we find that these exact rate curves (varying along $\gamma$) exhibit the periodic plateau behavior and the polynomial approximation barrier. Consequently, we can fully depict the saturation effects of the spectral algorithms and reveal a new phenomenon in large dimensional settings (i.e., the saturation effect occurs in the large dimensional setting as long as the source condition satisfies $s>\tau$, while it occurs in the fixed dimensional setting as long as $s>2\tau$).
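As a rough illustration of what "spectral algorithm" means in this abstract, the sketch below writes KRR and gradient flow as filter functions applied to the eigenvalues of the normalized kernel matrix $K/n$. The Gaussian kernel, the synthetic data, and all function names are assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def krr_filter(z, lam):
    """Tikhonov filter used by KRR: phi(z) = 1 / (z + lam)."""
    return 1.0 / (z + lam)

def gradient_flow_filter(z, t):
    """Gradient-flow filter: phi(z) = (1 - exp(-t z)) / z, with phi(0) = t."""
    z = np.asarray(z, dtype=float)
    positive = z > 1e-12
    safe_z = np.where(positive, z, 1.0)  # placeholder value avoids division by zero
    return np.where(positive, -np.expm1(-t * safe_z) / safe_z, t)

def spectral_coefficients(K, y, filt):
    """Return alpha such that f_hat(x) = sum_i alpha_i * k(x, x_i)."""
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)
    evals = np.clip(evals, 0.0, None)  # remove tiny negative round-off
    return evecs @ (filt(evals) * (evecs.T @ y)) / n

# Synthetic toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(30)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)  # Gaussian kernel matrix (assumed kernel)

lam = 0.1
alpha_krr = spectral_coefficients(K, y, lambda z: krr_filter(z, lam))
# Stopping time t plays the role of 1 / lam for gradient flow.
alpha_gf = spectral_coefficients(K, y, lambda z: gradient_flow_filter(z, 1.0 / lam))
```

With the Tikhonov filter, the coefficients coincide with the usual closed form $(K + n\lambda I)^{-1} y$. The two filters differ in their residuals $1 - z\varphi(z)$: $\lambda/(z+\lambda)$ for KRR versus $e^{-tz}$ for gradient flow, which is the standard reason KRR is said to have qualification 1 while gradient flow has unbounded qualification.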
AutoQML: A Framework for Automated Quantum Machine Learning
Roth, Marco, Kreplin, David A., Basilewitsch, Daniel, Bravo, João F., Klau, Dennis, Marinov, Milan, Pranjic, Daniel, Stuehler, Horst, Willmann, Moritz, Zöller, Marc-André
Automated Machine Learning (AutoML) has significantly advanced the efficiency of ML-focused software development by automating hyperparameter optimization and pipeline construction, reducing the need for manual intervention. Quantum Machine Learning (QML) offers the potential to surpass classical machine learning (ML) capabilities by utilizing quantum computing. However, the complexity of QML presents substantial entry barriers. We introduce \emph{AutoQML}, a novel framework that adapts the AutoML approach to QML, providing a modular and unified programming interface to facilitate the development of QML pipelines. AutoQML leverages the QML library sQUlearn to support a variety of QML algorithms. The framework is capable of constructing end-to-end pipelines for supervised learning tasks, ensuring accessibility and efficacy. We evaluate AutoQML across four industrial use cases, demonstrating its ability to generate high-performing QML pipelines that are competitive with both classical ML models and manually crafted quantum solutions.
Forecasting Monthly Residential Natural Gas Demand Using Just-In-Time-Learning Modeling
Alakent, Burak, Isikli, Erkan, Kadaifci, Cigdem, Taspinar, Tonguc S.
ABSTRACT Natural gas (NG) is a relatively clean source of energy, particularly compared to fossil fuels, and worldwide consumption of NG has been increasing almost linearly in the last two decades. A similar trend can also be seen in Turkey, while another similarity is the high dependence on imports for the continuous NG supply. It is crucial to accurately forecast future NG demand (NGD) in Turkey, especially for import contracts; in this respect, forecasts of monthly NGD for the following year are of utmost importance. In the current study, the historical monthly NG consumption data between 2014 and 2024 provided by SOCAR, the local residential NG distribution company for two cities in Turkey, Bursa and Kayseri, was used to determine out-of-sample monthly NGD forecasts for a period of one year and nine months using various time series models, including SARIMA and ETS models, and a novel proposed machine learning method. The proposed method, named Just-in-Time-Learning-Gaussian Process Regression (JITL-GPR), uses a novel feature representation for the past NG demand values; instead of using past demand values as column-wise separate features, they are placed on a two-dimensional (2-D) grid of year-month values. For each test point, a kernel function, tailored for the NGD predictions, is used in GPR to predict the query point. Since a model is constructed separately for each test point, the proposed method is, indeed, an example of JITL. The JITL-GPR method is easy to use and optimize, and offers a reduction in forecast errors compared to traditional time series methods and a state-of-the-art combination model; therefore, it is a promising tool for NGD forecasting in similar settings. INTRODUCTION In the last few decades, there has been a shift in energy sources from fossil fuels to cleaner energy sources, such as wind and solar energy, mainly due to environmental concerns and related government regulations.
However, these latter sources are dependent on weather conditions and require integration with grid technologies for continuous power generation. Natural gas (NG) typically consists of (up to) ~95% methane and 2-2.5% ethane-hexane+, with the remainder consisting of nitrogen, CO NG power plants are easy to build and highly reliable, making them invaluable for "clean" energy production. On the other hand, most countries depend on imports to maintain their NG supplies, and there is a delicate balance between imports and domestic demand. Storing excess imported gas above actual demand is difficult and would result in economic losses, while importing less than actual demand could result in a nationwide shortage.
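The year-month grid idea from the abstract can be sketched roughly as follows. The abstract does not specify the tailored kernel or how local training data are selected, so everything here is an assumption: a product of an RBF kernel over years and a periodic kernel over months, a hypothetical `window` of recent years as the just-in-time training set, and a synthetic demand series standing in for the real data.

```python
import numpy as np

def year_month_kernel(a, b, year_scale=3.0, month_scale=1.0, amplitude=1.0):
    """Assumed kernel on (year, month) points: RBF in year times periodic in month."""
    dy = a[:, None, 0] - b[None, :, 0]
    dm = a[:, None, 1] - b[None, :, 1]
    k_year = np.exp(-0.5 * (dy / year_scale) ** 2)
    k_month = np.exp(-2.0 * np.sin(np.pi * dm / 12.0) ** 2 / month_scale ** 2)
    return amplitude * k_year * k_month

def jitl_gpr_predict(query, grid, demand, window=5, noise=1.0):
    """Fit a GP on the `window` most recent years before the query, then predict one point."""
    recent = (grid[:, 0] < query[0]) & (grid[:, 0] >= query[0] - window)
    X_local, y_local = grid[recent], demand[recent]
    mean, amp = y_local.mean(), y_local.var()
    K = year_month_kernel(X_local, X_local, amplitude=amp) + noise * np.eye(len(y_local))
    k_star = year_month_kernel(query[None, :], X_local, amplitude=amp)
    # GP posterior mean: prior mean plus k_* (K + noise I)^{-1} (y - mean)
    return float(mean + (k_star @ np.linalg.solve(K, y_local - mean))[0])

# Synthetic demand on a year-month grid: linear trend plus winter peak (not real data).
years, months = np.meshgrid(np.arange(2014, 2023), np.arange(1, 13), indexing="ij")
grid = np.column_stack([years.ravel(), months.ravel()]).astype(float)
demand = 100.0 + 5.0 * (grid[:, 0] - 2014) + 50.0 * np.cos(2 * np.pi * (grid[:, 1] - 1) / 12)

pred_jan = jitl_gpr_predict(np.array([2023.0, 1.0]), grid, demand)
pred_jul = jitl_gpr_predict(np.array([2023.0, 7.0]), grid, demand)
```

Because a separate model is fitted inside `jitl_gpr_predict` for every query point, the function mirrors the just-in-time character described in the abstract, although the actual local-selection rule and kernel may differ.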
Learning Conditional Average Treatment Effects in Regression Discontinuity Designs using Bayesian Additive Regression Trees
Alcantara, Rafael, Hahn, P. Richard, Carvalho, Carlos, Lopes, Hedibert
Regression discontinuity designs (RDDs) arise when treatment assignment is based on whether a particular covariate -- referred to as the running variable -- lies above or below a known value, referred to as the cutoff value. Because treatment is deterministically assigned as a known function of the running variable, RDDs are trivially deconfounded: treatment assignment is independent of the outcome variable, given the running variable (because treatment is conditionally constant). However, estimation of treatment effects in RDDs is more complicated than simply controlling for the running variable, because doing so introduces a complete lack of overlap, which is the other key condition needed to justify regression adjustment for causal inference. Nonetheless, treatment effects at the cutoff may still be identified. Specifically, it is well-known that treatment effects at the cutoff can be estimated from RDDs as the magnitude of a discontinuity in the conditional mean response function at that point (Hahn et al., 2001). This paper investigates the use of Bayesian additive regression tree models (Chipman et al., 2010; Hahn et al., 2020) for the purpose of estimating conditional average treatment effects (CATE) at the cutoff, conditional on observed covariates other than the running variable. To the best of our knowledge, such data-driven CATE estimation has not been a focus of the existing RDD literature and we are the first to propose BART for this purpose.
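To make the identification idea concrete, the sketch below estimates the jump in the conditional mean at the cutoff with a classical local linear fit on each side. This is not the BART-based method the paper proposes; the function name, the bandwidth, and the noise-free synthetic data are assumptions chosen only to illustrate the discontinuity estimate.

```python
import numpy as np

def local_linear_rdd(x, y, cutoff=0.0, bandwidth=1.0):
    """Difference of local linear intercepts on either side of the cutoff."""
    def intercept_at_cutoff(mask):
        # Regress y on (1, x - cutoff); the intercept is the fitted mean at the cutoff.
        design = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        return coef[0]

    near = np.abs(x - cutoff) <= bandwidth
    right = intercept_at_cutoff(near & (x >= cutoff))
    left = intercept_at_cutoff(near & (x < cutoff))
    return right - left

# Synthetic example: treatment switches on at x >= 0 and shifts the mean by 2.
x = np.linspace(-1.0, 1.0, 201)
treated = x >= 0.0
y = 1.0 + 0.5 * x + 2.0 * treated
effect = local_linear_rdd(x, y, cutoff=0.0, bandwidth=0.5)
```

On this noise-free linear example the estimate recovers the built-in jump of 2. A covariate-conditional (CATE) version would need the fit to vary with the additional covariates, which is the gap the abstract says BART is meant to fill.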
Controlled Model Debiasing through Minimal and Interpretable Updates
Di Gennaro, Federico, Laugel, Thibault, Grari, Vincent, Detyniecki, Marcin
Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, generally without accounting for potentially existing previous models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between the new fair model and the existing one should be (i) interpretable and (ii) minimal. After providing theoretical guarantees for this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce (i) minimal and (ii) interpretable changes between biased and debiased predictions--a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in fairness literature. Our approach combines a concept-based architecture and adversarial learning and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while performing minimal and interpretable prediction changes. 1 Introduction The increasing adoption of machine learning models in high-stakes domains--such as criminal justice (Kleinberg et al., 2016) and credit lending (Bruckner, 2018)--has raised significant concerns about the potential biases that these models may reproduce and amplify, particularly against historically marginalized groups. Recent public discourse, along with regulatory developments such as the European AI Act (2024/1689), has further underscored the need for adapting AI systems to ensure fairness and trustworthiness (Bringas Colmenarejo et al., 2022).
Consequently, many of the machine learning models deployed by organizations are, or may soon be, subject to these emerging regulatory requirements. Yet, such organizations frequently invest significant resources (e.g. The field of algorithmic fairness has experienced rapid growth in recent years, with numerous bias mitigation strategies proposed (Romei & Ruggieri, 2014; Mehrabi et al., 2021). These approaches can be broadly categorized into three types: pre-processing (e.g., Belrose et al., 2024), in-processing (e.g., Zhang et al., 2018), and post-processing (e.g., Kamiran et al., 2010), based on the stage of the machine learning pipeline at which fairness is enforced. While the two former categories do not account at all for any pre-existing biased model being available for the task, post-processing approaches aim to impose fairness by directly modifying the predictions of a biased classifier.
LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping
Schmidinger, J., Vogel, S., Barkov, V., Pham, A. -D., Gebbers, R., Tavakoli, H., Correa, J., Tavares, T. R., Filippi, P., Jones, E. J., Lukas, V., Boenecke, E., Ruehlmann, J., Schroeter, I., Kramer, E., Paetzold, S., Kodaira, M., Wadoux, A. M. J. -C., Bragazza, L., Metzger, K., Huang, J., Valente, D. S. M., Safanelli, J. L., Bottega, E. L., Dalmolin, R. S. D., Farkas, C., Steiger, A., Horst, T. Z., Ramirez-Lopez, L., Scholten, T., Stumpf, F., Rosso, P., Costa, M. M., Zandonadi, R. S., Wetterlind, J., Atzmueller, M.
Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially misleading conclusions. To address these issues, we introduce an open-access dataset collection called Precision Liming Soil Datasets (LimeSoDa). LimeSoDa consists of 31 field- and farm-scale datasets from various countries. Each dataset has three target soil properties: (1) soil organic matter or soil organic carbon, (2) clay content and (3) pH, alongside a set of features. Features are dataset-specific and were obtained by optical spectroscopy, proximal- and remote soil sensing. All datasets were aligned to a tabular format and are ready-to-use for modeling. We demonstrated the use of LimeSoDa for benchmarking by comparing the predictive performance of four learning algorithms across all datasets. This comparison included multiple linear regression (MLR), support vector regression (SVR), categorical boosting (CatBoost) and random forest (RF). The results showed that although no single algorithm was universally superior, certain algorithms performed better in specific contexts. MLR and SVR performed better on high-dimensional spectral datasets, likely due to better compatibility with principal components. In contrast, CatBoost and RF exhibited considerably better performances when applied to datasets with a moderate number (< 20) of features. These benchmarking results illustrate that the performance of a method is highly context-dependent. LimeSoDa therefore provides an important resource for improving the development and evaluation of statistical methods in DSM.
Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula
Vilucchio, Matteo, Dandi, Yatin, Gerbelot, Cedric, Krzakala, Florent
The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provided remarkable predictions for the behavior of high-dimensional optimization problems, but rigorously establishing their validity for non-convex problems has remained a fundamental challenge. In this work, we address this challenge by developing a systematic framework that rigorously proves replica-symmetric formulas for non-convex GLMs and precisely determines the conditions under which these formulas are valid. Remarkably, the rigorous replica-symmetric predictions align exactly with the conjectures made by physicists, and the so-called replicon condition. The originality of our approach lies in connecting two powerful theoretical tools: the Gaussian Min-Max Theorem, which we use to provide precise lower bounds, and Approximate Message Passing (AMP), which is shown to achieve these bounds algorithmically. We demonstrate the utility of this framework through significant applications: (i) proving the optimality of the Tukey loss over the more commonly used Huber loss under an $\varepsilon$-contaminated data model, (ii) establishing the optimality of negative regularization in high-dimensional non-convex regression and (iii) characterizing the performance limits of linearized AMP algorithms. By rigorously validating statistical physics predictions in non-convex settings, we aim to open new pathways for analyzing increasingly complex optimization landscapes beyond the convex regime.
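For readers unfamiliar with the two robust losses compared in application (i), the sketch below evaluates their standard textbook definitions. The tuning constants 1.345 and 4.685 are the conventional defaults from robust statistics and are assumptions here; the abstract does not say which values the paper uses.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Convex Huber loss: quadratic near zero, linear in the tails."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def tukey_loss(r, c=4.685):
    """Non-convex Tukey biweight loss: saturates at c**2 / 6 for |r| > c."""
    r = np.asarray(r, dtype=float)
    inside = (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return np.where(np.abs(r) <= c, inside, c**2 / 6.0)

residuals = np.array([0.0, 1.0, 5.0, 50.0])
huber_values = huber_loss(residuals)   # keeps growing linearly with |r|
tukey_values = tukey_loss(residuals)   # constant once |r| exceeds c
```

The Huber loss stays convex but penalizes a gross outlier without bound, whereas the Tukey loss is bounded and therefore non-convex, which is why its analysis falls in the non-convex regime the abstract targets.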
District Vitality Index Using Machine Learning Methods for Urban Planners
Marcoux, Sylvain, Dessureault, Jean-Sébastien
City leaders face critical decisions regarding budget allocation and investment priorities. How can they identify which city districts require revitalization? To address this challenge, a Current Vitality Index and a Long-Term Vitality Index are proposed. These indexes are based on a carefully curated set of indicators. Missing data is handled using K-Nearest Neighbors imputation, while Random Forest is employed to identify the most reliable and significant features. Additionally, k-means clustering is utilized to generate meaningful data groupings for enhanced monitoring of Long-Term Vitality. Current vitality is visualized through an interactive map, while Long-Term Vitality is tracked over 15 years with predictions made using Multilayer Perceptron or Linear Regression. The results, approved by urban planners, are already promising and helpful, with the potential for further improvement as more data becomes available. This paper proposes leveraging machine learning methods to optimize urban planning and enhance citizens' quality of life.
Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis
Ahbabi, Hamdan Al, Marti, Gautier, AlMarri, Saeed, Elfadel, Ibrahim
Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features.
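The residual construction described above can be sketched in a few lines. The abstract does not state the regression model, so ordinary least squares with an intercept is assumed here; the random arrays stand in for real speech and text embeddings, and all names are hypothetical.

```python
import numpy as np

def residual_embeddings(speech, text):
    """Residuals of an OLS regression of speech embeddings on text embeddings."""
    design = np.column_stack([np.ones(len(text)), text])  # intercept + text features
    coef, *_ = np.linalg.lstsq(design, speech, rcond=None)
    return speech - design @ coef

# Synthetic stand-ins: speech = linear function of text + a tone component + noise.
rng = np.random.default_rng(0)
n, d_text, d_speech = 200, 8, 16
text = rng.standard_normal((n, d_text))
tone = rng.integers(0, 2, size=n)              # binary tone label
tone_direction = rng.standard_normal(d_speech)
speech = (text @ rng.standard_normal((d_text, d_speech))
          + np.outer(tone, tone_direction)
          + 0.1 * rng.standard_normal((n, d_speech)))

residuals = residual_embeddings(speech, text)
```

By construction, least-squares residuals are uncorrelated with every text-embedding dimension, so whatever a linear map can predict from the text is removed while the tone component largely remains. A downstream classifier such as logistic regression would then be trained on `residuals` instead of `speech`.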
Design of Cavity Backed Slotted Antenna using Machine Learning Regression Model
Sutrakar, Vijay Kumar, PK, Anjana, Bisariya, Rohit, KK, Soumya, M, Gopal Chawan
In this paper, a regression-based machine learning model is used for the design of a cavity-backed slotted antenna. This type of antenna is commonly used in military and aviation communication systems. Initial reflection coefficient data for the cavity-backed slotted antenna are generated using an electromagnetic solver. These reflection coefficient data are then used as input for training the regression-based machine learning model. The model is trained to predict the dimensions of the cavity-backed slotted antenna based on the input reflection coefficient for a wide frequency band varying from 1 GHz to 8 GHz. This approach allows for rapid prediction of optimal antenna configurations, reducing the need for repeated physical testing and manual adjustments, which may lead to significant savings in design and development cost. The proposed model also demonstrates its versatility in predicting multi-frequency resonance across 1 GHz to 8 GHz. In addition, the proposed approach demonstrates the potential for leveraging machine learning in advanced antenna design, enhancing efficiency and accuracy in practical applications such as radar, military identification systems and secure communication networks.