Kullback-Leibler excess risk bounds for exponential weighted aggregation in generalized linear models

arXiv.org Machine Learning

Aggregation methods have emerged as a powerful and flexible framework in statistical learning, providing unified solutions across diverse problems such as regression, classification, and density estimation. In the context of generalized linear models (GLMs), where responses follow exponential family distributions, aggregation offers an attractive alternative to classical parametric modeling. This paper investigates the problem of sparse aggregation in GLMs, aiming to approximate the true parameter vector by a sparse linear combination of predictors. We prove that an exponential weighted aggregation scheme yields a sharp oracle inequality for the Kullback-Leibler risk with leading constant equal to one, while also attaining the minimax-optimal rate of aggregation. These results are further enhanced by establishing high-probability bounds on the excess risk.
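For intuition, here is a minimal numpy sketch of the exponential-weighting idea over a finite dictionary of candidate parameter vectors. The function names and the temperature `beta` are illustrative assumptions, not the paper's estimator, which specifies the prior, risk, and weighting precisely.

```python
import numpy as np

def exponential_weighted_aggregation(candidates, empirical_risk, beta=1.0):
    """Aggregate candidate parameter vectors with exponential weights.

    candidates     : (M, d) array of candidate parameter vectors
    empirical_risk : function mapping a parameter vector to its empirical risk
    beta           : temperature; larger beta concentrates weight on low-risk candidates
    """
    risks = np.array([empirical_risk(theta) for theta in candidates])
    # Exponential weights, shifted by the minimum risk for numerical stability.
    w = np.exp(-beta * (risks - risks.min()))
    w /= w.sum()
    # The aggregate is the weighted average of the candidates.
    return w @ candidates
```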


DUE: A Deep Learning Framework and Library for Modeling Unknown Equations

arXiv.org Machine Learning

Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown equations from measurement or simulation data. This paradigm shift, known as data-driven discovery or modeling, stands at the forefront of AI for science, with significant progress made in recent years. In this paper, we introduce a systematic framework for data-driven modeling of unknown equations using deep learning. This versatile framework is capable of learning unknown ODEs, PDEs, DAEs, IDEs, SDEs, reduced or partially observed systems, and non-autonomous differential equations. Based on this framework, we have developed Deep Unknown Equations (DUE), an open-source software package designed to facilitate the data-driven modeling of unknown equations using modern deep learning techniques. DUE serves as an educational tool for classroom instruction, enabling students and newcomers to gain hands-on experience with differential equations, data-driven modeling, and contemporary deep learning approaches such as FNN, ResNet, generalized ResNet, operator semigroup networks (OSG-Net), and Transformers. Additionally, DUE is a versatile and accessible toolkit for researchers across various scientific and engineering fields. It is applicable not only to learning unknown equations from data but also to surrogate modeling of known, yet complex, equations that are costly to solve using traditional numerical methods. We provide detailed descriptions of DUE and demonstrate its capabilities through diverse examples, which serve as templates that can be easily adapted for other applications.
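To make the flow-map idea concrete, here is a hedged PyTorch sketch of ResNet-style learning of an unknown ODE from snapshot pairs sampled at a fixed time step. It illustrates the general recipe only; the class and function names are assumptions, not DUE's actual API.

```python
import torch
import torch.nn as nn

class FlowMapResNet(nn.Module):
    """Fit x_{n+1} ~ x_n + f_theta(x_n) from trajectory snapshot pairs."""
    def __init__(self, dim, width=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, dim),
        )

    def forward(self, x):
        # Residual connection: the network learns the one-step increment.
        return x + self.f(x)

def train(model, x_now, x_next, epochs=500, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(x_now) - x_next) ** 2).mean()  # one-step prediction error
        loss.backward()
        opt.step()
    return model
```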


$\alpha$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models

arXiv.org Machine Learning

Recent efforts have extended the flow-matching framework to discrete generative modeling. One strand of models works directly with continuous probabilities instead of discrete tokens; we refer to this family as Continuous-State Discrete Flow Matching (CS-DFM). Existing CS-DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS-DFM models, under which the existing variants can be understood as operating on different $\alpha$-representations of probabilities. Building upon the theory of information geometry, we introduce $\alpha$-Flow, a family of CS-DFM models that adheres to the canonical $\alpha$-geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow-matching loss for $\alpha$-flow establishes a unified variational bound on the discrete negative log-likelihood. We comprehensively evaluate different instantiations of $\alpha$-flow across discrete generation domains, including intermediate values of $\alpha$ whose geometries have not been explored before, to demonstrate their effectiveness in discrete generative modeling. $\alpha$-flow significantly outperforms its discrete-state counterpart in image and protein sequence generation and better captures the entropy in language modeling.
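For reference, the canonical $\alpha$-representation from information geometry (Amari's convention; the paper's exact parametrization may differ) is

$$
\ell_\alpha(p) =
\begin{cases}
\dfrac{2}{1-\alpha}\, p^{(1-\alpha)/2}, & \alpha \neq 1,\\[4pt]
\log p, & \alpha = 1,
\end{cases}
$$

so that $\alpha = -1$ recovers the mixture (probability) representation, $\alpha = 0$ the square-root (spherical) one, and $\alpha = 1$ the exponential (log) one.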


Conditional Distribution Compression via the Kernel Conditional Mean Embedding

arXiv.org Machine Learning

Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a natural metric for comparing conditional distributions. We then derive a consistent estimator for the AMCMD and establish its rate of convergence. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$. Building on this, we extend the idea of KH to develop Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm that constructs a compressed set targeting the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), a straightforward adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression (via JKH and JKIP) and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.
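As a rough illustration of the herding template these methods build on, here is a hedged numpy sketch of Joint Kernel Herding over a finite candidate pool with a product kernel. For clarity it uses the full $\mathcal{O}(n^2)$ kernel matrix, whereas the paper's AMCMD construction reduces the cost to $\mathcal{O}(n)$; all names and bandwidths are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def joint_kernel_herding(X, Y, m, gamma_x=1.0, gamma_y=1.0):
    """Greedy JKH sketch: pick m indices whose empirical joint embedding
    (under a product kernel) tracks that of the full labelled sample."""
    Y = Y.reshape(len(Y), -1)
    K = rbf(X, X, gamma_x) * rbf(Y, Y, gamma_y)  # joint product kernel
    mu = K.mean(axis=1)                          # full-sample mean embedding
    selected, running = [], np.zeros(len(X))
    for t in range(m):
        # Herding trade-off: high mass under mu, low redundancy with prior picks.
        j = int(np.argmax(mu - running / (t + 1)))
        selected.append(j)
        running += K[:, j]
    return selected
```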


Towards Weaker Variance Assumptions for Stochastic Optimization

arXiv.org Machine Learning

We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.
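Concretely, the growth condition in question can be stated as follows (a hedged paraphrase; the paper's exact constants and reference point may differ):

$$
\mathbb{E}\,\|g(x)\|^2 \;\le\; c_0 + c_1\|x\|^2 \quad \text{for all } x,
$$

where $g(x)$ is the stochastic (sub)gradient at $x$ and $c_0, c_1 \ge 0$ are constants.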


Optimal sparse phase retrieval via a quasi-Bayesian approach

arXiv.org Machine Learning

This paper addresses the problem of sparse phase retrieval, a fundamental inverse problem in applied mathematics, physics, and engineering, where a signal needs to be reconstructed using only the magnitude of its transformation while phase information remains inaccessible. Leveraging the inherent sparsity of many real-world signals, we introduce a novel sparse quasi-Bayesian approach and provide the first theoretical guarantees for such an approach. Specifically, we employ a scaled Student distribution as a continuous shrinkage prior to enforce sparsity and analyze the method using the PAC-Bayesian inequality framework. Our results establish that the proposed Bayesian estimator achieves minimax-optimal convergence rates under sub-exponential noise, matching those of state-of-the-art frequentist methods. To ensure computational feasibility, we develop an efficient Langevin Monte Carlo sampling algorithm. Through numerical experiments, we demonstrate that our method performs comparably to existing frequentist techniques, highlighting its potential as a principled alternative for sparse phase retrieval in noisy settings.
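As a sketch of the sampling component, the following numpy code implements a plain unadjusted Langevin algorithm targeting a generic (quasi-)posterior $\pi(\theta) \propto \exp(-U(\theta))$. The paper's actual potential, step size, and schedule are assumptions here.

```python
import numpy as np

def langevin_sampler(grad_U, theta0, step=1e-4, n_iter=10_000, rng=None):
    """Unadjusted Langevin algorithm targeting pi(theta) ~ exp(-U(theta)).

    grad_U : gradient of the negative log-(quasi-)posterior
    theta0 : initial parameter vector
    """
    rng = np.random.default_rng(rng)
    theta = np.array(theta0, dtype=float)
    samples = []
    for _ in range(n_iter):
        noise = rng.standard_normal(theta.shape)
        # Gradient step plus Gaussian noise scaled to target pi.
        theta = theta - step * grad_U(theta) + np.sqrt(2 * step) * noise
        samples.append(theta.copy())
    return np.array(samples)
```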


Constants of motion network revisited

arXiv.org Machine Learning

Discovering constants of motion helps in understanding dynamical systems, but doing so analytically demands proficient mathematical skill and keen insight. With the rise of deep learning, neural-network methods such as the Constant Of Motion nETwork (COMET) are promising for this scientific problem. Although COMET produces better predictions of the dynamics by exploiting the discovered constants of motion, there is still considerable room for improvement. In this paper, we propose a novel neural network architecture, built using the singular value decomposition (SVD) technique, and a two-phase training algorithm to improve the performance of COMET. Extensive experiments show that our approach retains the advantages of COMET, such as applicability to non-Hamiltonian systems and the ability to indicate the number of constants of motion, while being more lightweight and noise-robust than COMET.
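As background, the defining property such networks exploit (restated here in generic notation, not this paper's new architecture) is that a quantity $c_i$ is a constant of motion for dynamics $\dot{x} = f(x)$ exactly when its time derivative vanishes along trajectories:

$$
\frac{\mathrm{d}c_i}{\mathrm{d}t} \;=\; \nabla_x c_i(x)^\top \dot{x} \;=\; 0,
\qquad i = 1, \dots, n_c,
$$

so training can penalize the inner product between each learned constant's gradient and the predicted dynamics.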


aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis

arXiv.org Machine Learning

We introduce aweSOM, an open-source Python package for machine learning (ML) clustering and classification, using a Self-Organizing Map (SOM) algorithm that incorporates CPU/GPU acceleration to accommodate large ($N > 10^6$, where $N$ is the number of data points), multidimensional datasets. aweSOM consists of two main modules, one that handles the initialization and training of the SOM, and another that stacks the results of multiple SOM realizations to obtain more statistically robust clusters. Existing Python-based SOM implementations (e.g., POPSOM, Yuan (2018); MiniSom, Vettigli (2018); sklearn-som) primarily serve as proof-of-concept demonstrations, optimized for smaller datasets but lacking scalability for large, multidimensional data. aweSOM addresses this gap in capability, with good performance scaling up to $\sim 10^8$ individual points, and is capable of utilizing multiple features per point. We compare the code performance against the legacy implementations it is based on and find a 10-100x speedup, as well as significantly improved memory efficiency, due to several built-in optimizations.
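For readers new to SOMs, here is a minimal numpy sketch of one online SOM update. It illustrates the generic algorithm only; the function signature is illustrative and is not aweSOM's API.

```python
import numpy as np

def som_step(weights, grid, x, lr=0.1, sigma=1.0):
    """One online SOM update.

    weights : (n_nodes, d) codebook vectors
    grid    : (n_nodes, 2) lattice coordinates of each node
    x       : (d,) input sample
    """
    # Best-matching unit: node whose codebook vector is closest to x.
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    # Gaussian neighborhood on the lattice around the BMU.
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))
    # Pull codebook vectors toward x, weighted by neighborhood strength.
    weights += lr * h[:, None] * (x - weights)
    return weights
```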


Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand

arXiv.org Machine Learning

In this paper, we study the offline sequential feature-based pricing and inventory control problem where the current demand depends on past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage the offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamics without censoring can be modeled as a Markov decision process (MDP), the primary obstacle arises from the observed process where demand censoring is present, resulting in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving a specialized Bellman equation tailored for this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms to solve these Bellman equations and, thus, estimate the optimal policy. Furthermore, we establish finite sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at https://github.com/gundemkorel/Inventory_Pricing_Control
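To illustrate the two core ingredients, censoring and state augmentation, here is a hedged numpy sketch. The helper names and the stacking construction are hypothetical simplifications of the paper's high-order MDP, shown only to fix ideas.

```python
import numpy as np

def observe_sales(demand, inventory):
    # Lost-sales censoring: only min(demand, inventory) is observed.
    sales = np.minimum(demand, inventory)
    censored = demand > inventory  # True when the true demand is hidden
    return sales, censored

def stacked_state(history, k):
    # Hypothetical high-order state: concatenate the last k observation
    # tuples (price, order, inventory, covariates, sales) so the augmented
    # process is approximately Markov despite censoring.
    return np.concatenate([np.asarray(h, dtype=float).ravel()
                           for h in history[-k:]])
```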


Ordinary Least Squares as an Attention Mechanism

arXiv.org Machine Learning

I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. It falls into place when OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, we equivalently learn optimal encoding and decoding operations for predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.
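The claimed equivalence is easy to verify numerically: the OLS prediction at a query point equals a linear "attention" readout in which the scores are inner products between the query and the training regressors in the metric induced by $(X^\top X)^{-1}$, and the values are the training targets. A short numpy check (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))                 # training regressors (keys)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_q = rng.standard_normal(d)                    # test point (query)

# Standard OLS prediction.
beta = np.linalg.solve(X.T @ X, X.T @ y)
pred_ols = x_q @ beta

# Attention view: similarity scores between the query and every training
# key, computed in the metric induced by (X^T X)^{-1}; the values are the
# training targets y, combined linearly by the scores.
scores = x_q @ np.linalg.solve(X.T @ X, X.T)    # shape (n,)
pred_attn = scores @ y

assert np.allclose(pred_ols, pred_attn)
```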