 Bayesian Learning


Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors observed in LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.


MARAuder's Map: Motion-Aware Real-time Activity Recognition with Layout-Based Trajectories

arXiv.org Artificial Intelligence

Ambient sensor-based human activity recognition (HAR) in smart homes remains challenging due to the need for real-time inference, spatially grounded reasoning, and context-aware temporal modeling. Existing approaches often rely on pre-segmented, within-activity data and overlook the physical layout of the environment, limiting their robustness in continuous, real-world deployments. In this paper, we propose MARAuder's Map, a novel framework for real-time activity recognition from raw, unsegmented sensor streams. Our method projects sensor activations onto the physical floorplan to generate trajectory-aware, image-like sequences that capture the spatial flow of human movement. These representations are processed by a hybrid deep learning model that jointly captures spatial structure and temporal dependencies. To enhance temporal awareness, we introduce a learnable time embedding module that encodes contextual cues such as hour-of-day and day-of-week. Additionally, an attention-based encoder selectively focuses on informative segments within each observation window, enabling accurate recognition even under cross-activity transitions and temporal ambiguity. Extensive experiments on multiple real-world smart home datasets demonstrate that our method outperforms strong baselines, offering a practical solution for real-time HAR in ambient sensor environments.


Automatic Extraction of Road Networks by using Teacher-Student Adaptive Structural Deep Belief Network and Its Application to Landslide Disaster

arXiv.org Artificial Intelligence

Abstract: An adaptive structural learning method for the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) has been developed as a prominent deep learning model. The neuron generation-annihilation algorithm in the RBM and the layer generation algorithm in the DBN build an optimal network structure for the given input during learning. In this paper, our model is applied to RoadTracer, an automatic recognition method for road network systems. Because road maps contain many complicated features, detection requires a model with high representational power, so we propose a novel RoadTracer method that uses a Teacher-Student based ensemble learning model of the Adaptive DBN. In the experiments, the detection accuracy of the proposed model improved from 40.0% to 89.0% on average across the seven major cities in the test dataset. In addition, we applied our method to detecting available roads when a landslide caused by a natural disaster occurs, in order to rapidly secure a means of transportation. For fast inference, a small version of the trained model was implemented on a small embedded edge device as lightweight deep learning.
Recently there have been more cases of extreme climate events, including unexpected and unusual weather. These events have received attention in the last few years due to the significant loss of human lives and escalating economic costs, as well as their impacts on landslides and changes in ecosystems. In Japan, the Japan Meteorological Agency (JMA) issues a "Climate Change Monitoring Report" every year to report the latest status of climate change. According to [1], during the Heavy Rain Event of July 2018, Japan experienced unprecedented heavy rainfall. Overall precipitation observed at AMeDAS stations throughout Japan in July 2018 was extremely high in comparison with past heavy rainfall events since 1982. A prominent characteristic of this rain event is that record-breaking local precipitation, particularly within 48 to 72 hours, was observed extensively over western Japan and the Tokyo region, including the Seto Inland Sea side of the Chugoku and Shikoku regions. In addition, lifelines such as water supply and communications were damaged, and traffic obstacles occurred over a wide area. Due to the disruption of major roads and railroads, the supply was also suspended.
S. Kamada is with Hiroshima City University, Hiroshima, Japan. T. Ichimura is with Prefectural University of Hiroshima, Hiroshima, Japan.


Approximate Bayesian inference for cumulative probit regression models

arXiv.org Machine Learning

Ordinal categorical data are routinely encountered in a wide range of practical applications. When the primary goal is to construct a regression model for ordinal outcomes, cumulative link models represent one of the most popular choices to link the cumulative probabilities of the response with a set of covariates through a parsimonious linear predictor, shared across response categories. When the number of observations grows, standard sampling algorithms for Bayesian inference scale poorly, making posterior computation increasingly challenging in large datasets. In this article, we propose three scalable algorithms for approximating the posterior distribution of the regression coefficients in cumulative probit models relying on Variational Bayes and Expectation Propagation. We compare the proposed approaches with inference based on Markov Chain Monte Carlo, demonstrating superior computational performance and remarkable accuracy; finally, we illustrate the utility of the proposed algorithms on a challenging case study to investigate the structure of a criminal network.
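As a rough illustration of the model class discussed above, the sketch below computes category probabilities under a cumulative probit link. The parameterization P(y <= k | x) = Phi(alpha_k - x'beta), the function names, and the numeric values are assumptions chosen for illustration; they are not taken from the paper, and the approximate inference algorithms it proposes are not shown.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probabilities(x, beta, cutpoints):
    """Return P(y = k | x) for k = 0..K-1 under a cumulative probit model.

    x, beta   : covariates and coefficients (one linear predictor shared
                across all response categories)
    cutpoints : increasing thresholds alpha_1 < ... < alpha_{K-1}
                (sign convention assumed: P(y <= k) = Phi(alpha_k - x'beta))
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    # Cumulative probabilities, padded with 0 below and 1 above.
    cum = [0.0] + [std_normal_cdf(a - eta) for a in cutpoints] + [1.0]
    # Each category probability is a difference of adjacent cumulative values.
    return [cum[k + 1] - cum[k] for k in range(len(cum) - 1)]


# Hypothetical example: two covariates, four ordered categories.
probs = category_probabilities(x=[1.0, -0.5], beta=[0.8, 0.3],
                               cutpoints=[-1.0, 0.0, 1.5])
print(probs)
```

Because every category shares the same linear predictor, only the cutpoints distinguish the categories, which is what makes the predictor parsimonious.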


Wasserstein-Cramér-Rao Theory of Unbiased Estimation

arXiv.org Machine Learning

The quantity of interest in the classical Cramér-Rao theory of unbiased estimation (e.g., the Cramér-Rao lower bound, its exact attainment for exponential families, and asymptotic efficiency of maximum likelihood estimation) is the variance, which represents the instability of an estimator when its value is compared to the value for an independently-sampled data set from the same distribution. In this paper we are interested in a quantity which represents the instability of an estimator when its value is compared to the value for an infinitesimal additive perturbation of the original data set; we refer to this as the "sensitivity" of an estimator. The resulting theory of sensitivity is based on the Wasserstein geometry in the same way that the classical theory of variance is based on the Fisher-Rao (equivalently, Hellinger) geometry, and this insight allows us to determine a collection of results which are analogous to the classical case: a Wasserstein-Cramér-Rao lower bound for the sensitivity of any unbiased estimator, a characterization of models in which there exist unbiased estimators achieving the lower bound exactly, and some concrete results that show that the Wasserstein projection estimator achieves the lower bound asymptotically. We use these results to treat many statistical examples, sometimes revealing new optimality properties for existing estimators and other times revealing entirely new estimators.
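For orientation, the classical bound that the abstract takes as its starting point can be written as follows for a scalar parameter. The notation (an unbiased estimator \hat{\theta}_n built from n i.i.d. samples with density p_\theta, and Fisher information I(\theta)) is assumed here for illustration and is not taken from the paper; the Wasserstein analogue it derives is not reproduced.

```latex
% Classical Cramer-Rao lower bound (scalar case, standard regularity conditions)
\operatorname{Var}_\theta\!\left(\hat{\theta}_n\right) \;\ge\; \frac{1}{n\, I(\theta)},
\qquad
I(\theta) \;=\; \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial \theta}\log p_\theta(X)\right)^{2}\right].
```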


Robust Causal Discovery under Imperfect Structural Constraints

arXiv.org Machine Learning

Robust causal discovery from observational data under imperfect prior knowledge remains a significant and largely unresolved challenge. Existing methods typically presuppose perfect priors or can only handle specific, pre-identified error types, and their performance degrades substantially when confronted with flawed constraints of unknown location and type. This decline arises because most of them rely on inflexible and biased thresholding strategies that may conflict with the data distribution. To overcome these limitations, we propose to harmonize knowledge and data through prior alignment and conflict resolution. First, we assess the credibility of imperfect structural constraints through a surrogate model, which then guides a sparse penalization term measuring the loss between the learned and constrained adjacency matrices. We theoretically prove that, under an ideal assumption, the knowledge-driven objective aligns with the data-driven objective. Furthermore, to resolve conflicts when this assumption is violated, we introduce a multi-task learning framework optimized via multi-gradient descent, jointly minimizing both objectives. Our proposed method is robust in both linear and nonlinear settings. Extensive experiments, conducted under diverse noise conditions and structural equation model types, demonstrate the effectiveness and efficiency of our method under imperfect structural constraints.


A Latent-Variable Formulation of the Poisson Canonical Polyadic Tensor Model: Maximum Likelihood Estimation and Fisher Information

arXiv.org Machine Learning

We establish parameter inference for the Poisson canonical polyadic (PCP) tensor model through a latent-variable formulation. Our approach exploits the observation that any random PCP tensor can be derived by marginalizing an unobservable random tensor of one dimension larger. The loglikelihood of this larger dimensional tensor, referred to as the "complete" loglikelihood, comprises multiple rank-one PCP loglikelihoods. Using this methodology, we first derive non-iterative maximum likelihood estimators for the PCP model and demonstrate that several existing algorithms for fitting non-negative matrix and tensor factorizations are Expectation-Maximization algorithms. Next, we derive the observed and expected Fisher information matrices for the PCP model. The Fisher information provides crucial insights into the well-posedness of the tensor model, such as the role that tensor rank plays in identifiability and indeterminacy. For the special case of rank-one PCP models, we demonstrate that these results are greatly simplified.


Epistemic Reject Option Prediction

arXiv.org Artificial Intelligence

In high-stakes applications, predictive models must not only produce accurate predictions but also quantify and communicate their uncertainty. Reject-option prediction addresses this by allowing the model to abstain when prediction uncertainty is high. Traditional reject-option approaches focus solely on aleatoric uncertainty, an assumption valid only when large training data makes the epistemic uncertainty negligible. However, in many practical scenarios, limited data makes this assumption unrealistic. This paper introduces the epistemic reject-option predictor, which abstains in regions of high epistemic uncertainty caused by insufficient data. Building on Bayesian learning, we redefine the optimal predictor as the one that minimizes expected regret -- the performance gap between the learned model and the Bayes-optimal predictor with full knowledge of the data distribution. The model abstains when the regret for a given input exceeds a specified rejection cost. To our knowledge, this is the first principled framework that enables learning predictors capable of identifying inputs for which the training data is insufficient to make reliable decisions.
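The abstention rule stated above (abstain when the regret for an input exceeds the rejection cost) can be sketched as a simple threshold. This is only a minimal illustration: the function name, the use of `None` to signal abstention, and the regret values are hypothetical, and how the paper estimates per-input regret from the Bayesian posterior is not shown.

```python
def predict_or_abstain(prediction, estimated_regret, rejection_cost):
    """Return the prediction, or None to signal abstention.

    estimated_regret : assumed to be supplied by some upstream estimator of
                       the gap to the Bayes-optimal predictor (hypothetical)
    rejection_cost   : the cost charged for abstaining
    """
    if estimated_regret > rejection_cost:
        return None  # abstain: the training data is judged insufficient here
    return prediction


# Hypothetical (prediction, estimated regret) pairs for three inputs.
cases = [("cat", 0.05), ("dog", 0.35), ("cat", 0.20)]
decisions = [predict_or_abstain(p, r, rejection_cost=0.2) for p, r in cases]
print(decisions)
```

In this sketch a regret exactly equal to the rejection cost still yields a prediction; whether ties should abstain is a design choice the abstract does not specify.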


Estimating Orbital Parameters of Direct Imaging Exoplanet Using Neural Network

arXiv.org Artificial Intelligence

In this work, we propose a new flow-matching Markov chain Monte Carlo (FM-MCMC) algorithm for estimating the orbital parameters of exoplanetary systems, especially those involving only one exoplanet. Compared to traditional methods that rely on random sampling within the Bayesian framework, our approach first leverages flow matching posterior estimation (FMPE) to efficiently constrain the prior range of physical parameters, and then employs MCMC to accurately infer the posterior distribution. For example, in the orbital parameter inference of beta Pictoris b, our model achieved a substantial speed-up while maintaining comparable accuracy, running 77.8 times faster than Parallel Tempered MCMC (PTMCMC) and 365.4 times faster than nested sampling. Moreover, our FM-MCMC method also attained the highest average log-likelihood among all approaches, demonstrating its superior sampling efficiency and accuracy. This highlights the scalability and efficiency of our approach, making it well-suited for processing the massive datasets expected from future exoplanet surveys. Beyond astrophysics, our methodology establishes a versatile paradigm for synergizing deep generative models with traditional sampling, which can be adopted to tackle complex inference problems in other fields, such as cosmology, biomedical imaging, and particle physics.


DL101 Neural Network Outputs and Loss Functions

arXiv.org Artificial Intelligence

The loss function used to train a neural network is strongly connected to its output layer from a statistical point of view. This technical report analyzes common activation functions for a neural network output layer, like linear, sigmoid, ReLU, and softmax, detailing their mathematical properties and their appropriate use cases. A strong statistical justification exists for the selection of the suitable loss function for training a deep learning model. This report connects common loss functions such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and various Cross-Entropy losses to the statistical principle of Maximum Likelihood Estimation (MLE). Choosing a specific loss function is equivalent to assuming a specific probability distribution for the model output, highlighting the link between these functions and the Generalized Linear Models (GLMs) that underlie network output layers. Additional scenarios of practical interest are also considered, such as alternative output encodings, constrained outputs, and distributions with heavy tails.
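The link between a loss function and an assumed output distribution can be checked numerically in the simplest case. The sketch below, with made-up data and hypothetical function names, shows that the negative log-likelihood of independent Gaussian outputs with fixed unit variance equals n/2 times the MSE plus a constant that does not depend on the predictions, so minimizing one minimizes the other.

```python
import math


def mse(y, mu):
    """Mean squared error between targets y and predictions mu."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)


def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of y under independent N(mu_i, sigma^2)."""
    n = len(y)
    squared_error = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)


# Illustrative targets and two candidate sets of predictions.
y = [1.0, 2.0, 3.5]
pred_a = [0.9, 2.2, 3.0]
pred_b = [1.5, 1.0, 4.0]

n = len(y)
const = 0.5 * n * math.log(2 * math.pi)  # prediction-independent term (sigma = 1)

# With sigma = 1: NLL = (n / 2) * MSE + const, for any predictions.
gap_a = gaussian_nll(y, pred_a) - 0.5 * n * mse(y, pred_a)
gap_b = gaussian_nll(y, pred_b) - 0.5 * n * mse(y, pred_b)
print(gap_a, gap_b, const)
```

Both gaps equal the same constant, so ranking predictions by MSE or by Gaussian likelihood gives the same ordering. Analogous identities hold for MAE with a Laplace likelihood and for cross-entropy with Bernoulli or categorical likelihoods.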