Goto

Collaborating Authors

 alt


Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

arXiv.org Machine Learning

We study the $(\varepsilon, ฮด)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/ฮด)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/ฮด)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.



Efficient Clustering in Stochastic Bandits

arXiv.org Machine Learning

We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on the synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.


DeepSalt: Bridging Laboratory and Satellite Spectra through Domain Adaptation and Knowledge Distillation for Large-Scale Soil Salinity Estimation

arXiv.org Artificial Intelligence

Soil salinization poses a significant threat to both ecosystems and agriculture because it limits plants' ability to absorb water and, in doing so, reduces crop productivity. This phenomenon alters the soil's spectral properties, creating a measurable relationship between salinity and light reflectance that enables remote monitoring. While laboratory spectroscopy provides precise measurements, its reliance on in-situ sampling limits scalability to regional or global levels. Conversely, hyperspectral satellite imagery enables wide-area observation but lacks the fine-grained interpretability of laboratory instruments. To bridge this gap, we introduce DeepSalt, a deep-learning-based spectral transfer framework that leverages knowledge distillation and a novel Spectral Adaptation Unit to transfer high-resolution spectral insights from laboratory-based spectroscopy to satellite-based hyperspectral sensing. Our approach eliminates the need for extensive ground sampling while enabling accurate, large-scale salinity estimation, as demonstrated through comprehensive empirical benchmarks. DeepSalt achieves significant performance gains over methods without explicit domain adaptation, underscoring the impact of the proposed Spectral Adaptation Unit and the knowledge distillation strategy. The model also effectively generalized to unseen geographic regions, explaining a substantial portion of the salinity variance.



Japanese AI developer Alt goes bust after accounting fraud

The Japan Times

Japanese artificial intelligence developer Alt, which revealed accounting irregularities recently, has filed for bankruptcy protection with the Tokyo District Court under the civil rehabilitation law. The court accepted the application, according to the Tokyo-based company's announcement Wednesday. Alt left debts totaling about 2.4 billion ( 16.1 million) and aims for its rehabilitation by finding a sponsor entity that will take over its operations. Alt's line of business includes a service to create meeting summaries using AI. The company went public on the Tokyo Stock Exchange's Growth section for startups in October 2024, 10 years after its establishment in 2014.


Policy Testing in Markov Decision Processes

arXiv.org Machine Learning

We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.


ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

arXiv.org Machine Learning

We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.


Cost-Aware Optimal Pairwise Pure Exploration

arXiv.org Machine Learning

Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure exploration, with a focus on identifying the pairwise relationships between targeted arm pairs. Moreover, unlike existing works that only optimize the stopping time (i.e., sample complexity), this work considers that arms are associated with potentially different costs and targets at optimizing the cumulative cost that occurred during learning. Under the general framework of pairwise pure exploration with arm-specific costs, a performance lower bound is derived. Then, a novel algorithm, termed CAET (Cost-Aware Pairwise Exploration Task), is proposed. CAET builds on the track-and-stop principle with a novel design to handle the arm-specific costs, which can potentially be zero and thus represent a very challenging case. Theoretical analyses prove that the performance of CAET approaches the lower bound asymptotically. Special cases are further discussed, including an extension to regret minimization, which is another major focus of MAB. The effectiveness and efficiency of CAET are also verified through experimental results under various settings.


Optimal Best Arm Identification with Post-Action Context

arXiv.org Artificial Intelligence

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a $\textit{post-action context}$ in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) $\textit{non-separator}$, where the reward depends on both the action and the context, and (ii) $\textit{separator}$, where the reward depends solely on the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. For the separator setting, we propose a novel sampling rule called $\textit{G-tracking}$, which uses the geometry of the context space to directly track the contexts rather than the actions. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.