Learning Graphical Models
SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning
Guo, Weiqi, Liu, Guanjun, Zhou, Ziyuan
Multi-Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state-only attacks or action-only attacks, but do not consider how to effectively joint them. Simply combining state and action perturbations such as randomly perturbing states and actions does not exploit their potential synergistic effects. In this paper, we propose the State-Action Joint Attack (SAJA) framework that has a good synergistic effects. SAJA consists of two important phases: (1) In the state attack phase, a multi-step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added into the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi-Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state-only or action-only attacks, and (2) existing state or action defense methods cannot defend its attacks.
EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
He, Yufei, Liu, Juncheng, Liu, Yue, Li, Yibo, Cao, Tri, Hu, Zhiyuan, Xu, Xinxing, Hooi, Bryan
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients-by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
Performance Evaluation of Ising and QUBO Variable Encodings in Boltzmann Machine Learning
Hasegawa, Yasushi, Ohzeki, Masayuki
We compare Ising ({-1,+1}) and QUBO ({0,1}) encodings for Boltzmann machine learning under a controlled protocol that fixes the model, sampler, and step size. Exploiting the identity that the Fisher information matrix (FIM) equals the covariance of sufficient statistics, we visualize empirical moments from model samples and reveal systematic, representation-dependent differences. QUBO induces larger cross terms between first- and second-order statistics, creating more small-eigenvalue directions in the FIM and lowering spectral entropy. This ill-conditioning explains slower convergence under stochastic gradient descent (SGD). In contrast, natural gradient descent (NGD)-which rescales updates by the FIM metric-achieves similar convergence across encodings due to reparameterization invariance. Practically, for SGD-based training, the Ising encoding provides more isotropic curvature and faster convergence; for QUBO, centering/scaling or NGD-style preconditioning mitigates curvature pathologies. These results clarify how representation shapes information geometry and finite-time learning dynamics in Boltzmann machines and yield actionable guidelines for variable encoding and preprocessing.
Safe Driving in Occluded Environments
Wang, Zhuoyuan, Jia, Tongyao, Rajborirug, Pharuj, Ramesh, Neeraj, Okuda, Hiroyuki, Suzuki, Tatsuya, Kar, Soummya, Nakahira, Yorie
Abstract--Ensuring safe autonomous driving in the presence of occlusions poses a significant challenge in its policy design. While existing model-driven control techniques based on set invariance can handle visible risks, occlusions create latent risks in which safety-critical states are not observable. Data-driven techniques also struggle to handle latent risks because direct mappings from risk-critical objects in sensor inputs to safe actions cannot be learned without visible risk-critical objects. Motivated by these challenges, in this paper, we propose a probabilistic safety certificate for latent risk. Our key technical enabler is the application of probabilistic invariance: It relaxes the strict observability requirements imposed by set-invariance methods that demand the knowledge of risk-critical states. The proposed techniques provide linear action constraints that confine the latent risk probability within tolerance. Such constraints can be integrated into model predictive controllers or embedded in data-driven policies to mitigate latent risks. The proposed method is tested using the CARLA simulator and compared with a few existing techniques. The theoretical and empirical analysis jointly demonstrate that the proposed methods assure long-term safety in real-time control in occluded environments without being overly conservative and with transparency to exposed risks. ISUAL occlusions impose significant challenges in the policy design of autonomous driving.
Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
Nayeem, Md., Tabrej, Md Shamse, Deb, Kabbojit Jit, Goswami, Shaonti, Hakim, Md. Azizul
Automatic Speech Recognition (ASR) has undergone a profound transformation over the past decade, driven by advances in deep learning. This survey provides a comprehensive overview of the modern era of ASR, charting its evolution from traditional hybrid systems, such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-HMMs (DNN-HMMs), to the now-dominant end-to-end neural architectures. We systematically review the foundational end-to-end paradigms: Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which established the groundwork for fully integrated speech-to-text systems. We then detail the subsequent architectural shift towards Transformer and Conformer models, which leverage self-attention to capture long-range dependencies with high computational efficiency. A central theme of this survey is the parallel revolution in training paradigms. We examine the progression from fully supervised learning, augmented by techniques like SpecAugment, to the rise of self-supervised learning (SSL) with foundation models such as wav2vec 2.0, which drastically reduce the reliance on transcribed data. Furthermore, we analyze the impact of largescale, weakly supervised models like Whisper, which achieve unprecedented robustness through massive data diversity. The paper also covers essential ecosystem components, including key datasets and benchmarks (e.g., LibriSpeech, Switchboard, CHiME), standard evaluation metrics (e.g., Word Error Rate), and critical considerations for real-world deployment, such as streaming inference, on-device efficiency, and the ethical imperatives of fairness and robustness. We conclude by outlining open challenges and future research directions.
HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
Van Katwyk, Peter, Bergen, Karianne J.
Uncertainty quantification is critical for ensuring robustness in high-stakes machine learning applications. We introduce HybridFlow, a modular hybrid architecture that unifies the modeling of aleatoric and epistemic uncertainty by combining a Conditional Masked Autoregressive normalizing flow for estimating aleatoric uncertainty with a flexible probabilistic predictor for epistemic uncertainty. The framework supports integration with any probabilistic model class, allowing users to easily adapt HybridFlow to existing architectures without sacrificing predictive performance. HybridFlow improves upon previous uncertainty quantification frameworks across a range of regression tasks, such as depth estimation, a collection of regression benchmarks, and a scientific case study of ice sheet emulation. We also provide empirical results of the quantified uncertainty, showing that the uncertainty quantified by HybridFlow is calibrated and better aligns with model error than existing methods for quantifying aleatoric and epistemic uncertainty. HybridFlow addresses a key challenge in Bayesian deep learning, unifying aleatoric and epistemic uncertainty modeling in a single robust framework.
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Ahn, Jaewoo, Kim, Junseo, Yun, Heeseung, Son, Jaehyeon, Park, Dongmin, Cho, Jaewoong, Kim, Gunhee
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
Asymptotically optimal reinforcement learning in Block Markov Decision Processes
van Vuren, Thomas, Sloothaak, Fiona, Wolf, Maarten G., Sanders, Jaron
The curse of dimensionality renders Reinforcement Learning (RL) impractical in many real-world settings with exponentially large state and action spaces. Yet, many environments exhibit exploitable structure that can accelerate learning. To formalize this idea, we study RL in Block Markov Decision Processes (BMDPs). BMDPs model problems with large observation spaces, but where transition dynamics are fully determined by latent states. Recent advances in clustering methods have enabled the efficient recovery of this latent structure. However, a regret analysis that exploits these techniques to determine their impact on learning performance remained open. We are now addressing this gap by providing a regret analysis that explicitly leverages clustering, demonstrating that accurate latent state estimation can indeed effectively speed up learning. Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first learns the latent structure through random exploration and then switches to an optimism-guided strategy adapted to the uncovered structure. This algorithm achieves a regret that is $O(\sqrt{T}+n)$ on a large class of BMDPs susceptible to clustering. Here, $T$ denotes the number of time steps, $n$ is the cardinality of the observation space, and the Landau notation $O(\cdot)$ holds up to constants and polylogarithmic factors. This improves the best prior bound, $O(\sqrt{T}+n^2)$, especially when $n$ is large. Moreover, we prove that no algorithm can achieve lower regret uniformly on this same class of BMDPs. This establishes that, on this class, the algorithm achieves asymptotic optimality.
Towards an Asymptotic Efficiency Theory on Regular Parameter Manifolds
Sun, Lvfang, Lin, Zhenhua, Liu, Lin
Asymptotic efficiency theory is one of the pillars in the foundations of modern mathematical statistics. Not only does it serve as a rigorous theoretical benchmark for evaluating statistical methods, but it also sheds light on how to develop and unify novel statistical procedures. For example, the calculus of influence functions has led to many important statistical breakthroughs in the past decades. Responding to the pressing challenge of analyzing increasingly complex datasets, particularly those with non-Euclidean/nonlinear structures, many novel statistical models and methods have been proposed in recent years. However, the existing efficiency theory is not always readily applicable to these cases, as the theory was developed, for the most part, under the often neglected premise that both the sample space and the parameter space are normed linear spaces. As a consequence, efficiency results outside normed linear spaces are quite rare and isolated, obtained on a case-by-case basis. This paper aims to develop a more unified asymptotic efficiency theory, allowing the sample space, the parameter space, or both to be Riemannian manifolds satisfying certain regularity conditions. We build a vocabulary that helps translate essential concepts in efficiency theory from normed linear spaces to Riemannian manifolds, such as (locally) regular estimators, differentiable functionals, etc. Efficiency bounds are established under conditions parallel to those for normed linear spaces. We also demonstrate the conceptual advantage of the new framework by applying it to two concrete examples in statistics: the population Frechet mean and the regression coefficient vector of Single-Index Models.
Near-Optimality of Contrastive Divergence Algorithms
Glaser, Pierre, Huang, Kevin Han, Gretton, Arthur
We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1 / 3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1 / 2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.