 Markov Models


Privileged Sensing Scaffolds Reinforcement Learning

arXiv.org Artificial Intelligence

We need to look at our shoelaces as we first learn to tie them but having mastered this skill, we can do it from touch alone. We call this phenomenon "sensory scaffolding": observation streams that are not needed by a master might yet aid a novice learner. We consider such sensory scaffolding setups for training artificial agents. For example, a robot arm may need to be deployed with just a low-cost, robust, general-purpose camera; yet its performance may improve by having privileged training-time-only access to informative albeit expensive and unwieldy motion capture rigs or fragile tactile sensors. For these settings, we propose Scaffolder, a reinforcement learning approach which effectively exploits privileged sensing in critics, world models, reward estimators, and other such auxiliary components that are only used at training time, to improve the target policy. For evaluating sensory scaffolding agents, we design a new "S3" suite of ten diverse simulated robotic tasks that explore a wide range of practical sensor setups. Agents must use privileged camera sensing to train blind hurdlers, privileged active visual perception to help robot arms overcome visual occlusions, privileged touch sensors to train robot hands, and more. Scaffolder easily outperforms relevant prior baselines and frequently performs comparably even to policies that have test-time access to the privileged sensors. It is well-known that Beethoven composed symphonies long after he had fully lost his hearing. Such feats are commonly held to be evidence of mastery: for example, novice typists need to look at the keyboard to locate keys but with practice, can graduate to typing without looking. Thus, sensing requirements may be different during learning versus after learning. 
We refer to this as "sensory scaffolding", drawing inspiration from the concept of scaffolding teaching mechanisms in psychology that provide temporary support for a student (Wood et al., 1976; Vygotsky et al., 2011), like training wheels when learning to ride a bicycle. For artificial learning agents such as robots, sensory scaffolding permits decoupling the observation streams required at test time from those that are used to train the agent. The sensors available in a deployed robot are often decided by practical considerations such as cost, robustness, size, compute requirements, and ease of instrumentation, e.g., autonomous cars with only cheap and robust RGB camera sensors. However, those considerations might carry less weight at training time, so a robot learning practitioner may choose to scaffold policy learning with privileged information (Vapnik & Vashist, 2009) from extra sensors available only at training. In the case of the cars above, the manufacturer might equip a small fleet of training cars with expensive privileged sensors like lidar to improve RGB-only driving policies for customers to install in their cars.


Controlling Behavioral Diversity in Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

The study of behavioral diversity in Multi-Agent Reinforcement Learning (MARL) is a nascent yet promising field. In this context, the present work deals with the question of how to control the diversity of a multi-agent system. With no existing approaches to control diversity to a set value, current solutions focus on blindly promoting it via intrinsic rewards or additional loss functions, effectively changing the learning objective and lacking a principled measure for it. To address this, we introduce Diversity Control (DiCo), a method able to control diversity to an exact value of a given metric by representing policies as the sum of a parameter-shared component and dynamically scaled per-agent components. By applying constraints directly to the policy architecture, DiCo leaves the learning objective unchanged, enabling its applicability to any actor-critic MARL algorithm. We theoretically prove that DiCo achieves the desired diversity, and we provide several experiments, both in cooperative and competitive tasks, that show how DiCo can be employed as a novel paradigm to increase performance and sample efficiency in MARL. Multimedia results are available on the paper's website: https://sites.google.com/view/dico-marl.
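The decomposition described above (a shared component plus dynamically scaled per-agent components) can be illustrated with a toy sketch. This is not the DiCo implementation: the diversity metric used here (mean pairwise Euclidean distance between agents' outputs at a single observation), the function names, and the fixed output vectors standing in for networks are all assumptions made for illustration.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Assumed diversity metric: mean Euclidean distance over all agent pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def controlled_actions(shared, deviations, target_diversity):
    """Combine a shared output with rescaled per-agent deviations.

    shared:      output of the parameter-shared component (one vector)
    deviations:  raw outputs of the per-agent components (one vector per agent)
    The deviations are multiplied by one common factor so that the mean
    pairwise distance between the resulting agent outputs equals the target.
    """
    current = mean_pairwise_distance(deviations)
    # Hypothetical convention: if the agents do not differ at all, use scale 0.
    scale = target_diversity / current if current > 0 else 0.0
    return [[s + scale * d for s, d in zip(shared, dev)] for dev in deviations]


shared_out = [0.2, -0.1]
per_agent = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
actions = controlled_actions(shared_out, per_agent, target_diversity=0.5)
print(mean_pairwise_distance(actions))
```

Because the shared term cancels in every pairwise difference and the distance is homogeneous in the scale factor, rescaling the deviations sets this toy metric to the target without touching the learning objective.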


Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Intelligent Cyber-Physical Systems (ICPS) represent a specialized form of Cyber-Physical System (CPS) that incorporates intelligent components, notably Convolutional Neural Networks (CNNs) and Deep Reinforcement Learning (DRL), to undertake multifaceted tasks encompassing perception, decision-making, and control. The utilization of DRL for decision-making facilitates dynamic interaction with the environment, generating control actions aimed at maximizing cumulative rewards. Nevertheless, the inherent uncertainty of the operational environment and the intricate nature of ICPS necessitate exploration within complex and dynamic state spaces during the learning phase. DRL confronts challenges in terms of efficiency, generalization capabilities, and data scarcity during the decision-making process. In response to these challenges, we propose an innovative abstract modeling approach grounded in spatio-temporal value semantics, capturing the evolution in the distribution of semantic value across time and space. A semantics-based abstraction is introduced to construct an abstract Markov Decision Process (MDP) for the DRL learning process. Furthermore, optimization techniques for abstraction are delineated, aiming to refine the abstract model and mitigate semantic gaps between abstract and concrete states. The efficacy of the abstract modeling is assessed through the evaluation and analysis of the abstract MDP model using PRISM. A series of experiments are conducted, involving diverse scenarios such as lane-keeping, adaptive cruise control, and intersection crossroad assistance, to demonstrate the effectiveness of our abstraction approach.


Reinforcing Language Agents via Policy Optimization with Action Decomposition

arXiv.org Artificial Intelligence

Language models as intelligent agents push the boundaries of sequential decision-making agents but struggle with limited knowledge of environmental dynamics and exponentially large action spaces. Recent efforts like GLAM and TWOSOME manually constrain the action space to a restricted subset and employ reinforcement learning to align agents' knowledge with specific environments. However, they overlook fine-grained credit assignment for intra-action tokens, which is essential for efficient language agent optimization, and rely on human prior knowledge to restrict the action space. This paper proposes decomposing language agent optimization from the action level to the token level, offering finer supervision for each intra-action token and manageable optimization complexity in environments with unrestricted action spaces. Beginning with the simplification of flattening all actions, we theoretically explore the discrepancies between action-level optimization and this naive token-level optimization. We then derive the Bellman backup with Action Decomposition (BAD) to integrate credit assignments for both intra-action and inter-action tokens, effectively eliminating the discrepancies. Implementing BAD within the PPO algorithm, we introduce Policy Optimization with Action Decomposition (POAD). POAD benefits from a finer-grained credit assignment process and lower optimization complexity, leading to enhanced learning efficiency and generalization abilities in aligning language agents with interactive environments. We validate POAD across diverse testbeds, with results affirming the advantages of our approach and the correctness of our theoretical analysis.


A Counterfactual Analysis of the Dishonest Casino

arXiv.org Artificial Intelligence

The dishonest casino is a well-known hidden Markov model (HMM) used in educational settings to introduce HMMs and graphical models. Here, a sequence of die rolls is observed, with the casino switching between a fair and a loaded die. Typically, the goal is to use the observed rolls to infer the pattern of fair and loaded dice, leading to filtering, smoothing, and Viterbi algorithms. This paper, however, explores how much of the winnings is attributable to the casino's cheating, a counterfactual question beyond the scope of HMM primitives. To address this, we introduce a structural causal model (SCM) consistent with the HMM and show that the expected winnings attributable to cheating (EWAC) can be bounded using linear programs (LPs). Through numerical experiments, we compute these bounds and develop intuition using benchmark SCMs based on independence, comonotonic, and counter-monotonic copulas. We show that tighter bounds are obtained with a time-homogeneity condition on the SCM, while looser bounds allow for an almost explicit LP solution. Domain-specific knowledge like pathwise monotonicity or counterfactual stability can be incorporated via linear constraints. Our work contributes to bounding counterfactuals in causal inference and is the first to develop LP bounds in a dynamic HMM setting, benefiting educational contexts where counterfactual inference is taught.
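For readers unfamiliar with the HMM primitives mentioned above, filtering can be sketched in a few lines. The transition and emission probabilities below are illustrative assumptions (commonly used textbook-style values), not parameters taken from the paper, and the sketch covers only standard filtering, not the counterfactual LP bounds.

```python
# Illustrative parameters (assumptions, not from the paper).
STATES = ("fair", "loaded")
INIT = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMIT = {
    "fair": {face: 1 / 6 for face in range(1, 7)},
    # Loaded die: a six comes up half the time, other faces 0.1 each.
    "loaded": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def forward_filter(rolls):
    """Return P(die at time t | rolls up to t) for every t (normalized forward pass)."""
    beliefs = []
    prior = dict(INIT)
    for roll in rolls:
        # Update step: weight the prior by the likelihood of the observed roll.
        unnorm = {s: prior[s] * EMIT[s][roll] for s in STATES}
        total = sum(unnorm.values())
        posterior = {s: unnorm[s] / total for s in STATES}
        beliefs.append(posterior)
        # Predict step: push the posterior through the transition model.
        prior = {s: sum(posterior[p] * TRANS[p][s] for p in STATES) for s in STATES}
    return beliefs


for t, belief in enumerate(forward_filter([6, 6, 6, 2, 3])):
    print(t, round(belief["loaded"], 3))
```

Under these assumed parameters, a run of sixes pushes the filtered probability of the loaded die upward, and rolls of other faces pull it back down.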


ULTRA-MC: A Unified Approach to Learning Mixtures of Markov Chains via Hitting Times

arXiv.org Artificial Intelligence

This study introduces a novel approach for learning mixtures of Markov chains, a critical process applicable to various fields, including healthcare and the analysis of web users. Existing research has identified a clear divide in methodologies for learning mixtures of discrete and continuous-time Markov chains, while the latter presents additional complexities for recovery accuracy and efficiency. We introduce a unifying strategy for learning mixtures of discrete and continuous-time Markov chains, focusing on hitting times, which are well defined for both types. Specifically, we design a reconstruction algorithm that outputs a mixture which accurately reflects the estimated hitting times and demonstrates resilience to noise. We introduce an efficient gradient-descent approach, specifically tailored to manage the computational complexity and non-symmetric characteristics inherent in the calculation of hitting time derivatives. Our approach is also of significant interest when applied to a single Markov chain, thus extending the methodologies previously established by Hoskins et al. and Wittmann et al. We complement our theoretical work with experiments conducted on synthetic and real-world datasets, providing a comprehensive evaluation of our methodology.
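Since hitting times are the central quantity here, a small sketch of how expected hitting times are computed for a single, known discrete-time chain may help. It solves the standard system h(target) = 0, h(i) = 1 + Σ_j P[i][j] h(j) by fixed-point iteration. This is not the ULTRA-MC reconstruction algorithm; the function name and the example chain are assumptions for illustration, and the iteration assumes the target is reachable from every state.

```python
def expected_hitting_times(P, target, tol=1e-12, max_sweeps=100_000):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Assumes `target` is reachable from every state, so the iteration converges.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(max_sweeps):
        new_h = [
            0.0 if i == target
            else 1.0 + sum(P[i][j] * h[j] for j in range(n) if j != target)
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    return h


# Hypothetical 3-state chain; state 2 is the target.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.00, 1.00],
]
print(expected_hitting_times(P, target=2))
```

For this chain the equations can be solved by hand: h(0) = 2 + h(1) and h(1) = 2 + 0.5 h(0), giving h(1) = 6 and h(0) = 8, which the iteration approaches.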


Creativity and Markov Decision Processes

arXiv.org Artificial Intelligence

Creativity is already regularly attributed to AI systems outside specialised computational creativity (CC) communities. However, the evaluation of creativity in AI at large typically lacks grounding in creativity theory, which can promote inappropriate attributions and limit the analysis of creative behaviour. While CC researchers have translated psychological theory into formal models, the value of these models is limited by a gap to common AI frameworks. To mitigate this limitation, we identify formal mappings between Boden's process theory of creativity and Markov Decision Processes (MDPs), using the Creative Systems Framework as a stepping stone. We study three out of eleven mappings in detail to understand which types of creative processes, opportunities for creativity (aberrations), and threats to creativity (uninspiration) could be observed in an MDP. We conclude by discussing quality criteria for the selection of such mappings for future work and applications.


A Poisson-Gamma Dynamic Factor Model with Time-Varying Transition Dynamics

arXiv.org Artificial Intelligence

Probabilistic approaches for handling count-valued time sequences have attracted considerable research attention because of their ability to infer explainable latent structures and to estimate uncertainties, and thus are especially suitable for dealing with noisy and incomplete count data. Among these models, Poisson-Gamma Dynamical Systems (PGDSs) have proven effective in capturing the evolving dynamics underlying observed count sequences. However, the state-of-the-art PGDS still fails to capture the time-varying transition dynamics that are commonly observed in real-world count time sequences. To close this gap, a non-stationary PGDS is proposed that allows the underlying transition matrices to evolve over time, with the evolving transition matrices modeled by specially designed Dirichlet Markov chains. Leveraging Dirichlet-Multinomial-Beta data augmentation techniques, a fully conjugate and efficient Gibbs sampler is developed to perform posterior simulation. Experiments show that, in comparison with related models, the proposed non-stationary PGDS achieves improved predictive performance due to its capacity to learn the non-stationary dependency structure captured by the time-evolving transition matrices.


Agent Planning with World Knowledge Model

arXiv.org Artificial Intelligence

Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with blind trial-and-error in global planning and with generating hallucinatory actions in local planning, due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Our analysis further illustrates that WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) a weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. Code will be available at https://github.com/zjunlp/WKM.


Deterministic Policies for Constrained Reinforcement Learning in Polynomial-Time

arXiv.org Artificial Intelligence

Constrained Reinforcement Learning (CRL) traditionally produces stochastic, expectation-constrained policies that can behave undesirably: imagine a self-driving car that randomly changes lanes or runs out of fuel. However, artificial decision-making systems must be predictable, trustworthy, and robust. One approach to ensuring these qualities is to focus on deterministic policies, which are inherently predictable and trustworthy. Moreover, they are easy to implement [10], reliable for autonomous vehicles [16, 12], and effective for multi-agent coordination [23]. Similarly, almost sure and anytime constraints [21] provide inherent trustworthiness and robustness, essential for applications in medicine [6, 22, 18], disaster relief [9, 29, 27], and resource management [20, 19, 24, 4]. Despite the advantages of deterministic policies and stricter constraints, their computation remains an open challenge in CRL. Our research aims to address this challenge by studying the computational complexity of computing deterministic policies for a wide range of constraint types. Consider a constrained Markov Decision Process (cMDP) denoted by M. Let C represent an arbitrary cost criterion and B be the available budget.