Undirected Networks
Active Predicting Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems
Ororbia, Alexander, Mali, Ankur
One of the key goals of brain-inspired computing is to develop methods that draw inspiration from computational neuroscience and cognitive science to build effective adaptive and efficient agents that are capable of intelligently interacting with their environment. Notably, brain-inspired computational research seeks to develop intelligent systems that are capable of circumventing the current limitations of modern-day approaches [1, 2], such as deep neural networks trained by the popular backpropagation of errors (or backprop)[3]. This goal is complementary to (and, to an extent, even a precursor to some elements of) the domain of neurorobotics [4, 5], which focuses on designing robotic devices that contain control systems based on or are inspired by principles of animal/human nervous systems and/or brain structures guided by the key premise that (neural) models are embodied in a body and an environment. While the gap between neurorobotics and many brain-inspired approaches largely is largely divided between focus on real-world hardware (the former) or software simulation (the latter), one pathway to bridging this gap might lie in developing powerful brain-inspired approaches that scale up to and operate robustly on problems that may ultimately be tackled by embodied robotic systems as well as using higher-quality, more realistic simulation platforms (as we do in this work). It is along this path that this work takes a step forward by developing a neurobiologically-grounded neural circuit that is used to craft a complete agent that can tackle extremely sparse reward learning control problems (tested on a more realistic, higher quality robotic system simulator), a problem that many robotic systems must ultimately face, much as humans and animals do in the real world. To build such building neural blocks and an agent system, we start from two neurocognitive theoretical foundations, predictive processing (or coding) and planning-as-inference. With respect to predictive coding, which views the brain as a type of hierarchical, pattern-creation engine [6] that engages in continual self-correction [7], we implement a fundamental circuit where each of its levels/regions are implemented by clusters of neurons that attempt to predict the state of other neural clusters/regions and adjust their synapses based on how different their predictions were from observed signals. This allows us to sidestep many of the key issues central to backprop, such as the vanishing/exploding gradient problems [8], the requirement for a long, unstable credit assignment feedback pathway [9], forward and backward locking problems [10], and the need for differentiability [11, 9]. On the other hand, motivated by planningarXiv:2209.09174v1
Partial sequence labeling with structured Gaussian Processes
Lu, Xiaolei, Chow, Tommy W. S.
Existing partial sequence labeling models mainly focus on max-margin framework which fails to provide an uncertainty estimation of the prediction. Further, the unique ground truth disambiguation strategy employed by these models may include wrong label information for parameter learning. In this paper, we propose structured Gaussian Processes for partial sequence labeling (SGPPSL), which encodes uncertainty in the prediction and does not need extra effort for model selection and hyperparameter learning. The model employs factor-as-piece approximation that divides the linear-chain graph structure into the set of pieces, which preserves the basic Markov Random Field structure and effectively avoids handling large number of candidate output sequences generated by partially annotated data. Then confidence measure is introduced in the model to address different contributions of candidate labels, which enables the ground-truth label information to be utilized in parameter learning. Based on the derived lower bound of the variational lower bound of the proposed model, variational parameters and confidence measures are estimated in the framework of alternating optimization. Moreover, weighted Viterbi algorithm is proposed to incorporate confidence measure to sequence prediction, which considers label ambiguity arose from multiple annotations in the training data and thus helps improve the performance. SGPPSL is evaluated on several sequence labeling tasks and the experimental results show the effectiveness of the proposed model.
Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes
Fu, Zuyue, Qi, Zhengling, Wang, Zhaoran, Yang, Zhuoran, Xu, Yanxun, Kosorok, Michael R.
We study the offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL is facing the following two significant challenges: (i) the agent may be confounded by the unobserved state variables; (ii) the offline data collected a prior does not provide sufficient coverage for the environment. To tackle the above challenges, we study the policy learning in the confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in the confounded MDPs. Then by leveraging pessimism and our identification results, we propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy under minimal data coverage and modeling assumptions. Lastly, our extensive theoretical investigations and one numerical study motivated by the kidney transplantation demonstrate the promising performance of the proposed methods.
A Map-matching Algorithm with Extraction of Multi-group Information for Low-frequency Data
Fang, Jie, Wu, Xiongwei, Lin, Dianchao, Xu, Mengyun, Wu, Huahua, Wu, Xuesong, Bi, Ting
The growing use of probe vehicles generates a huge number of GNSS data. Limited by the satellite positioning technology, further improving the accuracy of map-matching is challenging work, especially for low-frequency trajectories. When matching a trajectory, the ego vehicle's spatial-temporal information of the present trip is the most useful with the least amount of data. In addition, there are a large amount of other data, e.g., other vehicles' state and past prediction results, but it is hard to extract useful information for matching maps and inferring paths. Most map-matching studies only used the ego vehicle's data and ignored other vehicles' data. Based on it, this paper designs a new map-matching method to make full use of "Big data". We first sort all data into four groups according to their spatial and temporal distance from the present matching probe which allows us to sort for their usefulness. Then we design three different methods to extract valuable information (scores) from them: a score for speed and bearing, a score for historical usage, and a score for traffic state using the spectral graph Markov neutral network. Finally, we use a modified top-K shortest-path method to search the candidate paths within an ellipse region and then use the fused score to infer the path (projected location). We test the proposed method against baseline algorithms using a real-world dataset in China. The results show that all scoring methods can enhance map-matching accuracy. Furthermore, our method outperforms the others, especially when GNSS probing frequency is less than 0.01 Hz.
Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback
Dai, Yan, Luo, Haipeng, Chen, Liyu
We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions are changing over time and adversarially chosen, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback). While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about the Follow-the-Perturbed-Leader (FTPL) methods, which are usually computationally more efficient and also easier to implement since it only requires solving an offline planning problem. Motivated by this, we take a closer look at FTPL for learning AMDPs, starting from the standard episodic finite-horizon setting. We find some unique and intriguing difficulties in the analysis and propose a workaround to eventually show that FTPL is also able to achieve near-optimal regret bounds in this case. More importantly, we then find two significant applications: First, the analysis of FTPL turns out to be readily generalizable to delayed bandit feedback with order-optimal regret, while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using FTPL, we also develop the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions. Our algorithm is efficient assuming access to an offline planning oracle, while even for the easier full-information setting, the only existing algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
Towards Robust Off-Policy Evaluation via Human Inputs
Singh, Harvineet, Joshi, Shalmali, Doshi-Velez, Finale, Lakkaraju, Himabindu
Off-policy Evaluation (OPE) methods are crucial tools for evaluating policies in high-stakes domains such as healthcare, where direct deployment is often infeasible, unethical, or expensive. When deployment environments are expected to undergo changes (that is, dataset shifts), it is important for OPE methods to perform robust evaluation of the policies amidst such changes. Existing approaches consider robustness against a large class of shifts that can arbitrarily change any observable property of the environment. This often results in highly pessimistic estimates of the utilities, thereby invalidating policies that might have been useful in deployment. In this work, we address the aforementioned problem by investigating how domain knowledge can help provide more realistic estimates of the utilities of policies. We leverage human inputs on which aspects of the environments may plausibly change, and adapt the OPE methods to only consider shifts on these aspects. Specifically, we propose a novel framework, Robust OPE (ROPE), which considers shifts on a subset of covariates in the data based on user inputs, and estimates worst-case utility under these shifts. We then develop computationally efficient algorithms for OPE that are robust to the aforementioned shifts for contextual bandits and Markov decision processes. We also theoretically analyze the sample complexity of these algorithms. Extensive experimentation with synthetic and real world datasets from the healthcare domain demonstrates that our approach not only captures realistic dataset shifts accurately, but also results in less pessimistic policy evaluations.
Stability and Generalization for Markov Chain Stochastic Gradient Methods
Wang, Puyu, Lei, Yunwen, Ying, Yiming, Zhou, Ding-Xuan
Recently there is a large amount of work devoted to the study of Markov chain stochastic gradient methods (MC-SGMs) which mainly focus on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs for both minimization and minimax problems through the lens of algorithmic stability in the framework of statistical learning theory. For empirical risk minimization (ERM) problems, we establish the optimal excess population risk bounds for both smooth and non-smooth cases by introducing on-average argument stability. For minimax problems, we develop a quantitative connection between on-average argument stability and generalization error which extends the existing results for uniform stability \cite{lei2021stability}. We further develop the first nearly optimal convergence rates for convex-concave problems both in expectation and with high probability, which, combined with our stability results, show that the optimal generalization bounds can be attained for both smooth and non-smooth cases. To the best of our knowledge, this is the first generalization analysis of SGMs when the gradients are sampled from a Markov process.
Disentangling Shape and Pose for Object-Centric Deep Active Inference Models
Ferraro, Stefano, Van de Maele, Toon, Mazzaglia, Pietro, Verbelen, Tim, Dhoedt, Bart
Active inference is a first principles approach for understanding the brain in particular, and sentient agents in general, with the single imperative of minimizing free energy. As such, it provides a computational account for modelling artificial intelligent agents, by defining the agent's generative model and inferring the model parameters, actions and hidden state beliefs. However, the exact specification of the generative model and the hidden state space structure is left to the experimenter, whose design choices influence the resulting behaviour of the agent. Recently, deep learning methods have been proposed to learn a hidden state space structure purely from data, alleviating the experimenter from this tedious design task, but resulting in an entangled, non-interpreteable state space. In this paper, we hypothesize that such a learnt, entangled state space does not necessarily yield the best model in terms of free energy, and that enforcing different factors in the state space can yield a lower model complexity. In particular, we consider the problem of 3D object representation, and focus on different instances of the ShapeNet dataset. We propose a model that factorizes object shape, pose and category, while still learning a representation for each factor using a deep neural network. We show that models, with best disentanglement properties, perform best when adopted by an active agent in reaching preferred observations.
Paper review: PAIRED
When browsing through new data science papers, from time to time you encounter clever new ideas. And even though they sometimes aren't broadly adopted yet, they can yield great potential. I believe the paper I'll talk about today is one of those papers: "Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design" by Michael Dennis, Natasha Jaques et al. from the Google Brain team. The title is a mouthful, but in short, the paper talks about an automated way to generate increasingly challenging environments for RL models. You can find the paper here.
ProAPT: Projection of APT Threats with Deep Reinforcement Learning
Dehghan, Motahareh, Sadeghiyan, Babak, Khosravian, Erfan, Moghaddam, Alireza Sedighi, Nooshi, Farshid
The highest level in the Endsley situation awareness model is called projection when the status of elements in the environment in the near future is predicted. In cybersecurity situation awareness, the projection for an Advanced Persistent Threat (APT) requires predicting the next step of the APT. The threats are constantly changing and becoming more complex. As supervised and unsupervised learning methods require APT datasets for projecting the next step of APTs, they are unable to identify unknown APT threats. In reinforcement learning methods, the agent interacts with the environment, and so it might project the next step of known and unknown APTs. So far, reinforcement learning has not been used to project the next step for APTs. In reinforcement learning, the agent uses the previous states and actions to approximate the best action of the current state. When the number of states and actions is abundant, the agent employs a neural network which is called deep learning to approximate the best action of each state. In this paper, we present a deep reinforcement learning system to project the next step of APTs. As there exists some relation between attack steps, we employ the Long- Short-Term Memory (LSTM) method to approximate the best action of each state. In our proposed system, based on the current situation, we project the next steps of APT threats.