 Markov Models


Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes

arXiv.org Artificial Intelligence

Reinforcement learning (RL) (1) refers to a class of decision-making problems in which an agent must learn through trial-and-error to act in such a way that maximizes its accumulated return, as encoded by a scalar reward function that maps the agent's states and actions to immediate rewards. RL algorithms, particularly their combination with deep neural networks referred to as deep RL (DRL) (2), have shown remarkable capabilities in solving complex decision-making problems even with high-dimensional observations in domains such as board games (3), video games (4), healthcare (5), and recommendation systems (6). These successes underscore the potential of DRL for controlling robotic systems with high-dimensional state or observation space and highly nonlinear dynamics to perform challenging tasks that conventional decision-making, planning, and control approaches (e.g., classical control, optimal control, sampling-based planning) cannot handle effectively. Yet, the most notable milestones of DRL so far have been achieved in simulation or game environments, where RL agents can learn from extensive experience. In contrast, robots need to complete tasks in the physical world, which presents additional challenges. It is often inefficient and/or unsafe for the RL agents to collect trial-and-error samples directly in the physical world, and it is usually impossible to create an exact replica of the complex real world in simulation. These challenges notwithstanding, recent advances have enabled DRL to succeed at some real-world robotic tasks. For instance, DRL has enabled champion-level drone racing (7) and versatile quadruped locomotion control integrated into production-level quadruped systems (e.g., ANYbotics


Decentralized and Asymmetric Multi-Agent Learning in Construction Sites

arXiv.org Artificial Intelligence

Multi-agent collaboration involves multiple participants working together in a shared environment to achieve a common goal. These agents share information, divide tasks, and synchronize their actions. Key aspects of multi-agent collaboration include coordination, communication, task allocation, cooperation, adaptation, and decentralization. On construction sites, surface grading is the process of leveling sand piles to increase a specific area's height. In this scenario, a bulldozer grades while a dumper allocates sand piles. Our work aims to utilize a multi-agent approach to enable these vehicles to collaborate effectively. To this end, we propose a decentralized and asymmetric multi-agent learning approach for construction sites (DAMALCS). We formulate DAMALCS to reduce expected collisions for operating vehicles. Therefore, we develop two heuristic experts capable of achieving their joint goal optimally by applying an innovative prioritization method. In this approach, the bulldozer's movements take precedence over the dumper's operations, enabling the bulldozer to clear the path for the dumper and ensure continuous operation of both vehicles. Since heuristics alone are insufficient in real-world scenarios, we utilize them to train AI agents, which proves to be highly effective. We simultaneously train the bulldozer and dumper agents to operate within the same environment, aiming to avoid collisions and optimize performance in terms of time efficiency and sand volume handling. Our trained agents and heuristics are evaluated in both simulation and real-world lab experiments, testing them under various conditions, such as visual noise and localization errors. The results demonstrate that our approach significantly reduces collision rates for these vehicles.


Maneuver Decision-Making with Trajectory Streams Prediction for Autonomous Vehicles

arXiv.org Artificial Intelligence

Decision-making, motion planning, and trajectory prediction are crucial in autonomous driving systems. By accurately forecasting the movements of other road users, the decision-making capabilities of the autonomous system can be enhanced, making it more effective in responding to dynamic and unpredictable environments and more adaptive to diverse road scenarios. This paper presents the FFStreams++ approach for decision-making and motion planning of different maneuvers, including unprotected left turn, overtaking, and keep-lane. FFStreams++ is a combination of sampling-based and search-based approaches, where new sampled trajectories for different maneuvers are iteratively generated and optimized, and afterward, a heuristic search planner is called, searching for an optimal plan. We model the autonomous driving system in the Planning Domain Definition Language (PDDL) and search for the optimal plan using a heuristic Fast-Forward planner. In this approach, the initial state of the problem is modified iteratively through streams, which generate maneuver-specific trajectory candidates, increasing the iteration level until an optimal plan is found. FFStreams++ integrates a query-connected network model for predicting possible future trajectories for each surrounding obstacle along with their probabilities. The proposed approach was tested on the CommonRoad simulation framework. We use a collection of randomly generated driving scenarios for overtaking and unprotected left turns at intersections to evaluate the FFStreams++ planner. The test results confirmed that the proposed approach can effectively execute various maneuvers to ensure safety and reduce the risk of collisions with nearby traffic agents.


Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments

arXiv.org Artificial Intelligence

Although deep reinforcement learning (DRL) approaches in audio signal processing have seen substantial progress in recent years, audio-driven DRL for tasks such as navigation, gaze control and head-orientation control in the context of human-robot interaction has received little attention. Here, we propose an audio-driven DRL framework in which we utilise deep Q-learning to develop an autonomous agent that orients towards a talker in the acoustic environment based on stereo speech recordings. Our results show that the agent learned to perform the task at a near perfect level when trained on speech segments in anechoic environments (that is, without reverberation). The presence of reverberation in naturalistic acoustic environments affected the agent's performance, although the agent still substantially outperformed a baseline, randomly acting agent. Finally, we quantified the degree of generalization of the proposed DRL approach across naturalistic acoustic environments. Our experiments revealed that policies learned by agents trained on medium or high reverb environments generalized to low reverb environments, but policies learned by agents trained on anechoic or low reverb environments did not generalize to medium or high reverb environments. Taken together, this study demonstrates the potential of audio-driven DRL for tasks such as head-orientation control and highlights the need for training strategies that enable robust generalization across environments for real-world audio-driven DRL applications.


Aligning Robot Navigation Behaviors with Human Intentions and Preferences

arXiv.org Artificial Intelligence

Recent advances in the field of machine learning have led to new ways for mobile robots to acquire advanced navigational capabilities. However, these learning-based methods raise the possibility that learned navigation behaviors may not align with the intentions and preferences of people, a problem known as value misalignment. To mitigate this risk, this dissertation aims to answer the question: "How can we use machine learning methods to align the navigational behaviors of autonomous mobile robots with human intentions and preferences?" First, this dissertation addresses this question by introducing a new approach to learning navigation behaviors by imitating human-provided demonstrations of the intended navigation task. This contribution allows mobile robots to acquire autonomous visual navigation capabilities through imitation, using a novel objective function that encourages the agent to align with the human's navigation objectives and penalizes misalignment. Second, this dissertation introduces two algorithms to enhance terrain-aware off-road navigation for mobile robots by learning visual terrain awareness in a self-supervised manner. This contribution enables mobile robots to respect a human operator's preferences for navigating different terrains in urban outdoor environments, while extrapolating these preferences to visually novel terrains by leveraging multi-modal representations. Finally, in the context of robot navigation in human-occupied environments, this dissertation introduces a dataset and an algorithm for robot navigation in a socially compliant manner in both indoor and outdoor environments. In summary, the contributions in this dissertation take significant steps toward addressing the value alignment problem in autonomous navigation, enabling mobile robots to navigate autonomously with objectives that align with human intentions and preferences.


A Simple HMM with Self-Supervised Representations for Phone Segmentation

arXiv.org Artificial Intelligence

Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.
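To make the peak-detection baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes the spectrogram is already available as a list of per-frame feature vectors, scores each frame transition by Euclidean distance, and marks local maxima above a threshold as boundaries. The function names, the distance measure, and the threshold are illustrative assumptions.

```python
def boundary_scores(frames):
    """Euclidean distance between each pair of consecutive feature frames."""
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum((a - b) ** 2 for a, b in zip(prev, cur)) ** 0.5)
    return scores


def segment_boundaries(frames, threshold=0.5):
    """Return frame indices where a new segment is assumed to start.

    scores[i] measures the change from frame i to frame i + 1, so a peak
    at position i places the boundary at frame i + 1.
    """
    scores = boundary_scores(frames)
    boundaries = []
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_peak and scores[i] > threshold:
            boundaries.append(i + 1)
    return boundaries


# Toy "spectrogram": three steady segments of three frames each.
toy_frames = [[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3 + [[0.0, 1.0]] * 3
toy_boundaries = segment_boundaries(toy_frames)
```

On the toy input the two segment changes fall at frames 3 and 6. Real Mel spectrograms are noisy, so a practical version would likely smooth the score curve before picking peaks.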


Conditional sampling within generative diffusion models

arXiv.org Machine Learning

As an example, when the density function of π(· | y) is available (up to a constant), Markov chain Monte Carlo (MCMC, Meyn and Tweedie, 2009) methods are popular, generic, and widely used algorithms. The MCMC algorithms simulate a Markov chain that leaves the target distribution invariant. The drawback is that this often makes the algorithms computationally and statistically inefficient for high-dimensional problems. In this article, we discuss an emerging class of samplers that leverage generative diffusions (see, e.g., Benton et al., 2024; Song et al., 2021), which have empirically worked well for many Bayesian inverse problems. At the heart, the generative diffusions aim to find a continuous-time Markov process (e.g., a stochastic differential equation) that bridges the target distribution and a reference measure, so that sampling the target simplifies to sampling the reference and the Markov process. In contrast to traditional samplers such as MCMC, which use the target's density function to build statistically exact samplers, the generative diffusions use the data to approximate a sampler akin to normalising flow (Chen et al., 2018; Papamakarios et al., 2021) and flow matching (Lipman et al., 2023). This comes with at least three benefits compared to MCMC: 1) scalability in the problem dimension (after the training time), 2) no need to explicitly know the target density function, and 3) the resulting samplers are embarrassingly differentiable (see a use case in Watson et al., 2022). However, the generative diffusion framework (for unconditional sampling) is not immediately applicable to conditional sampling, since we do not have the conditional data samples from π(· | y) required to train the generative samplers.
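As an illustration of the MCMC side of this comparison, here is a minimal random-walk Metropolis sketch. It is not taken from the article: the one-dimensional standard normal target stands in for π(· | y), and the function names, step size, and seed are assumptions chosen for the example. Note that only the unnormalised log-density is needed, which matches the "up to a constant" requirement.

```python
import math
import random


def log_target(x):
    """Unnormalised log-density of a standard normal (stand-in for pi(. | y))."""
    return -0.5 * x * x


def random_walk_metropolis(log_density, x0, n_steps, step_size=1.0, seed=0):
    """Simulate a Markov chain that leaves the target distribution invariant."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_accept = log_density(proposal) - log_density(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples


chain = random_walk_metropolis(log_target, x0=0.0, n_steps=20000)
chain_mean = sum(chain) / len(chain)
chain_var = sum((s - chain_mean) ** 2 for s in chain) / len(chain)
```

The sample mean and variance should land close to 0 and 1 for this target. In high dimensions a proposal of this kind mixes slowly, which is the inefficiency the passage refers to.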


Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

arXiv.org Artificial Intelligence

Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
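The inference-time step described above can be sketched as follows. This is an illustrative outline only: `select_action`, `toy_q_value`, and the example action strings are hypothetical, and the lookup table stands in for the trained Q-value model, whose actual interface is not given in the abstract.

```python
from typing import Callable, Sequence


def select_action(
    state: str,
    candidates: Sequence[str],
    q_value: Callable[[str, str], float],
) -> str:
    """Pick the candidate action with the highest estimated Q value."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: q_value(state, action))


def toy_q_value(state: str, action: str) -> float:
    """Hypothetical scorer standing in for a learned step-level Q-value model."""
    scores = {
        "search[red shoes]": 0.2,
        "click[buy now]": 0.7,
        "click[back]": 0.1,
    }
    return scores.get(action, 0.0)


chosen = select_action(
    state="product page for red shoes",
    candidates=["search[red shoes]", "click[buy now]", "click[back]"],
    q_value=toy_q_value,
)
```

Because scoring is separated from candidate generation, the same selection step could in principle sit behind different agents, which is consistent with the generalization claim in the abstract.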


Autonomous Goal Detection and Cessation in Reinforcement Learning: A Case Study on Source Term Estimation

arXiv.org Artificial Intelligence

Reinforcement Learning has revolutionized decision-making processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source's location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively identifies and ceases undefined goals by approximating the agent's belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate the effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC's effectiveness and efficiency in complex, real-world scenarios.


Increasing the Value of Information During Planning in Uncertain Environments

arXiv.org Artificial Intelligence

However, on an important set of problems where there is a large time delay between when the agent can gather information and when it needs to use that information, these solutions fail to adequately consider the value of information. As a result, information gathering actions, even when they are critical in the optimal policy, will be ignored by existing solutions, leading to sub-optimal decisions by the agent. In this research, we develop a novel solution that rectifies this problem by introducing a new algorithm that improves upon state-of-the-art online planning by better reflecting the value of actions that gather information. We do this by adding entropy to the UCB1 heuristic in the POMCP algorithm. We test this solution on the hallway problem. Results indicate that our new algorithm performs significantly better than POMCP. We as humans instinctively gather information or ask clarifying questions when faced with task completion in uncertain situations. We know to do this because, even though we are delaying the task at hand, it is ultimately in our favour to work with complete information. Ideally, online planning algorithms like POMCP [10], whose sole job is to make plans for agents acting in uncertain situations, would know to do the same. They would be able to strategically pick actions that will provide the information to best guide the agent's decision making. However, unlike humans, who can easily correlate information gain with the ease of task accomplishment, these algorithms cannot.
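One way to picture an entropy term inside UCB1 is sketched below. The text does not give the exact formula, so this is an assumption: the standard UCB1 score is extended with a weighted bonus equal to the reduction in belief entropy that an action is expected to bring. The function names, the `weight` parameter, and the toy belief vectors are all hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete belief distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ucb1_score(mean_value, total_visits, action_visits, c=1.0):
    """Standard UCB1: value estimate plus an exploration bonus."""
    if action_visits == 0:
        return math.inf  # untried actions are explored first
    return mean_value + c * math.sqrt(math.log(total_visits) / action_visits)


def entropy_ucb1_score(mean_value, total_visits, action_visits,
                       belief_before, belief_after, c=1.0, weight=1.0):
    """UCB1 plus a bonus for the expected reduction in belief entropy.

    The additive form and the weight are assumptions made for illustration.
    """
    info_gain = shannon_entropy(belief_before) - shannon_entropy(belief_after)
    return ucb1_score(mean_value, total_visits, action_visits, c) + weight * info_gain


uniform_belief = [0.25, 0.25, 0.25, 0.25]
sharpened_belief = [0.97, 0.01, 0.01, 0.01]

# Two actions with identical value estimates and visit counts:
# "move" leaves the belief unchanged, "look" sharpens it.
move_score = entropy_ucb1_score(0.0, 100, 10, uniform_belief, uniform_belief)
look_score = entropy_ucb1_score(0.0, 100, 10, uniform_belief, sharpened_belief)
```

With plain UCB1 the two actions tie. With the entropy bonus the information-gathering action ranks higher, which is the behaviour the passage says existing planners miss.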