AITopics

System optimal traffic routing can mitigate congestion by assigning routes for a portion of vehicles so that the total travel time of all vehicles in the transportation system can be reduced. However, achieving real-time optimal routing poses challenges due to uncertain demands and unknown system dynamics, particularly in expansive transportation networks. While physics model-based methods are sensitive to uncertainties and model mismatches, model-free reinforcement learning struggles with learning inefficiencies and interpretability issues. Our paper presents TransRL, a novel algorithm that integrates reinforcement learning with physics models for enhanced performance, reliability, and interpretability. TransRL begins by establishing a deterministic policy grounded in physics models, from which it learns from and is guided by a differentiable and stochastic teacher policy. During training, TransRL aims to maximize cumulative rewards while minimizing the Kullback Leibler (KL) divergence between the current policy and the teacher policy. This approach enables TransRL to simultaneously leverage interactions with the environment and insights from physics models. We conduct experiments on three transportation networks with up to hundreds of links. The results demonstrate TransRL's superiority over traffic model-based methods for being adaptive and learning from the actual network data. By leveraging the information from physics models, TransRL consistently outperforms state-of-the-art reinforcement learning algorithms such as proximal policy optimization (PPO) and soft actor critic (SAC). Moreover, TransRL's actions exhibit higher reliability and interpretability compared to baseline reinforcement learning approaches like PPO and SAC.

reinforcement, scenario, transrl, (15 more...)

2407.07364

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Minnesota (0.04)
North America > United States > Maryland (0.04)
(4 more...)

Genre: Research Report > New Finding (0.88)

Industry:

Transportation > Passenger (1.00)
Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Schulte, Oliver, Poupart, Pascal

Why Online Reinforcement Learning is Causal

Reinforcement learning (RL) and causal modelling naturally complement each other. The goal of causal modelling is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper examines which reinforcement learning settings we can expect to benefit from causal modelling, and how. In online learning, the agent has the ability to interact directly with their environment, and learn from exploring it. Our main argument is that in online learning, conditional probabilities are causal, and therefore offline RL is the setting where causal learning has the most potential to make a difference. Essentially, the reason is that when an agent learns from their {\em own} experience, there are no unobserved confounders that influence both the agent's own exploratory actions and the rewards they receive. Our paper formalizes this argument. For offline RL, where an agent may and typically does learn from the experience of {\em others}, we describe previous and new methods for leveraging a causal model, including support for counterfactual queries.

agent, conditional probability, probability, (15 more...)

2403.04221

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
(3 more...)

Genre:

Research Report (1.00)
Instructional Material > Online (0.40)

Industry:

Education (0.68)
Leisure & Entertainment > Sports > Hockey (0.68)
Leisure & Entertainment > Games (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.94)

Resource Allocation for Twin Maintenance and Computing Task Processing in Digital Twin Vehicular Edge Computing Network

Xie, Yu, Wu, Qiong, Fan, Pingyi, Cheng, Nan, Chen, Wen, Wang, Jiangzhou, Letaief, Khaled B.

As a promising technology, vehicular edge computing (VEC) can provide computing and caching services by deploying VEC servers near vehicles. However, VEC networks still face challenges such as high vehicle mobility. Digital twin (DT), an emerging technology, can predict, estimate, and analyze real-time states by digitally modeling objects in the physical world. By integrating DT with VEC, a virtual vehicle DT can be created in the VEC server to monitor the real-time operating status of vehicles. However, maintaining the vehicle DT model requires ongoing attention from the VEC server, which also needs to offer computing services for the vehicles. Therefore, effective allocation and scheduling of VEC server resources are crucial. This study focuses on a general VEC network with a single VEC service and multiple vehicles, examining the two types of delays caused by twin maintenance and computational processing within the network. By transforming the problem using satisfaction functions, we propose an optimization problem aimed at maximizing each vehicle's resource utility to determine the optimal resource allocation strategy. Given the non-convex nature of the issue, we employ multi-agent Markov decision processes to reformulate the problem. Subsequently, we propose the twin maintenance and computing task processing resource collaborative scheduling (MADRL-CSTC) algorithm, which leverages multi-agent deep reinforcement learning. Through experimental comparisons with alternative algorithms, it demonstrates that our proposed approach is effective in terms of resource allocation.

algorithm, server, vehicle, (14 more...)

2407.07575

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Beijing > Beijing (0.04)
(10 more...)

Genre: Research Report (0.64)

Industry:

Information Technology (0.93)
Telecommunications (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Khoushabar, MohammadAmin Ansari, Ghafariasl, Parviz

Advanced Meta-Ensemble Machine Learning Models for Early and Accurate Sepsis Prediction to Improve Patient Outcomes

Sepsis, a critical condition from the body's response to infection, poses a major global health crisis affecting all age groups. Timely detection and intervention are crucial for reducing healthcare expenses and improving patient outcomes. This paper examines the limitations of traditional sepsis screening tools like Systemic Inflammatory Response Syndrome, Modified Early Warning Score, and Quick Sequential Organ Failure Assessment, highlighting the need for advanced approaches. We propose using machine learning techniques - Random Forest, Extreme Gradient Boosting, and Decision Tree models - to predict sepsis onset. Our study evaluates these models individually and in a combined meta-ensemble approach using key metrics such as Accuracy, Precision, Recall, F1 score, and Area Under the Receiver Operating Characteristic Curve. Results show that the meta-ensemble model outperforms individual models, achieving an AUC-ROC score of 0.96, indicating superior predictive accuracy for early sepsis detection. The Random Forest model also performs well with an AUC-ROC score of 0.95, while Extreme Gradient Boosting and Decision Tree models score 0.94 and 0.90, respectively.

base model, prediction, sepsis, (16 more...)

2407.08107

Country:

Asia > Middle East > Israel (0.05)
North America > United States > Virginia (0.04)
North America > United States > Kansas > Riley County > Manhattan (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

arXiv.org Machine LearningJul-10-2024

Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

Zhang, Dake, Lyu, Boxiang, Qiu, Shuang, Kolar, Mladen, Zhang, Tong

We study risk-sensitive reinforcement learning (RL), a crucial field due to its ability to enhance decision-making in scenarios where it is essential to manage uncertainty and minimize potential adverse outcomes. Particularly, our work focuses on applying the entropic risk measure to RL problems. While existing literature primarily investigates the online setting, there remains a large gap in understanding how to efficiently derive a near-optimal policy based on this risk measure using only a pre-collected dataset. We center on the linear Markov Decision Process (MDP) setting, a well-regarded theoretical framework that has yet to be examined from a risk-sensitive standpoint. In response, we introduce two provably sample-efficient algorithms. We begin by presenting a risk-sensitive pessimistic value iteration algorithm, offering a tight analysis by leveraging the structure of the risk-sensitive performance measure. To further improve the obtained bounds, we propose another pessimistic algorithm that utilizes variance information and reference-advantage decomposition, effectively improving both the dependence on the space dimension $d$ and the risk-sensitivity factor. To the best of our knowledge, we obtain the first provably efficient risk-sensitive offline RL algorithms.

algorithm, pessimism meet risk, theorem 5, (10 more...)

arXiv.org Machine Learning

2407.07631

Country:

Europe > Austria > Vienna (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Infante, Guillermo, Jonsson, Anders, Gómez, Vicenç

Hierarchical Average-Reward Linearly-solvable Markov Decision Processes

We introduce a novel approach to hierarchical reinforcement learning for Linearly-solvable Markov Decision Processes (LMDPs) in the infinite-horizon average-reward setting. Unlike previous work, our approach allows learning low-level and high-level tasks simultaneously, without imposing limiting restrictions on the low-level tasks. Our method relies on partitions of the state space that create smaller subtasks that are easier to solve, and the equivalence between such partitions to learn more efficiently. We then exploit the compositionality of low-level tasks to exactly represent the value function of the high-level task. Experiments show that our approach can outperform flat average-reward reinforcement learning by one or several orders of magnitude.

lmdp, subtask, value function, (14 more...)

2407.0669

Country:

North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(2 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

Mustafin, Arsenii, Pakharev, Aleksei, Olshevsky, Alex, Paschalidis, Ioannis Ch.

MDP Geometry, Normalization and Value Free Solvers

Markov Decision Process (MDP) is a common mathematical model for sequential decision-making problems. In this paper, we present a new geometric interpretation of MDP, which is useful for analyzing the dynamics of main MDP algorithms. Based on this interpretation, we demonstrate that MDPs can be split into equivalence classes with indistinguishable algorithm dynamics. The related normalization procedure allows for the design of a new class of MDP-solving algorithms that find optimal policies without computing policy values.

algorithm, mdp, vector, (14 more...)

2407.06712

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > France > Nouvelle-Aquitaine > Gironde > Bordeaux (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Giral, Francisco, Gómez, Ignacio, Clainche, Soledad Le

Intercepting Unauthorized Aerial Robots in Controlled Airspace Using Reinforcement Learning

The proliferation of unmanned aerial vehicles (UAVs) in controlled airspace presents significant risks, including potential collisions, disruptions to air traffic, and security threats. Ensuring the safe and efficient operation of airspace, particularly in urban environments and near critical infrastructure, necessitates effective methods to intercept unauthorized or non-cooperative UAVs. This work addresses the critical need for robust, adaptive systems capable of managing such threats through the use of Reinforcement Learning (RL). We present a novel approach utilizing RL to train fixed-wing UAV pursuer agents for intercepting dynamic evader targets. Our methodology explores both model-based and model-free RL algorithms, specifically DreamerV3, Truncated Quantile Critics (TQC), and Soft Actor-Critic (SAC). The training and evaluation of these algorithms were conducted under diverse scenarios, including unseen evasion strategies and environmental perturbations. Our approach leverages high-fidelity flight dynamics simulations to create realistic training environments. This research underscores the importance of developing intelligent, adaptive control systems for UAV interception, significantly contributing to the advancement of secure and efficient airspace management. It demonstrates the potential of RL to train systems capable of autonomously achieving these critical tasks.

agent, algorithm, unauthorized aerial robot, (12 more...)

2407.06909

Country:

North America > United States (0.28)
Europe > Spain > Galicia > Madrid (0.05)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Air (1.00)
Aerospace & Defense (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Lu, Michael, Aghaei, Matin, Raj, Anant, Vaswani, Sharan

We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective's smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g. the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of the resulting algorithm. We show that the proposed algorithm offers similar theoretical guarantees as the state-of-the art results, but does not require the knowledge of oracle-like quantities. For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise. Finally, we empirically compare the proposed methods to PG approaches that require oracle knowledge, and demonstrate competitive performance.

conference paper, convergence, rlc 2024, (15 more...)

2405.13136

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

arXiv.org Machine LearningJul-9-2024

Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Allen, Cameron, Kirtland, Aaron, Tao, Ruo Yu, Lobel, Sam, Scott, Daniel, Petrocelli, Nicholas, Gottesman, Omer, Parr, Ronald, Littman, Michael L., Konidaris, George

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to--or knowledge of--an underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($\lambda$) with a different value of $\lambda$. Since TD($\lambda$=0) makes an implicit Markov assumption and TD($\lambda$=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the $\lambda$-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $\lambda$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.

agent, algorithm, pomdp, (16 more...)

arXiv.org Machine Learning

2407.07333

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)