AITopics

Traditional reinforcement learning control for quadruped robots often relies on privileged information, demanding meticulous selection and precise estimation, thereby imposing constraints on the development process. This work proposes a Self-learning Latent Representation (SLR) method, which achieves high-performance control policy learning without the need for privileged information. To enhance the credibility of our proposed method's evaluation, SLR is compared with open-source code repositories of state-of-the-art algorithms, retaining the original authors' configuration parameters. Across four repositories, SLR consistently outperforms the reference results. Ultimately, the trained policy and encoder empower the quadruped robot to navigate steps, climb stairs, ascend rocks, and traverse various challenging terrains. Robot experiment videos are at https://11chens.github.io/SLR/

artificial intelligence, machine learning, robot, (13 more...)

2406.04835

Country: Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots > Locomotion (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Jonnarth, Arvi, Johansson, Ola, Felsberg, Michael

Sim-to-real Transfer of Deep Reinforcement Learning Agents for Online Coverage Path Planning

Sim-to-real transfer presents a difficult challenge, where models trained in simulation are to be deployed in the real world. The distribution shift between the two settings leads to biased representations of the perceived real-world environment, and thus to suboptimal predictions. In this work, we tackle the challenge of sim-to-real transfer of reinforcement learning (RL) agents for coverage path planning (CPP). In CPP, the task is for a robot to find a path that visits every point of a confined area. Specifically, we consider the case where the environment is unknown, and the agent needs to plan the path online while mapping the environment. We bridge the sim-to-real gap through a semi-virtual environment with a simulated sensor and obstacles, while including real robot kinematics and real-time aspects. We investigate what level of fine-tuning is needed for adapting to a realistic setting, comparing to an agent trained solely in simulation. We find that a high model inference frequency is sufficient for reducing the sim-to-real gap, while fine-tuning degrades performance initially. By training the model in simulation and deploying it at a high inference frequency, we transfer state-of-the-art results from simulation to the real domain, where direct learning would take in the order of weeks with manual interaction, i.e., would be completely infeasible.

agent, fine-tuning, simulation, (14 more...)

2406.0492

Country:

Europe > Sweden > Östergötland County > Linköping (0.04)
Europe > Germany (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Belaire, Roman, Sinha, Arunesh, Varakantham, Pradeep

Probabilistic Perspectives on Error Minimization in Adversarial Reinforcement Learning

Deep Reinforcement Learning (DRL) policies are critically vulnerable to adversarial noise in observations, posing severe risks in safety-critical scenarios. For example, a self-driving car receiving manipulated sensory inputs about traffic signs could lead to catastrophic outcomes. Existing strategies to fortify RL algorithms against such adversarial perturbations generally fall into two categories: (a) using regularization methods that enhance robustness by incorporating adversarial loss terms into the value objectives, and (b) adopting "maximin" principles, which focus on maximizing the minimum value to ensure robustness. While regularization methods reduce the likelihood of successful attacks, their effectiveness drops significantly if an attack does succeed. On the other hand, maximin objectives, although robust, tend to be overly conservative. To address this challenge, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which naturally balances optimizing value and robustness against adversarial attacks. To optimize ACoE in a scalable manner in model-free settings, we propose a theoretically justified surrogate objective known as Cumulative-ACoE (C-ACoE). The core idea of optimizing C-ACoE is utilizing the belief about the underlying true state given the adversarially perturbed observation. Our empirical evaluations demonstrate that our method outperforms current state-of-the-art approaches for addressing adversarial RL problems across all established benchmarks (MuJoCo, Atari, and Highway) used in the literature.

adversarial perturbation, adversary, robustness, (13 more...)

2406.04724

Country:

Asia > Singapore (0.04)
North America > United States > New Jersey > Middlesex County > New Brunswick (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology > Security & Privacy (0.66)
Government > Military (0.48)
Transportation > Ground > Road (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Robust Reward Design for Markov Decision Processes

Wu, Shuo, Ma, Haoxiang, Fu, Jie, Han, Shuo

The problem of reward design examines the interaction between a leader and a follower, where the leader aims to shape the follower's behavior to maximize the leader's payoff by modifying the follower's reward function. Current approaches to reward design rely on an accurate model of how the follower responds to reward modifications, which can be sensitive to modeling inaccuracies. To address this issue of sensitivity, we present a solution that offers robustness against uncertainties in modeling the follower, including 1) how the follower breaks ties in the presence of nonunique best responses, 2) inexact knowledge of how the follower perceives reward modifications, and 3) bounded rationality of the follower. Our robust solution is guaranteed to exist under mild conditions and can be obtained numerically by solving a mixed-integer linear program. Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs.

allocation, follower, optimal interior-point allocation, (15 more...)

2406.05086

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Florida > Alachua County > Gainesville (0.14)
(9 more...)

Genre: Research Report (0.63)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.50)

When and How: Learning Identifiable Latent States for Nonstationary Time Series Forecasting

Li, Zijian, Cai, Ruichu, Yang, Zhenhui, Huang, Haiqin, Chen, Guangyi, Shen, Yifan, Chen, Zhengming, Song, Xiangchen, Zhang, Kun

Temporal distribution shifts are ubiquitous in time series data. One of the most popular methods assumes that the temporal distribution shift occurs uniformly to disentangle the stationary and nonstationary dependencies. But this assumption is difficult to meet, as we do not know when the distribution shifts occur. To solve this problem, we propose to learn IDentifiable latEnt stAtes (IDEA) to detect when the distribution shifts occur. Beyond that, we further disentangle the stationary and nonstationary latent states via sufficient observation assumption to learn how the latent states change. Specifically, we formalize the causal process with environment-irrelated stationary and environment-related nonstationary variables. Under mild conditions, we show that latent environments and stationary/nonstationary variables are identifiable. Based on these theories, we devise the IDEA model, which incorporates an autoregressive hidden Markov model to estimate latent environments and modular prior networks to identify latent states. The IDEA model outperforms several latest nonstationary forecasting methods on various benchmark datasets, highlighting its advantages in real-world scenarios.

forecasting, latent variable, learning identifiable latent state, (14 more...)

2402.12767

Country:

North America > United States > California (0.14)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Asia > China (0.04)
(8 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

arXiv.org Machine LearningJun-7-2024

Reinforcement Learning and Regret Bounds for Admission Control

Weber, Lucas, Bušić, Ana, Zhu, Jiamin

The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific knowledge of the problem structure. In this article, we consider an admission control problem to an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears.

arrival rate, optimal policy, reinforcement learning and regret bound, (12 more...)

arXiv.org Machine Learning

2406.04766

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Miles, Christopher E., Webber, Robert J.

Dynamical mixture modeling with fast, automatic determination of Markov chains

arXiv.org Machine LearningJun-7-2024

Markov state modeling has gained popularity in various scientific fields due to its ability to reduce complex time series data into transitions between a few states. Yet, current frameworks are limited by assuming a single Markov chain describes the data, and they suffer an inability to discern heterogeneities. As a solution, this paper proposes a variational expectation-maximization algorithm that identifies a mixture of Markov chains in a time-series data set. The method is agnostic to the definition of the Markov states, whether data-driven (e.g. by spectral clustering) or based on domain knowledge. Variational EM efficiently and organically identifies the number of Markov chains and dynamics of each chain without expensive model comparisons or posterior sampling. The approach is supported by a theoretical analysis and numerical experiments, including simulated and observational data sets based on ${\tt Last.fm}$ music listening, ultramarathon running, and gene expression. The results show the new algorithm is competitive with contemporary mixture modeling approaches and powerful in identifying meaningful heterogeneities in time series data.

algorithm, markov chain, trajectory, (16 more...)

arXiv.org Machine Learning

2406.04653

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)
Oceania > New Zealand (0.04)
(5 more...)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Deng, Bangchao, Qu, Bingqing, Wang, Pengyang, Yang, Dingqi, Fankhauser, Benjamin, Cudre-Mauroux, Philippe

REPLAY: Modeling Time-Varying Temporal Regularities of Human Mobility for Location Prediction over Sparse Trajectories

arXiv.org Artificial IntelligenceJun-6-2024

Location prediction forecasts a user's location based on historical user mobility traces. To tackle the intrinsic sparsity issue of real-world user mobility traces, spatiotemporal contexts have been shown as significantly useful. Existing solutions mostly incorporate spatiotemporal distances between locations in mobility traces, either by feeding them as additional inputs to Recurrent Neural Networks (RNNs) or by using them to search for informative past hidden states for prediction. However, such distance-based methods fail to capture the time-varying temporal regularities of human mobility, where human mobility is often more regular in the morning than in other periods, for example; this suggests the usefulness of the actual timestamps besides the temporal distances. Against this background, we propose REPLAY, a general RNN architecture learning to capture the time-varying temporal regularities for location prediction. Specifically, REPLAY not only resorts to the spatiotemporal distances in sparse trajectories to search for the informative past hidden states, but also accommodates the time-varying temporal regularities by incorporating smoothed timestamp embeddings using Gaussian weighted averaging with timestamp-specific learnable bandwidths, which can flexibly adapt to the temporal regularities of different strengths across different timestamps. Our extensive evaluation compares REPLAY against a sizable collection of state-of-the-art techniques on two real-world datasets. Results show that REPLAY consistently and significantly outperforms state-of-the-art methods by 7.7\%-10.9\% in the location prediction task, and the bandwidths reveal interesting patterns of the time-varying temporal regularities.

prediction, regularity, timestamp, (15 more...)

2402.1631

Country:

Asia > Macao (0.14)
Europe > Switzerland > Fribourg > Fribourg (0.04)
North America > United States (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Promising Solution (0.54)

Industry: Information Technology (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
(4 more...)

arXiv.org Artificial IntelligenceJun-6-2024

On Limitation of Transformer for Learning HMMs

Hu, Jiachen, Liu, Qinghua, Jin, Chi

Despite the remarkable success of Transformer-based architectures in various sequential modeling tasks, such as natural language processing, computer vision, and robotics, their ability to learn basic sequential models, like Hidden Markov Models (HMMs), is still unclear. This paper investigates the performance of Transformers in learning HMMs and their variants through extensive experimentation and compares them to Recurrent Neural Networks (RNNs). We show that Transformers consistently underperform RNNs in both training speed and testing accuracy across all tested HMM models. There are even challenging HMM instances where Transformers struggle to learn, while RNNs can successfully do so. Our experiments further reveal the relation between the depth of Transformers and the longest sequence length it can effectively learn, based on the types and the complexity of HMMs. To address the limitation of transformers in modeling HMMs, we demonstrate that a variant of the Chain-of-Thought (CoT), called $\textit{block CoT}$ in the training phase, can help transformers to reduce the evaluation error and to learn longer sequences at a cost of increasing the training time. Finally, we complement our empirical findings by theoretical results proving the expressiveness of transformers in approximating HMMs with logarithmic depth.

hmm, sequence, transformer, (15 more...)

2406.04089

Country:

Europe > Greece > Attica > Athens (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

arXiv.org Artificial IntelligenceJun-6-2024

CityLight: A Universal Model Towards Real-world City-scale Traffic Signal Control Coordination

Zeng, Jinwei, Yu, Chao, Yang, Xinyi, Ao, Wenxuan, Yuan, Jian, Li, Yong, Wang, Yu, Yang, Huazhong

Traffic signal control (TSC) is a promising low-cost measure to enhance transportation efficiency without affecting existing road infrastructure. While various reinforcement learning-based TSC methods have been proposed and experimentally outperform conventional rule-based methods, none of them has been deployed in the real world. An essential gap lies in the oversimplification of the scenarios in terms of intersection heterogeneity and road network intricacy. To make TSC applicable in urban traffic management, we target TSC coordination in city-scale high-authenticity road networks, aiming to solve the three unique and important challenges: city-level scalability, heterogeneity of real-world intersections, and effective coordination among intricate neighbor connections. Since optimizing multiple agents in a parameter-sharing paradigm can boost the training efficiency and help achieve scalability, we propose our method, CityLight, based on the well-acknowledged optimization framework, parameter-sharing MAPPO. To ensure the unified policy network can learn to fit large-scale heterogeneous intersections and tackle the intricate between-neighbor coordination, CityLight proposes a universal representation module that consists of two key designs: heterogeneous intersection alignment and neighborhood impact alignment for coordination. To further boost coordination, CityLight adopts neighborhood-integrated rewards to transition from achieving local optimal to global optimal. Extensive experiments on datasets with hundreds to tens of thousands of real-world intersections and authentic traffic demands validate the surprising effectiveness and generalizability of CityLight, with an overall performance gain of 11.66% and a 22.59% improvement in transfer scenarios in terms of throughput.

citylight, dataset, intersection, (16 more...)

2406.02126

Country:

Asia > China > Beijing > Beijing (0.07)
North America > United States > New York > New York County > Manhattan (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)