Wang, Liwei
Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game
Xiong, Wei, Zhong, Han, Shi, Chengshuai, Shen, Cong, Wang, Liwei, Zhang, Tong
Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose a new pessimism-based algorithm for offline linear MDPs. At the core of our algorithm is an uncertainty decomposition via a reference function, which is new to the literature on offline RL with linear function approximation. Theoretical analysis demonstrates that our algorithm matches the performance lower bound up to logarithmic factors. We also extend our techniques to two-player zero-sum Markov games (MGs) and establish a new performance lower bound for MGs, which tightens the existing result and verifies the near minimax optimality of the proposed algorithm. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.
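For context, pessimism in linear MDPs is typically instantiated by subtracting an elliptical-potential bonus from a ridge-regression estimate of the Bellman backup. The display below is a generic template of that construction, not the paper's exact algorithm: the notation ($\Lambda_h$, $\widehat{w}_h$, $\beta$, dataset of $K$ trajectories) is standard but chosen here for illustration, and the paper's reference-function decomposition concerns how sharply $\beta$ can be set.

$$
\Lambda_h = \sum_{\tau=1}^{K} \phi(s_h^\tau, a_h^\tau)\phi(s_h^\tau, a_h^\tau)^\top + \lambda I, \qquad
\widehat{w}_h = \Lambda_h^{-1} \sum_{\tau=1}^{K} \phi(s_h^\tau, a_h^\tau)\bigl[r_h^\tau + \widehat{V}_{h+1}(s_{h+1}^\tau)\bigr],
$$
$$
\widehat{Q}_h(s,a) = \min\Bigl\{\phi(s,a)^\top \widehat{w}_h - \beta\,\|\phi(s,a)\|_{\Lambda_h^{-1}},\ H-h+1\Bigr\}^{+}, \qquad
\widehat{V}_h(s) = \max_a \widehat{Q}_h(s,a).
$$

The subtracted bonus ensures that, with high probability, the estimated values never exceed the true ones, which is what makes the suboptimality controllable under partial dataset coverage.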
Rethinking the Expressive Power of GNNs via Graph Biconnectivity
Zhang, Bohang, Luo, Shengjie, Wang, Liwei, He, Di
Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
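To illustrate how cheaply the biconnectivity information itself can be computed (independently of any GNN), the sketch below uses networkx, whose articulation-point, bridge, and biconnected-component routines run in linear time via depth-first search; the toy graph is my own example, not one from the paper.

```python
import networkx as nx

# A 5-cycle with a pendant edge: node 0 is a cut vertex, edge (0, 5) is a bridge.
G = nx.cycle_graph(5)
G.add_edge(0, 5)

# Cut vertices (articulation points): removing one disconnects the graph.
print(sorted(nx.articulation_points(G)))                   # [0]

# Bridges (cut edges): removing one disconnects the graph.
print(sorted(nx.bridges(G)))                                # [(0, 5)]

# Biconnected components: maximal subgraphs containing no cut vertex.
print([sorted(c) for c in nx.biconnected_components(G)])    # e.g. [[0, 1, 2, 3, 4], [0, 5]]
```

The paper's observation is that, although this information is linear-time computable, most WL-style GNNs provably cannot recover it.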
Learning Physics-Informed Neural Networks without Stacked Back-propagation
He, Di, Li, Shanda, Shi, Wenlei, Gao, Xiaotian, Zhang, Jia, Bian, Jiang, Wang, Liwei, Liu, Tie-Yan
Physics-Informed Neural Network (PINN) has become a commonly used machine learning approach to solve partial differential equations (PDEs). However, when facing high-dimensional second-order PDE problems, PINN suffers from severe scalability issues, since its loss includes second-order derivatives whose computational cost grows with the dimension during stacked back-propagation. In this work, we develop a novel approach that can significantly accelerate the training of Physics-Informed Neural Networks. In particular, we parameterize the PDE solution by a Gaussian-smoothed model and show that, by virtue of Stein's Identity, the second-order derivatives can be efficiently calculated without back-propagation. We further discuss the model capacity and provide variance reduction methods to address key limitations in the derivative estimation. Experimental results show that our proposed method achieves competitive error compared to standard PINN training while being significantly faster. Our code is released at https://github.com/LithiumDA/PINN-without-Stacked-BP.
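To make the back-propagation-free derivative estimate concrete, here is a minimal Monte Carlo sketch for a Gaussian-smoothed function $f_\sigma(x)=\mathbb{E}_{\delta\sim\mathcal{N}(0,\sigma^2 I)}[f(x+\delta)]$, using the standard Stein-type identities $\nabla f_\sigma(x)=\mathbb{E}[\delta f(x+\delta)]/\sigma^2$ and $\nabla^2 f_\sigma(x)=\mathbb{E}[(\delta\delta^\top-\sigma^2 I)\,f(x+\delta)]/\sigma^4$. The toy function, sample size, and the simple control variate are my own choices, not the paper's variance-reduction scheme.

```python
import numpy as np

def smoothed_derivatives(f, x, sigma=0.1, n_samples=100_000, seed=None):
    """Estimate grad and Hessian of the Gaussian-smoothed f at x, with no autodiff."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    delta = rng.normal(scale=sigma, size=(n_samples, d))
    # Subtracting f(x) is a simple control variate; both estimators remain unbiased.
    vals = f(x + delta) - f(x)                               # shape (n_samples,)
    grad = (delta * vals[:, None]).mean(axis=0) / sigma**2
    outer = delta[:, :, None] * delta[:, None, :]            # (n_samples, d, d)
    hess = ((outer - sigma**2 * np.eye(d)) * vals[:, None, None]).mean(axis=0) / sigma**4
    return grad, hess

# Toy check on f(x) = ||x||^2: the smoothed gradient is 2x and the Hessian is 2I.
f = lambda z: (z**2).sum(axis=-1)
g, H = smoothed_derivatives(f, np.array([1.0, -0.5]), sigma=0.1)
print(g)   # approximately [ 2.0, -1.0 ]
print(H)   # approximately 2 * identity
```

Only forward evaluations of f appear, which is the point: second-order information is obtained without any stacked back-propagation through the network.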
Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret
Zhong, Han, Hu, Jiachen, Xue, Yecheng, Li, Tongyang, Wang, Liwei
While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding is limited. In particular, it remains elusive how to design provably efficient quantum RL algorithms that can address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which is capable of handling problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of UCRL/UCRL-VTR algorithms in classical RL, which also leverage a novel combination of lazy updating mechanisms and quantum estimation subroutines. This is the key to breaking the $\Omega(\sqrt{T})$-regret barrier in classical RL. To the best of our knowledge, this is the first work studying the online exploration in quantum RL with provable logarithmic worst-case regret.
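An informal way to see where logarithmic regret can come from (my own back-of-the-envelope summary, not a claim from the paper): classical mean estimation of a bounded quantity to accuracy $\epsilon$ requires $\Theta(1/\epsilon^2)$ samples, whereas quantum mean estimation requires only $\widetilde{O}(1/\epsilon)$ queries. With lazy, epoch-based updates, the estimation error after $t$ episodes then shrinks like $1/t$ rather than $1/\sqrt{t}$, so the usual regret accumulation changes from
$$
\sum_{t=1}^{T} \frac{1}{\sqrt{t}} = O(\sqrt{T}) \qquad \text{to} \qquad \sum_{t=1}^{T} \frac{1}{t} = O(\log T).
$$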
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling
Tian, Keyu, Jiang, Yi, Diao, Qishuai, Lin, Chen, Wang, Liwei, Yuan, Zehuan
We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets): (i) the convolution operation cannot handle irregular, randomly masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with the convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method, called Sparse masKed modeling (SparK), is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of the learned features. We also observe favorable scaling behavior, with larger gains on larger networks. All this evidence reveals a promising future of generative pre-training on convnets. The pretrain-finetune paradigm in natural language processing (NLP), popularized by BERT (Devlin et al., 2018; Dong et al., 2019; Clark et al., 2020) and GPT (Radford et al., 2019; Brown et al., 2020), is remarkably effective and has long been envied by the vision community. It is the emerging masked image modeling (Bao et al., 2021; He et al., 2021; Xie et al., 2021; Chen et al., 2022) that initially extended the success of BERT from language transformers to vision transformers (ViTs).
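The masking-as-sparse-points idea can be illustrated with a few lines of tensor code. The sketch below shows only the bookkeeping step of SparK-style masking, i.e. dropping masked patches and collecting the surviving pixels with their coordinates in the format a sparse-convolution library (e.g. the spconv/MinkowskiEngine family) would consume. Patch size, mask ratio, and tensor names are illustrative choices, not the paper's exact configuration.

```python
import torch

def mask_to_sparse_points(images, patch=32, mask_ratio=0.6, generator=None):
    """Randomly mask square patches and return unmasked pixels as sparse points.

    images: (B, C, H, W) float tensor.
    Returns the pixel-level keep mask and, per image, (coords, feats):
    coords are (N, 2) pixel indices of surviving pixels, feats their (N, C) values.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    # One keep/drop decision per patch, as in BERT-style masking of image patches.
    keep = torch.rand(B, gh, gw, generator=generator) >= mask_ratio              # (B, gh, gw)
    pixel_keep = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)    # (B, H, W)
    out = []
    for b in range(B):
        coords = pixel_keep[b].nonzero(as_tuple=False)         # (N, 2) row/col indices
        feats = images[b, :, coords[:, 0], coords[:, 1]].T     # (N, C) surviving pixels
        out.append((coords, feats))
    return pixel_keep, out

imgs = torch.randn(2, 3, 224, 224)
pixel_keep, sparse = mask_to_sparse_points(imgs)
print(pixel_keep.float().mean())                 # roughly 1 - mask_ratio
print(sparse[0][0].shape, sparse[0][1].shape)    # (N, 2) and (N, 3)
```

A sparse convolutional encoder then operates only on these surviving positions, which is what lets a convnet consume the irregular, randomly masked input.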
Is $L^2$ Physics-Informed Loss Always Suitable for Training Physics-Informed Neural Network?
Wang, Chuwei, Li, Shanda, He, Di, Wang, Liwei
The Physics-Informed Neural Network (PINN) approach is a new and promising way to solve partial differential equations using deep learning. The $L^2$ Physics-Informed Loss is the de facto standard for training Physics-Informed Neural Networks. In this paper, we challenge this common practice by investigating the relationship between the loss function and the approximation quality of the learned solution. In particular, we leverage the concept of stability from the partial differential equation literature to study the asymptotic behavior of the learned solution as the loss approaches zero. With this concept, we study an important class of high-dimensional non-linear PDEs in optimal control, the Hamilton-Jacobi-Bellman (HJB) equation, and prove that for a general $L^p$ Physics-Informed Loss, a wide class of HJB equations is stable only if $p$ is sufficiently large. Therefore, the commonly used $L^2$ loss is not suitable for training PINNs on those equations, while the $L^{\infty}$ loss is a better choice. Based on this theoretical insight, we develop a novel PINN training algorithm, in a spirit similar to adversarial training, that minimizes the $L^{\infty}$ loss for HJB equations. The effectiveness of the proposed algorithm is empirically demonstrated through experiments. Our code is released at https://github.com/LithiumDA/L_inf-PINN.
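A minimal sketch of what minimizing an $L^{\infty}$ physics-informed loss "in the spirit of adversarial training" can look like: an inner loop performs gradient ascent on the squared PDE residual over the collocation points to find near worst-case points, and the outer loop descends on the largest residual at those points. The residual function, step sizes, and loop counts below are placeholders, and this is a hedged illustration rather than the authors' exact algorithm.

```python
import torch

def train_linf_pinn(model, residual_fn, domain_lo, domain_hi,
                    n_points=256, inner_steps=10, inner_lr=1e-2,
                    outer_steps=1000, outer_lr=1e-3):
    """Adversarial-style L-infinity training: inner max over collocation points,
    outer min over network weights.

    residual_fn(model, x) must return the pointwise PDE residual r(x; theta) and be
    differentiable in x (i.e. build any internal derivatives with create_graph=True).
    """
    opt = torch.optim.Adam(model.parameters(), lr=outer_lr)
    d = domain_lo.numel()
    for _ in range(outer_steps):
        # Start from uniform samples in the box [domain_lo, domain_hi] ...
        x = domain_lo + (domain_hi - domain_lo) * torch.rand(n_points, d)
        x.requires_grad_(True)
        # ... then ascend on the squared residual to find near worst-case points.
        for _ in range(inner_steps):
            (grad_x,) = torch.autograd.grad(residual_fn(model, x).pow(2).sum(), x)
            with torch.no_grad():
                x = torch.min(torch.max(x + inner_lr * grad_x.sign(), domain_lo), domain_hi)
            x.requires_grad_(True)
        # Outer step: minimize the largest residual found by the inner loop.
        opt.zero_grad()
        residual_fn(model, x).abs().max().backward()
        opt.step()
    return model
```

Taking the max over the adversarially chosen points is what approximates the $L^{\infty}$ norm of the residual, in contrast to the usual mean-squared ($L^2$) loss over randomly sampled points.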
Discrimination, calibration, and point estimate accuracy of GRU-D-Weibull architecture for real-time individualized endpoint prediction
Ruan, Xiaoyang, Wang, Liwei, Mai, Michelle, Thongprayoon, Charat, Cheungpasitporn, Wisit, Liu, Hongfang
Real-time individualized endpoint prediction has always been a challenging task, but one of great clinical utility for both patients and healthcare providers. GRU-D-Weibull achieves a maximum C-index of 0.77 at 4.3 years of follow-up, compared to 0.68 achieved by competing models. The average absolute L1-loss of GRU-D-Weibull is around one year, with a minimum of 40% Parkes' serious error after the index date. GRU-D-Weibull is not calibrated and significantly underestimates the true survival probability. Feature importance tests indicate that blood pressure becomes increasingly important during follow-up, while eGFR and blood albumin become less important. Most continuous features have a non-linear/parabolic impact on the predicted survival time, and the results are generally consistent with existing knowledge. As a semi-parametric temporal model, GRU-D-Weibull shows advantages in its built-in parameterization of missingness, native support for asynchronously arriving measurements, the ability to output both probability and point estimates at arbitrary time points for arbitrary prediction horizons, and improved discrimination and point-estimate accuracy after incorporating newly arrived data. Further research on its performance with more comprehensive input features, as well as on in-process or post-process calibration, is warranted to benefit CKD4 and similar terminally ill patients.
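The ability to output "both probability and point estimates at arbitrary time points" comes from parameterizing the time-to-endpoint with a Weibull distribution. As a hedged illustration (the mapping from the recurrent hidden state to the Weibull parameters below is a generic choice, not the paper's exact head), the survival probability and a natural point estimate are closed-form functions of the predicted scale $\lambda$ and shape $k$.

```python
import torch
import torch.nn as nn

class WeibullHead(nn.Module):
    """Map a recurrent hidden state to Weibull (scale, shape) parameters."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, h):
        raw = self.fc(h)
        lam = nn.functional.softplus(raw[..., 0]) + 1e-6   # scale > 0
        k = nn.functional.softplus(raw[..., 1]) + 1e-6     # shape > 0
        return lam, k

def survival_prob(lam, k, t):
    """S(t) = P(T > t) = exp(-(t / lambda)^k), valid for any horizon t >= 0."""
    return torch.exp(-(t / lam) ** k)

def median_time(lam, k):
    """Point estimate: the median survival time lambda * ln(2)^(1/k)."""
    return lam * torch.log(torch.tensor(2.0)) ** (1.0 / k)

h = torch.randn(4, 64)                                # e.g. GRU-D hidden states for 4 patients
lam, k = WeibullHead(64)(h)
print(survival_prob(lam, k, torch.tensor(2.0)))       # survival probability at t = 2
print(median_time(lam, k))                            # per-patient point estimate
```

Because $S(t)$ is available in closed form, any prediction horizon can be queried from the same two predicted parameters, and the point estimate updates whenever a new measurement refreshes the hidden state.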
Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets
Zhong, Han, Xiong, Wei, Tan, Jiyuan, Wang, Liwei, Zhang, Tong, Wang, Zhaoran, Yang, Zhuoran
We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state spaces, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption of uniform coverage of the dataset. We also prove an information-theoretic lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
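Schematically, and in shorthand notation of my own rather than the paper's, the two-sided pessimistic construction looks as follows: the max-player's value estimate is biased downward and the min-player's upward by an uncertainty quantifier $\Gamma_h$, and the output policies solve the matrix-game equilibria of these two estimates at every state.

$$
\underline{Q}_h(s,a,b) = \bigl(\widehat{\mathcal{B}}_h \underline{V}_{h+1}\bigr)(s,a,b) - \Gamma_h(s,a,b), \qquad
\overline{Q}_h(s,a,b) = \bigl(\widehat{\mathcal{B}}_h \overline{V}_{h+1}\bigr)(s,a,b) + \Gamma_h(s,a,b),
$$
$$
\widehat{\pi}_h(\cdot \mid s) \in \arg\max_{\pi} \min_{\nu}\ \mathbb{E}_{a\sim\pi,\, b\sim\nu}\bigl[\underline{Q}_h(s,a,b)\bigr], \qquad
\widehat{\nu}_h(\cdot \mid s) \in \arg\min_{\nu} \max_{\pi}\ \mathbb{E}_{a\sim\pi,\, b\sim\nu}\bigl[\overline{Q}_h(s,a,b)\bigr],
$$

where $\widehat{\mathcal{B}}_h$ denotes an estimated Bellman backup built from the offline data and $\Gamma_h$ a data-driven bonus; each player is thus pessimistic with respect to its own objective.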
Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
Wu, Tianhao, Yang, Yunchang, Zhong, Han, Wang, Liwei, Du, Simon S., Jiao, Jiantao
Reinforcement Learning (RL) has achieved phenomenal successes in solving complex sequential decision-making problems (Silver et al., 2016, 2017; Levine et al., 2016; Gu et al., 2017). Most of these empirical successes are driven by policy-based (policy optimization) methods, such as policy gradient (Sutton et al., 1999), natural policy gradient (NPG) (Kakade, 2001), trust region policy optimization (TRPO) (Schulman et al., 2015), and proximal policy optimization (PPO) (Schulman et al., 2017). For example, Haarnoja et al. (2018) proposed a policy-based state-of-the-art reinforcement learning algorithm, soft actor-critic (SAC), which outperformed value-based methods on a variety of real-world robotics tasks including manipulation and locomotion. In fact, Kalashnikov et al. (2018) observed that, compared with value-based methods such as Q-learning, policy-based methods work better with dense rewards, while for sparse-reward cases in robotics, value-based methods perform better. Motivated by this, a line of recent work (Fazel et al., 2018; Bhandari and Russo, 2019; Liu et al., 2019; Wang et al., 2019; Agarwal et al., 2021) provides global convergence guarantees for these popular policy-based methods. However, to achieve this goal, they make several assumptions.
Non-convex Distributionally Robust Optimization: Non-asymptotic Analysis
Jin, Jikai, Zhang, Bohang, Wang, Haiyang, Wang, Liwei
Distributionally robust optimization (DRO) is a widely used approach to learning models that are robust against distribution shift. Compared with the standard optimization setting, the objective function in DRO is more difficult to optimize, and most of the existing theoretical results make strong assumptions on the loss function. In this work, we bridge this gap by studying DRO algorithms for general smooth non-convex losses. By carefully exploiting the specific form of the DRO objective, we are able to provide non-asymptotic convergence guarantees even though the objective function is possibly non-convex, non-smooth, and has unbounded gradient noise. In particular, we prove that a special algorithm, mini-batch normalized gradient descent with momentum, can find an $\epsilon$-first-order stationary point within $O(\epsilon^{-4})$ gradient complexity. We also discuss the conditional value-at-risk (CVaR) setting, where we propose a penalized DRO objective based on a smoothed version of the CVaR that allows us to obtain a similar convergence guarantee. Finally, we verify our theoretical results on a number of tasks and find that the proposed algorithm can consistently achieve notable acceleration.
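The analyzed optimizer is easy to state. Below is a generic sketch of mini-batch normalized gradient descent with momentum, in which the update direction is the momentum buffer scaled to unit norm; the loss closure, hyperparameters, and toy usage are placeholders, and the DRO objective itself (e.g. a smoothed-CVaR penalty) would be supplied by the user rather than being shown here.

```python
import torch

def normalized_momentum_sgd(params, grad_fn, steps=1000, lr=1e-2, beta=0.9, eps=1e-12):
    """Mini-batch normalized gradient descent with momentum.

    grad_fn() should draw a mini-batch and return one stochastic gradient tensor per
    parameter (e.g. gradients of a penalized, smoothed-CVaR DRO objective).
    """
    momenta = [torch.zeros_like(p) for p in params]
    for _ in range(steps):
        grads = grad_fn()
        with torch.no_grad():
            for m, g in zip(momenta, grads):
                m.mul_(beta).add_(g, alpha=1.0 - beta)                # momentum buffer
            # Normalizing by the momentum norm keeps the step length ~lr regardless of
            # gradient scale, which is how unbounded gradient noise is kept in check.
            norm = torch.sqrt(sum(m.pow(2).sum() for m in momenta)) + eps
            for p, m in zip(params, momenta):
                p.sub_(lr * m / norm)
    return params

# Toy usage: a noisy quadratic whose minimizer is the origin.
w = torch.randn(10)
normalized_momentum_sgd([w], lambda: [2 * w + 0.1 * torch.randn(10)], steps=500)
print(w.norm())   # small
```

The normalization is the key design choice: the step length never depends on the (possibly unbounded) magnitude of the stochastic gradients, only on their direction averaged through the momentum buffer.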