pdis
- North America > United States > Texas (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- North America > United States > California > Santa Clara County (0.14)
- North America > Canada > Alberta (0.14)
- Education (0.68)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.48)
- Government > Regional Government (0.46)
- North America > United States > Texas > Travis County > Austin (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning
Chen, Claire, Liu, Shuze, Zhang, Shangtong
In reinforcement learning, classic on-policy evaluation methods often suffer from high variance and require massive online data to attain the desired accuracy. Previous studies attempt to reduce evaluation variance by searching for or designing proper behavior policies to collect data. However, these approaches ignore the safety of such behavior policies -- the designed behavior policies have no safety guarantee and may lead to severe damage during online executions. In this paper, to address the challenge of reducing variance while ensuring safety simultaneously, we propose an optimal variance-minimizing behavior policy under safety constraints. Theoretically, while ensuring safety constraints, our evaluation method is unbiased and has lower variance than on-policy evaluation. Empirically, our method is the only existing method to achieve both substantial variance reduction and safety constraint satisfaction. Furthermore, we show our method is even superior to previous methods in both variance reduction and execution safety.
- North America > Canada > Alberta (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > New Jersey (0.04)
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
Zhaohan Guo, Philip S. Thomas, Emma Brunskill
Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems. In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling. We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.
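The horizon dependence described in this abstract can be seen in a small sketch of ordinary (trajectory-wise) importance sampling next to the per-decision variant. This is not the paper's options-based or Incremental Importance Sampling algorithm; the two-action toy problem, the policy probabilities, and all function names below are hypothetical and chosen only for illustration.

```python
import random


def ordinary_is(trajectories, target_prob, behavior_prob):
    """Trajectory-wise IS: the whole return is scaled by the product of all ratios."""
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += reward
        total += weight * ret
    return total / len(trajectories)


def per_decision_is(trajectories, target_prob, behavior_prob):
    """Per-decision IS: the reward at step t is scaled only by ratios up to step t."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for state, action, reward in traj:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            total += weight * reward
    return total / len(trajectories)


# Hypothetical toy problem: two actions, reward equals the action taken.
def target_prob(state, action):
    return 0.8 if action == 1 else 0.2


def behavior_prob(state, action):
    return 0.5  # uniform behavior policy over the two actions


def sample_trajectory(horizon, rng):
    """Roll out the uniform behavior policy; the state is just the time step."""
    traj = []
    for t in range(horizon):
        action = rng.randrange(2)
        traj.append((t, action, float(action)))
    return traj


rng = random.Random(0)
horizon = 10
data = [sample_trajectory(horizon, rng) for _ in range(5000)]
print("true value:", horizon * 0.8)
print("ordinary IS:", ordinary_is(data, target_prob, behavior_prob))
print("per-decision IS:", per_decision_is(data, target_prob, behavior_prob))
```

Because each trajectory weight is a product of one ratio per step, its spread grows geometrically with the horizon, which is the scaling problem the abstract refers to. Per-decision weighting shortens the products attached to early rewards but does not remove the effect for late ones.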
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- North America > Canada > Alberta (0.14)
- Education (0.68)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.48)
- Government > Regional Government (0.46)
Doubly Optimal Policy Evaluation for Reinforcement Learning
Liu, Shuze, Chen, Claire, Zhang, Shangtong
Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.
- North America > Canada > Alberta (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > New Jersey (0.04)
Efficient Multi-Policy Evaluation for Reinforcement Learning
Liu, Shuze, Chen, Yuxin, Zhang, Shangtong
To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with manyfold fewer samples outperforms on-policy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.
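The sample-sharing idea can be sketched in a one-step (bandit) setting: a single batch drawn from one behavior policy is reweighted once per target policy. This is only an illustration of sharing samples through importance weights, not the tailored behavior policy the paper designs; the action names, probabilities, and the helper `evaluate_targets` are hypothetical.

```python
def evaluate_targets(samples, behavior, targets):
    """Estimate each target policy's value from one shared batch.

    samples  : list of (action, reward) pairs drawn from `behavior`
    behavior : dict mapping action -> probability under the behavior policy
    targets  : dict mapping policy name -> dict of action probabilities
    """
    estimates = {}
    for name, target in targets.items():
        weighted = [target[a] / behavior[a] * r for a, r in samples]
        estimates[name] = sum(weighted) / len(samples)
    return estimates


# Hypothetical batch whose action frequencies match the behavior policy exactly,
# so the estimates below coincide with the true values of the two targets.
behavior = {"a": 0.25, "b": 0.5, "c": 0.25}
samples = [("a", 1.0), ("b", 2.0), ("b", 2.0), ("c", 5.0)]
targets = {
    "pi_1": {"a": 0.5, "b": 0.3, "c": 0.2},
    "pi_2": {"a": 0.1, "b": 0.1, "c": 0.8},
}
estimates = evaluate_targets(samples, behavior, targets)
print(estimates)
```

Each estimate is unbiased as long as the behavior policy gives positive probability to every action that a target policy can take. How much variance each target incurs depends on the choice of behavior policy, which is the design question the abstract addresses.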
- North America > Canada > Alberta (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach
Fallah, Mohammad Amir, Monemi, Mehdi, Rasti, Mehdi, Latva-Aho, Matti
3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within a very small volume in both the radial and angular domains in the near-field zone. Recently, channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely-large-scale programmable metasurfaces (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing the lengthy training times that result from independent training of subarrays. To achieve a faster CSI-independent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed by a factor of about 5. Furthermore, for dynamic DFP management, we devise a DFP policy blending process, which improves the convergence rate by up to 8-fold.
- Europe > Finland > Northern Ostrobothnia > Oulu (0.04)
- Asia > Middle East > Iran > Fars Province > Shiraz (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Research Report (0.64)
- Instructional Material (0.46)
Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
Lobo, Elita, Singh, Harvineet, Petrik, Marek, Rudin, Cynthia, Lakkaraju, Himabindu
Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive. However, the extent to which such methods can be trusted under adversarial threats to data quality is largely unexplored. In this work, we make the first attempt at investigating the sensitivity of OPE methods to marginal adversarial perturbations to the data. We design a generic data poisoning attack framework leveraging influence functions from robust statistics to carefully construct perturbations that maximize error in the policy value estimates. We carry out extensive experimentation with multiple healthcare and control datasets. Our results demonstrate that many existing OPE methods are highly prone to generating value estimates with large errors when subject to data poisoning attacks, even for small adversarial perturbations. These findings question the reliability of policy values derived using OPE methods and motivate the need for developing OPE methods that are statistically robust to train-time data poisoning attacks.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- North America > United States > New Hampshire (0.04)
- Information Technology > Security & Privacy (0.94)
- Health & Medicine > Therapeutic Area (0.79)
Improving Monte Carlo Evaluation with Offline Data
Monte Carlo (MC) methods are the most widely used methods to estimate the performance of a policy. Given a policy of interest, MC methods give estimates by repeatedly running this policy to collect samples and taking the average of the outcomes. Samples collected during this process are called online samples. To get an accurate estimate, MC methods consume a massive number of online samples. When online samples are expensive, e.g., in online recommendation and inventory management, we want to reduce the number of online samples while achieving the same estimate accuracy. To this end, we use off-policy MC methods that evaluate the policy of interest by running a different policy, called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples to evaluate the performance of a policy compared with the ordinary MC method. Moreover, our off-policy MC estimator is always unbiased.
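A one-step (bandit) sketch shows why a well-chosen behavior policy can lower the variance of an off-policy MC estimate without introducing bias. It relies on the standard importance-sampling fact that, for known deterministic rewards, sampling proportionally to the target probability times the absolute reward minimizes variance. The paper instead learns its behavior policy from offline data for sequential problems, so the known reward table, the action names, and every function below are assumptions made for illustration.

```python
import random
import statistics


def on_policy_samples(target, reward, n, rng):
    """Ordinary MC: run the target policy itself and record the rewards."""
    actions = list(target)
    drawn = rng.choices(actions, weights=[target[a] for a in actions], k=n)
    return [reward[a] for a in drawn]


def off_policy_samples(target, behavior, reward, n, rng):
    """Off-policy MC: run the behavior policy and reweight each reward."""
    actions = list(behavior)
    drawn = rng.choices(actions, weights=[behavior[a] for a in actions], k=n)
    return [target[a] / behavior[a] * reward[a] for a in drawn]


def variance_minimizing_behavior(target, reward):
    """Behavior probabilities proportional to target(a) * |reward(a)|.

    Assumes the reward table is known and deterministic (illustration only).
    """
    unnormalized = {a: target[a] * abs(reward[a]) for a in target}
    total = sum(unnormalized.values())
    return {a: w / total for a, w in unnormalized.items()}


# Hypothetical three-action problem; the true value is 0.5*1 + 0.3*2 + 0.2*5 = 2.1.
target = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 1.0, "b": 2.0, "c": 5.0}
behavior = variance_minimizing_behavior(target, reward)

rng = random.Random(0)
on_policy = on_policy_samples(target, reward, 10000, rng)
off_policy = off_policy_samples(target, behavior, reward, 10000, rng)

print("on-policy  mean/variance:", statistics.fmean(on_policy), statistics.pvariance(on_policy))
print("off-policy mean/variance:", statistics.fmean(off_policy), statistics.pvariance(off_policy))
```

In this idealized case every reweighted sample equals the true value, so a single online sample would suffice. With unknown or stochastic rewards the behavior policy can only be approximated, for example from logged data, and the variance reduction is partial rather than total.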
- North America > Canada > Alberta (0.14)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)