Zhou, Dengyong
Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
Tang, Ziyang, Feng, Yihao, Li, Lihong, Zhou, Dengyong, Liu, Qiang
Infinite horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018a) proposed an approach that significantly reduces the variance of infinite-horizon off-policy evaluation by estimating the stationary density ratio, but at the cost of introducing potentially high biases due to the error in density ratio estimation. In this paper, we develop a bias-reduced augmentation of their method, which can take advantage of a learned value function to obtain higher accuracy. Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimation is perfect. In general, when either of them is accurate, the bias can also be reduced. Both theoretical and empirical results show that our method yields significant advantages over previous methods.
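To make the doubly-robust structure concrete, the display below is an illustrative sketch of how a learned state-density ratio and a learned value function are typically combined in this setting; the symbols $\hat{w}$, $\hat{V}$, $\hat{Q}$, $\mu_0$, and $d_{\pi_0}$ are generic placeholders rather than the paper's exact notation (with $\hat{V}(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}[\hat{Q}(s,a)]$):
$$
\hat{R}_{\mathrm{DR}} \;=\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0}\big[\hat{V}(s_0)\big] \;+\; \mathbb{E}_{(s,a,r,s')\sim d_{\pi_0}}\Big[\hat{w}(s)\,\tfrac{\pi(a\mid s)}{\pi_0(a\mid s)}\big(r+\gamma\hat{V}(s')-\hat{Q}(s,a)\big)\Big].
$$
If $\hat{V}$ and $\hat{Q}$ are the true value functions, the correction term has zero mean and the plug-in term gives the true value; if $\hat{w}$ is the true stationary density ratio, the correction term exactly cancels the plug-in term's error. Either way the bias vanishes, which is the doubly-robust property stated in the abstract.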
Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Liu, Qiang, Li, Lihong, Tang, Ziyang, Zhou, Dengyong
We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.
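As a rough illustration of how such a state-density-ratio estimate is used once it has been learned, here is a minimal self-normalized estimator in Python; the functions w_hat, pi, and pi0 are hypothetical stand-ins for the learned ratio and the two policies, and nothing below is taken from the paper's code:

    import numpy as np

    def estimate_value(states, actions, rewards, w_hat, pi, pi0):
        # w_hat(s): learned stationary state-density ratio d_pi(s) / d_pi0(s)
        # pi(a, s), pi0(a, s): action probabilities under target / behavior policy
        weights = np.array([w_hat(s) * pi(a, s) / pi0(a, s)
                            for s, a in zip(states, actions)])
        # Self-normalized importance-weighted average of the observed rewards
        return np.sum(weights * np.asarray(rewards)) / np.sum(weights)

Because the weights depend only on the current state and action rather than on an entire trajectory, their magnitude does not grow with the horizon, which is the source of the variance reduction described above.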
Neural Phrase-to-Phrase Machine Translation
Feng, Jiangtao, Kong, Lingpeng, Huang, Po-Sen, Wang, Chong, Huang, Da, Mao, Jiayuan, Qiao, Kan, Zhou, Dengyong
In recent years, we have witnessed the surge of neural sequence-to-sequence (seq2seq) models (Bahdanau et al., 2014; Sutskever et al., 2014; Gehring et al., 2017), while training techniques (Vaswani et al., 2017; Ba et al., 2016) keep advancing. Recently, Huang et al. (2018) developed Neural Phrase-based Machine Translation (NPMT). In our model, given the phrase-level attentions, we develop a dictionary-lookup decoding method with an external phrase-to-phrase dictionary, and we show how it avoids the more costly dynamic programming used in NPMT (Huang et al., 2018). As in NPMT, directly computing the segment-marginalized sequence probability is intractable, so we also develop a dynamic programming algorithm to efficiently compute the loss function. (This work was done when Jiangtao and Jiayuan interned at Google.)
Action-dependent Control Variates for Policy Optimization via Stein's Identity
Liu, Hao, Feng, Yihao, Mao, Yi, Zhou, Dengyong, Peng, Jian, Liu, Qiang
Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, they still often suffer from high variance in the policy gradient estimates, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by Stein's identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more general action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of state-of-the-art policy gradient approaches.
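For intuition on how an action-dependent baseline can be introduced without biasing the gradient, the following display is a generic sketch of such an estimator for a reparameterizable policy $a=f_\theta(s,\xi)$ with baseline $\phi(s,a)$; the notation is illustrative, not the paper's exact construction:
$$
\hat{g} \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\nabla_\theta\log\pi_\theta(a_i\mid s_i)\,\big(\hat{Q}(s_i,a_i)-\phi(s_i,a_i)\big) \;+\; \nabla_\theta f_\theta(s_i,\xi_i)^{\top}\nabla_{a}\phi(s_i,a_i)\Big].
$$
In expectation, the second term restores exactly what the baseline subtracts from the score-function term, so the estimator stays unbiased while $\phi$ can be chosen to reduce variance; the connection between the two terms is what Stein's identity provides.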
On the Discrimination-Generalization Tradeoff in GANs
Zhang, Pengchuan, Liu, Qiang, Zhou, Dengyong, Xu, Tao, He, Xiaodong
Generative adversarial training can be generally understood as minimizing a moment-matching loss defined by a set of discriminator functions, typically neural networks. The discriminator set should be large enough to be able to uniquely identify the true distribution (discriminative), and also small enough to go beyond memorizing samples (generalizable). In this paper, we show that a discriminator set is guaranteed to be discriminative whenever its linear span is dense in the set of bounded continuous functions. This is a very mild condition satisfied even by neural networks with a single neuron. Further, we develop generalization bounds between the learned distribution and the true distribution under different evaluation metrics. When evaluated with the neural distance, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set. When evaluated with KL divergence, our bound provides an explanation for the counter-intuitive behaviors of testing likelihood in GAN training. Our analysis sheds light on understanding the practical performance of GANs.
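The moment-matching view mentioned above can be written as an integral probability metric; the following display is a standard formulation given for illustration (the paper's precise definitions may differ in details such as regularization):
$$
d_{\mathcal{F}}(\mu,\nu) \;=\; \sup_{f\in\mathcal{F}}\Big(\mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{x\sim\nu}[f(x)]\Big).
$$
In these terms, the discriminativeness result says that $d_{\mathcal{F}}(\mu,\nu)=0$ forces $\mu=\nu$ whenever the linear span of $\mathcal{F}$ is dense in the bounded continuous functions, while the generalization bounds control how far $d_{\mathcal{F}}$ computed on samples can deviate from its population value.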
Towards Neural Phrase-based Machine Translation
Huang, Po-Sen, Wang, Chong, Huang, Sitao, Zhou, Dengyong, Deng, Li
In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performance on the IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in the output languages.
Sequence Modeling via Segmentations
Wang, Chong, Wang, Yining, Huang, Po-Sen, Mohamed, Abdelrahman, Zhou, Dengyong, Deng, Li
Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts.
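To illustrate the marginalization over segmentations described above, here is a small forward dynamic-programming sketch in Python; log_p_segment is a hypothetical per-segment scoring model (e.g. an RNN), and the interface is mine rather than the paper's:

    import math

    def log_marginal(y, log_p_segment, max_seg_len=None):
        # alpha[t] = log probability of the prefix y[:t], summed over all
        # ways of segmenting that prefix (each segment scored independently).
        T = len(y)
        alpha = [float("-inf")] * (T + 1)
        alpha[0] = 0.0  # the empty prefix has probability 1
        for t in range(1, T + 1):
            start = 0 if max_seg_len is None else max(0, t - max_seg_len)
            terms = [alpha[j] + log_p_segment(y[j:t]) for j in range(start, t)]
            m = max(terms)
            alpha[t] = m + math.log(sum(math.exp(v - m) for v in terms))
        return alpha[T]

The recursion sums over the position of the last segment boundary, so the full sum over exponentially many segmentations is computed exactly in polynomial time; a symmetric backward pass supports gradient computation, which matches the abstract's claim of exact forward and backward computations.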
Provably Optimal Algorithms for Generalized Linear Contextual Bandits
Li, Lihong, Lu, Yu, Zhou, Dengyong
Contextual bandits are widely used in Internet services, from news recommendation to advertising to Web search. Generalized linear models (logistic regression in particular) have demonstrated stronger performance than linear models in many applications where rewards are binary. However, most theoretical analyses of contextual bandits so far are on linear bandits. In this work, we propose an upper-confidence-bound-based algorithm for generalized linear contextual bandits, which achieves an $\tilde{O}(\sqrt{dT})$ regret over $T$ rounds with $d$-dimensional feature vectors. This regret matches the minimax lower bound, up to logarithmic terms, and improves on the best previous result by a $\sqrt{d}$ factor, assuming the number of arms is fixed. A key component in our analysis is to establish a new, sharp finite-sample confidence bound for maximum-likelihood estimates in generalized linear models, which may be of independent interest. We also analyze a simpler upper confidence bound algorithm, which is useful in practice, and prove it to have optimal regret for certain cases.
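As a rough sketch of the optimistic arm-selection rule this family of algorithms uses, the snippet below scores each arm by its estimated linear index plus a confidence width; theta_hat, V_inv, and the width multiplier alpha are placeholders of my own, and the constants in the paper's analysis differ:

    import numpy as np

    def choose_arm(theta_hat, V_inv, arm_features, alpha):
        # theta_hat: current maximum-likelihood estimate of the GLM parameter
        # V_inv: inverse of the design matrix sum_t x_t x_t^T
        # arm_features: list of d-dimensional feature vectors, one per arm
        scores = [x @ theta_hat + alpha * np.sqrt(x @ V_inv @ x)
                  for x in arm_features]
        return int(np.argmax(scores))

Because the link function of a generalized linear model is monotone, ranking arms by this linear index is equivalent to ranking them by their optimistic predicted rewards, which is why a single confidence width per arm suffices.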