Hsieh, Ping-Chun


Developing Multi-Task Recommendations with Long-Term Rewards via Policy Distilled Reinforcement Learning

arXiv.org Machine Learning

With the explosive growth of online products and content, recommendation techniques are considered an effective tool to overcome information overload, improve user experience, and boost business revenue. In recent years, we have observed a new desideratum: considering the long-term rewards of multiple related recommendation tasks simultaneously. The consideration of long-term rewards is strongly tied to business revenue and growth. Learning multiple tasks simultaneously can generally improve the performance of individual tasks thanks to the knowledge sharing in multi-task learning. While a few existing works have studied long-term rewards in recommendations, they mainly focus on a single recommendation task. In this paper, we propose PoDiRe, a policy distilled recommender that can address long-term rewards of recommendations and simultaneously handle multiple recommendation tasks. This novel recommendation solution combines deep reinforcement learning and knowledge distillation techniques, which establishes knowledge sharing among different tasks and reduces the size of the learning model. The resulting model is expected to attain better performance and lower response latency for real-time recommendation services. In collaboration with Samsung Game Launcher, one of the world's largest commercial mobile game platforms, we conduct a comprehensive experimental study on large-scale real data with hundreds of millions of events and show that our solution outperforms many state-of-the-art methods in terms of several standard evaluation metrics.


Bandit Learning Through Biased Maximum Likelihood Estimation

arXiv.org Machine Learning

We propose BMLE, a new family of bandit algorithms formulated in a general way based on the Biased Maximum Likelihood Estimation method that originally appeared in the adaptive control literature. We design the cost-bias term to tackle the exploration and exploitation tradeoff for stochastic bandit problems. We provide an explicit closed-form expression for the index of an arm for Bernoulli bandits, which is trivial to compute. We also provide a general recipe for extending the BMLE algorithm to other families of reward distributions. We prove that for Bernoulli bandits, the BMLE algorithm achieves a logarithmic finite-time regret bound and hence attains order-optimality. Through extensive simulations, we demonstrate that the proposed algorithms achieve regret performance comparable to the best of several state-of-the-art baseline methods, while having a significant computational advantage over the other best-performing methods. The generality of the proposed approach makes it possible to address more complex models, including general adaptive control of Markovian systems.
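
To make the notion of an index policy and its regret concrete, the following is a minimal sketch of a Bernoulli bandit simulation. The closed-form BMLE index is not stated in this abstract, so the sketch plugs in the standard UCB1 index as a stand-in; the names `ucb1_index` and `run_index_policy` and the arm means are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1_index(mean, pulls, t):
    """Standard UCB1 index (a stand-in, NOT the BMLE index from the paper)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_index_policy(arm_means, horizon, index_fn, seed=0):
    """Simulate an index policy on a Bernoulli bandit.

    Returns the cumulative pseudo-regret and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull each arm once so every index is defined.
            arm = t - 1
        else:
            # Pull the arm with the largest index.
            arm = max(
                range(k),
                key=lambda a: index_fn(sums[a] / pulls[a], pulls[a], t),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: gap between the best mean and the chosen arm's mean.
        regret += best - arm_means[arm]
    return regret, pulls


if __name__ == "__main__":
    total_regret, counts = run_index_policy([0.3, 0.7], 2000, ucb1_index)
    print("regret:", total_regret, "pulls per arm:", counts)
```

Any other index with the same `(mean, pulls, t)` signature could be passed as `index_fn`, which is how a different index rule would be compared against this baseline under the same simulated arms.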


Streaming Network Embedding through Local Actions

arXiv.org Machine Learning

Recently, considerable research attention has been paid to network embedding, a popular approach to constructing feature vectors of vertices. Due to the curse of dimensionality and sparsity in graphical datasets, this approach has become indispensable for machine learning tasks over large networks. The majority of the existing literature has considered this technique under the assumption that the network is static. However, in many applications, nodes and edges accrue to a growing network as a stream. A small number of very recent results have addressed the problem of embedding for dynamic networks. However, they either rely on knowledge of vertex attributes, suffer from high time complexity, or need to be re-trained without a closed-form expression. Thus, adapting the existing methods to the streaming environment faces non-trivial technical challenges. These challenges motivate developing new approaches to the problem of streaming network embedding. In this paper, we propose a new framework that is able to generate latent features for new vertices with high efficiency and low complexity under specified iteration rounds. We formulate a constrained optimization problem for the modification of the representation resulting from a stream arrival. We show this problem has no closed-form solution and instead develop an online approximation solution. Our solution follows three steps: (1) identify vertices affected by new vertices, (2) generate latent features for new vertices, and (3) update the latent features of the most affected vertices. The generated representations are provably feasible and not far from the optimal ones in terms of expectation. Multi-class classification and clustering on five real-world networks demonstrate that our model can efficiently update vertex representations and simultaneously achieve comparable or even better performance.
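
The three-step structure described above can be illustrated with a deliberately simplified sketch. It is not the paper's optimization-based method: the neighbor-averaging rule, the distance-based notion of "most affected", and the names `add_vertex`, `top_k`, and `step` are all assumptions made only to show how the steps fit together.

```python
from typing import Dict, List


def add_vertex(
    emb: Dict[str, List[float]],
    new_vertex: str,
    neighbors: List[str],
    top_k: int = 1,
    step: float = 0.1,
) -> List[str]:
    """Insert `new_vertex` into `emb` in place; return the vertices updated."""
    # Step 1 (assumption): affected vertices are the already-embedded neighbors.
    affected = [v for v in neighbors if v in emb]
    if not affected:
        raise ValueError("new vertex needs at least one embedded neighbor")
    dim = len(emb[affected[0]])

    # Step 2 (assumption): the new vertex starts at the mean of its neighbors.
    new_vec = [
        sum(emb[v][i] for v in affected) / len(affected) for i in range(dim)
    ]
    emb[new_vertex] = new_vec

    # Step 3 (assumption): "most affected" means closest to the new vertex;
    # only the top_k of those are nudged a small step toward it.
    def sq_dist(v: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(emb[v], new_vec))

    updated = sorted(affected, key=sq_dist)[:top_k]
    for v in updated:
        emb[v] = [a + step * (b - a) for a, b in zip(emb[v], new_vec)]
    return updated


if __name__ == "__main__":
    embeddings = {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [10.0, 10.0]}
    changed = add_vertex(embeddings, "d", ["a", "b"])
    print("updated:", changed, "embeddings:", embeddings)
```

Only the new vertex and a bounded number of its neighbors are touched per arrival, which is the property that keeps a local-update scheme cheap compared with re-training on the whole network.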


Heteroscedastic Bandits with Reneging

arXiv.org Machine Learning

Although contextual bandits have proven useful in many areas as models for sequential decision problems with side observations (contexts), they are subject to two major limitations. First, they neglect the user "reneging" that occurs in real-world applications. That is, users unsatisfied with an interaction quit future interactions forever. Second, they assume that the reward distribution is homoscedastic, which is often invalidated by real-world datasets, e.g., datasets from finance. We propose a novel model of "heteroscedastic contextual bandits with reneging" to overcome the two limitations. Our model allows each user to have a distinct "acceptance level," with any interaction falling short of that level resulting in that user reneging. It also allows the variance to be a function of context. We develop a UCB-type policy, called HR-UCB, and prove that with high probability it achieves $\mathcal{O}\big(\sqrt{T}\,(\log T)^{3/2}\big)$ regret.
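
As a quick numerical reading of the stated bound, the sketch below evaluates the growth rate $\sqrt{T}(\log T)^{3/2}$ with all constants dropped (an assumption, since the abstract gives only the order) and divides it by $T$. The function name `hr_ucb_regret_rate` is hypothetical.

```python
import math


def hr_ucb_regret_rate(T: int) -> float:
    """Growth rate sqrt(T) * (log T)^(3/2) of the bound, constants omitted."""
    return math.sqrt(T) * math.log(T) ** 1.5


if __name__ == "__main__":
    for T in (10**3, 10**5, 10**7):
        rate = hr_ucb_regret_rate(T)
        # rate / T is the per-round average implied by the bound's order.
        print(T, rate, rate / T)
```

The ratio `rate / T` shrinks as `T` grows, which is what a sublinear regret bound means: the average per-round loss relative to the optimal policy vanishes over a long horizon.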