Wang, Xingxing
Safe Offline Reinforcement Learning with Real-Time Budget Constraints
Lin, Qian, Tang, Bo, Wu, Zifan, Yu, Chao, Mao, Shangqin, Xie, Qianlong, Wang, Xingxing, Wang, Dong
Aiming at promoting the safe real-world deployment of Reinforcement Learning (RL), research on safe RL has made significant progress in recent years. However, most existing works in the literature still focus on the online setting, where risky violations of the safety budget are likely to be incurred during training. Besides, in many real-world applications, the learned policy is required to respond to dynamically determined safety budgets (i.e., constraint thresholds) in real time. In this paper, we target the above real-time budget constraint problem under the offline setting.

Many safe RL approaches have been proposed in the past few years (Achiam et al., 2017; Zhang et al., 2020; Sootla et al., 2022; Liu et al., 2022a). Unfortunately, most existing approaches target only the online setting, where potentially risky constraint violations can be incurred during interactions with the real environment. As a data-driven method, offline RL (Levine et al., 2020) aims to derive a policy from offline data without further real-world exploration, and is thus particularly suitable for safety-critical applications. Despite the recent progress in the offline RL literature (Fujimoto et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021), however, there are still limited works focusing on attaining a safe policy under the offline setting.
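The entry above poses the problem without naming the paper's algorithm, so the following is only a rough, hypothetical illustration of the setup rather than the authors' method: conditioning the policy on a safety budget scalar lets one trained model respond to dynamically determined constraint thresholds at deployment time. All class names, shapes, and dimensions here are invented for exposition.

# Hypothetical sketch of a budget-conditioned policy; not the paper's method.
import torch
import torch.nn as nn

class BudgetConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),  # +1 input for the budget scalar
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        # Concatenating the budget to the state lets the action choice
        # tighten or relax as the constraint threshold changes in real time.
        return self.net(torch.cat([state, budget.unsqueeze(-1)], dim=-1))

policy = BudgetConditionedPolicy(state_dim=8, action_dim=2)
states = torch.randn(4, 8)                     # batch of states from offline data
budgets = torch.tensor([0.5, 1.0, 5.0, 10.0])  # per-sample constraint thresholds
actions = policy(states, budgets)              # same states, budget-dependent actions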
MDDL: A Framework for Reinforcement Learning-based Position Allocation in Multi-Channel Feed
Shi, Xiaowen, Wang, Ze, Cai, Yuanying, Wu, Xiaoxu, Yang, Fan, Liao, Guogang, Wang, Yongkang, Wang, Xingxing, Wang, Dong
Nowadays, the mainstream approach in position allocation systems is to utilize a reinforcement learning (RL) model to allocate appropriate locations for items in various channels and then mix them into the feed. Two types of data are employed to train the RL model for position allocation: strategy data and random data. Strategy data is collected from the current online model; it suffers from an imbalanced distribution of state-action pairs, resulting in severe overestimation problems during training. Random data, on the other hand, offers a more uniform distribution of state-action pairs, but is challenging to obtain in industrial scenarios because random exploration could negatively impact platform revenue and user experience. As the two types of data have different distributions, designing an effective strategy that leverages both to enhance RL model training is a highly challenging problem. In this study, we propose a framework named Multi-Distribution Data Learning (MDDL) to address the challenge of effectively utilizing both strategy and random data for training RL models on mixed multi-distribution data. Specifically, MDDL incorporates a novel imitation learning signal to mitigate overestimation problems on strategy data and maximizes the RL signal on random data to facilitate effective learning. In our experiments, we evaluated the proposed MDDL framework in a real-world position allocation system and demonstrated its superior performance compared to the previous baseline. MDDL has been fully deployed on the Meituan food delivery platform and currently serves over 300 million users.
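As a hedged sketch of how such a mixed-signal objective might look (an assumption-level illustration, not Meituan's MDDL implementation; the function name, toy network, and shapes are invented), each sample can be routed to an imitation loss if it comes from strategy data and to an ordinary TD loss if it comes from random data:

# Illustrative only: imitation signal on strategy data, RL (TD) signal on random data.
import torch
import torch.nn.functional as F

def mixed_distribution_loss(q_net, states, actions, targets, is_strategy):
    """is_strategy: boolean mask, True for samples logged by the online policy."""
    q_all = q_net(states)                            # (batch, num_actions)
    # Imitation signal: pull the network toward the logged action to damp
    # overestimation on the imbalanced strategy data.
    imitation = F.cross_entropy(q_all, actions, reduction="none")
    # RL signal: TD regression toward bootstrapped targets, applied on the
    # more uniformly distributed random data.
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    td = F.mse_loss(q_taken, targets, reduction="none")
    return torch.where(is_strategy, imitation, td).mean()

q_net = torch.nn.Linear(8, 5)               # toy Q-network: state -> Q-value per action
states = torch.randn(16, 8)
actions = torch.randint(0, 5, (16,))
targets = torch.randn(16)                   # precomputed bootstrapped TD targets
is_strategy = torch.rand(16) < 0.7          # strategy data dominates the mixture
loss = mixed_distribution_loss(q_net, states, actions, targets, is_strategy)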
PIER: Permutation-Level Interest-Based End-to-End Re-ranking Framework in E-commerce
Shi, Xiaowen, Yang, Fan, Wang, Ze, Wu, Xiaoxu, Guan, Muzhi, Liao, Guogang, Wang, Yongkang, Wang, Xingxing, Wang, Dong
Re-ranking has drawn increasing attention in both academia and industry; it rearranges the ranking list by modeling the mutual influence among items to better meet users' demands. Many existing re-ranking methods directly take the initial ranking list as input and generate the optimal permutation through a well-designed context-wise model, which brings the evaluation-before-reranking problem. Meanwhile, evaluating all candidate permutations incurs unacceptable computational costs in practice. Thus, to better balance efficiency and effectiveness, online systems usually adopt a two-stage architecture: heuristic methods such as beam search first generate a suitable number of candidate permutations, which are then fed into an evaluation model to obtain the optimal permutation. However, existing methods in both stages can be improved in the following aspects. In the generation stage, heuristic methods use only point-wise prediction scores and lack an effective judgment of permutation quality. In the evaluation stage, most existing context-wise evaluation models consider only the item context and lack finer-grained feature context modeling. This paper presents a novel end-to-end re-ranking framework named PIER that tackles the above challenges; it still follows the two-stage architecture and contains two main modules named FPSM and OCPM. We apply SimHash in FPSM to efficiently select top-K candidates from the full permutation space based on the user's permutation-level interest. Then we design a novel omnidirectional attention mechanism in OCPM to capture the context information in the permutation. Finally, we jointly train these two modules end-to-end by introducing a comparative learning loss. Offline experiment results demonstrate that PIER outperforms baseline models on both public and industrial datasets, and we have successfully deployed PIER on the Meituan food delivery platform.
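The SimHash step in FPSM can be illustrated generically. In the sketch below (hyperparameters, shapes, and the Hamming-distance selection are assumptions, not PIER's published design), embeddings are hashed into binary signatures so that top-K candidates close to the user's permutation-level interest are selected with cheap bit comparisons instead of scoring the full permutation space:

# Generic SimHash sketch; all shapes and hyperparameters are assumptions.
import numpy as np

def simhash(vectors: np.ndarray, planes: np.ndarray) -> np.ndarray:
    # The sign of the projection onto each random hyperplane gives one bit.
    return (vectors @ planes.T > 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, bits, n_perms, k = 32, 16, 1000, 8
planes = rng.standard_normal((bits, dim))          # shared random hyperplanes
perm_embs = rng.standard_normal((n_perms, dim))    # candidate permutation embeddings
user_emb = rng.standard_normal((1, dim))           # user's permutation-level interest

perm_sig = simhash(perm_embs, planes)
user_sig = simhash(user_emb, planes)
hamming = (perm_sig != user_sig).sum(axis=1)       # distance in hash space
top_k = np.argsort(hamming)[:k]                    # cheap top-K candidate selection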
A Deep Behavior Path Matching Network for Click-Through Rate Prediction
Dong, Jian, Yu, Yisong, Zhang, Yapeng, Lv, Yimin, Wang, Shuli, Jin, Beihong, Wang, Yongkang, Wang, Xingxing, Wang, Dong
User behaviors on an e-commerce app not only contain different kinds of feedback on items but also sometimes imply the cognitive clues behind the user's decision-making. To understand the psychological procedure behind user decisions, we present the behavior path and propose to match the user's current behavior path with historical behavior paths to predict user behaviors on the app. Further, we design a deep neural network for behavior path matching and address three difficulties in modeling behavior paths: sparsity, noise interference, and accurate matching. In particular, we leverage contrastive learning to augment user behavior paths, use behavior path self-activation to alleviate the effect of noise, and adopt a two-level matching mechanism to identify the most appropriate candidate. Our model shows excellent performance on two real-world datasets, outperforming state-of-the-art CTR models. Moreover, our model has been deployed on the Meituan food delivery platform, where it has accumulated a 1.6% improvement in CTR and a 1.8% improvement in advertising revenue.
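A hedged sketch of the contrastive-augmentation idea follows; random masking and the InfoNCE loss are stand-ins chosen for illustration, since the abstract does not spell out the paper's augmentations or loss. Two noisy views of each behavior path are encoded and pulled together, giving sparse paths extra training signal:

# Illustrative contrastive augmentation of behavior paths; not the paper's exact recipe.
import torch
import torch.nn.functional as F

def mask_path(path: torch.Tensor, drop: float = 0.3) -> torch.Tensor:
    # Randomly zero out behaviors to simulate sparsity and noise.
    keep = (torch.rand(path.shape[:2], device=path.device) > drop).float()
    return path * keep.unsqueeze(-1)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau               # pairwise similarities across the batch
    labels = torch.arange(z1.size(0))      # matched views are the positives
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(10 * 16, 32))
paths = torch.randn(8, 10, 16)             # batch of behavior paths (len 10, dim 16)
loss = info_nce(encoder(mask_path(paths)), encoder(mask_path(paths)))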
Decision-Making Context Interaction Network for Click-Through Rate Prediction
Li, Xiang, Chen, Shuwei, Dong, Jian, Zhang, Jin, Wang, Yongkang, Wang, Xingxing, Wang, Dong
Click-through rate (CTR) prediction is crucial in recommendation and online advertising systems. Existing methods usually model user behaviors while ignoring the informative context in which the user makes a click decision, e.g., the click pages and pre-ranking candidates that inform inferences about user interests, leading to suboptimal performance. In this paper, we propose a Decision-Making Context Interaction Network (DCIN), which deploys a carefully designed Context Interaction Unit (CIU) to learn decision-making contexts and thus benefit CTR prediction. In addition, the relationship between different decision-making context sources is explored by the proposed Adaptive Interest Aggregation Unit (AIAU) to further improve CTR prediction. In experiments on public and industrial datasets, DCIN significantly outperforms state-of-the-art methods. Notably, the model has achieved improvements of +2.9% CTR, +2.1% CPM, and +1.5% GMV in online A/B testing, and now serves the main traffic of the Meituan Waimai advertising system.
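As a minimal, assumption-laden sketch of the general idea (the class below is a stand-in, not DCIN's published CIU), the target item can attend over decision-making contexts such as click-page items and pre-ranking candidates, yielding a context-aware interest vector for the CTR head:

# Stand-in for the context-interaction idea; shapes and names are assumptions.
import torch
import torch.nn as nn

class ContextInteraction(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # target: (batch, 1, dim); context: (batch, n_ctx, dim)
        out, _ = self.attn(target, context, context)  # target attends to its context
        return out.squeeze(1)                         # context-aware interest vector

ciu = ContextInteraction()
target = torch.randn(4, 1, 32)        # candidate ad/item embedding
click_page = torch.randn(4, 20, 32)   # items on recently clicked pages
prerank = torch.randn(4, 50, 32)      # pre-ranking candidates
context = torch.cat([click_page, prerank], dim=1)
interest = ciu(target, context)       # would feed a downstream CTR prediction head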