AITopics

2504.10283

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine Learning, Feb-9-2025

Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

Fan, Jiajun, Shen, Shuaike, Cheng, Chaoran, Chen, Yuxin, Liang, Chumeng, Liu, Ge

Recent advancements in reinforcement learning (RL) have achieved great success in fine-tuning diffusion-based generative models. However, fine-tuning continuous flow-based generative models to align with arbitrary user-defined reward functions remains challenging, particularly due to issues such as policy collapse from overoptimization and the prohibitively high computational cost of likelihoods in continuous-time flows. In this paper, we propose an easy-to-use and theoretically sound RL fine-tuning method, which we term Online Reward-Weighted Conditional Flow Matching with Wasserstein-2 Regularization (ORW-CFM-W2). Our method integrates RL into the flow matching framework to fine-tune generative models with arbitrary reward functions, without relying on gradients of rewards or filtered datasets. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. To prevent policy collapse and maintain diversity, we incorporate Wasserstein-2 (W2) distance regularization into our method and derive a tractable upper bound for it in flow matching, effectively balancing exploration and exploitation of policy optimization. We provide theoretical analyses to demonstrate the convergence properties and induced data distributions of our method, establishing connections with traditional RL algorithms featuring Kullback-Leibler (KL) regularization and offering a more comprehensive understanding of the underlying mechanisms and learning behavior of our approach. Extensive experiments on tasks including target image generation, image compression, and text-image alignment demonstrate the effectiveness of our method, where our method achieves optimal policy convergence while allowing controllable trade-offs between reward maximization and diversity preservation.
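As a rough illustration of the idea in this abstract, the sketch below estimates a reward-weighted conditional flow matching loss with an added penalty that keeps the fine-tuned velocity field close to a reference one. Everything specific here is an assumption made for illustration, not the paper's formulation: the exponential weighting `exp(reward / temperature)`, the linear interpolation path, the use of a squared velocity difference as a stand-in for the W2 upper bound, and all names (`reward_weighted_cfm_loss`, `velocity_fn`, `ref_velocity_fn`, `w2_coef`).

```python
import math
import random


def reward_weighted_cfm_loss(velocity_fn, ref_velocity_fn, samples, rewards,
                             temperature=1.0, w2_coef=0.1, rng=None):
    """Monte Carlo estimate of a reward-weighted flow matching loss.

    velocity_fn / ref_velocity_fn map (x_t, t) to a velocity vector (list).
    samples are data points x1 drawn from the current model; rewards are
    their scalar scores. All of this is a hypothetical sketch.
    """
    rng = rng or random.Random(0)

    # Assumed weighting: exp(reward / temperature), normalized over the batch.
    max_r = max(rewards)
    weights = [math.exp((r - max_r) / temperature) for r in rewards]
    total_w = sum(weights)
    weights = [w / total_w for w in weights]

    loss = 0.0
    for x1, w in zip(samples, weights):
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]               # noise endpoint
        t = rng.random()                                     # time in [0, 1)
        xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # linear path
        target = [b - a for a, b in zip(x0, x1)]             # path velocity

        v = velocity_fn(xt, t)
        v_ref = ref_velocity_fn(xt, t)

        # Reward-weighted flow matching term.
        fm = sum((vi - ui) ** 2 for vi, ui in zip(v, target))
        # Assumed regularizer: squared distance to the reference velocity.
        reg = sum((vi - ri) ** 2 for vi, ri in zip(v, v_ref))

        loss += w * fm + w2_coef * reg / len(samples)
    return loss
```

Raising `w2_coef` in this sketch pulls the fine-tuned field toward the reference model, while lowering `temperature` concentrates weight on the highest-reward samples, which mirrors the reward versus diversity trade-off the abstract describes.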

machine learning, natural language, reinforcement learning, (21 more...)

2502.06061

Country:

North America > United States (1.00)
Europe (1.00)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > New Finding (0.46)
Instructional Material > Course Syllabus & Notes (0.45)

Industry: Energy > Oil & Gas > Upstream (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

arXiv.org Machine Learning, Jul-1-2024

Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Wang, Hao, Chen, Zhichao, Shen, Yuan, Fan, Jiajun, Liu, Zhaoran, Yang, Degui, Liu, Xinggao, Li, Haoxuan

Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective--exacerbated by limited data availability for HTE estimation--we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.

artificial intelligence, data mining, machine learning, (19 more...)

2407.01111

Country: North America > United States > California (0.14)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

arXiv.org Machine Learning, Oct-27-2023

Optimal Transport for Treatment Effect Estimation

Wang, Hao, Chen, Zhichao, Fan, Jiajun, Li, Haoxuan, Liu, Tianqiao, Liu, Weiming, Dai, Quanyu, Wang, Yichao, Dong, Zhenhua, Tang, Ruiming

Estimating conditional average treatment effect from observational data is highly challenging due to the existence of treatment selection bias. Prevalent methods mitigate this issue by aligning distributions of different treatment groups in the latent space. However, there are two critical problems that these methods fail to address: (1) mini-batch sampling effects (MSE), which cause misalignment in non-ideal mini-batches with outcome imbalance and outliers; (2) unobserved confounder effects (UCE), which result in inaccurate discrepancy calculation due to the neglect of unobserved confounders. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR), which is a new take on optimal transport in the context of causality. Specifically, based on the framework of stochastic optimal transport, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that our proposed ESCFR can successfully tackle the treatment selection bias and achieve significantly better performance than state-of-the-art methods.
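To make the notion of an optimal-transport discrepancy between treatment groups concrete, here is a minimal sketch of balanced entropic optimal transport (Sinkhorn iterations) between two small sets of latent representations. It is not ESCFR: the relaxed mass-preserving and proximal factual outcome regularizers named in the abstract are not implemented, and the function name `sinkhorn_cost`, the squared Euclidean cost, and the uniform marginals are assumptions chosen for illustration.

```python
import math


def sinkhorn_cost(xs, ys, epsilon=0.1, n_iters=200):
    """Entropic OT between two point sets with uniform weights.

    xs, ys: lists of equal-dimension vectors (e.g. treated and control
    representations). Returns (transport cost, transport plan).
    """
    n, m = len(xs), len(ys)
    # Squared Euclidean ground cost between every pair of units.
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    # Gibbs kernel; very large cost / epsilon ratios can underflow to zero.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]

    a = [1.0 / n] * n   # uniform mass on the first group
    b = [1.0 / m] * m   # uniform mass on the second group
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan
```

In a representation-balancing setup, a cost like this would typically be added to the factual outcome loss so that the encoder is pushed to align the two groups; how the actual method weights and relaxes that term is beyond what the abstract states.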

artificial intelligence, data mining, machine learning, (19 more...)

2310.18286

Country: North America > United States (0.67)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

arXiv.org Artificial Intelligence, Oct-8-2023

ConvFormer: Revisiting Transformer for Sequential User Modeling

Wang, Hao, Lian, Jianxun, Wu, Mingqi, Li, Haoxuan, Fan, Jiajun, Xu, Wanyue, Li, Chaozhuo, Xie, Xing

Sequential user modeling, a critical task in personalized recommender systems, focuses on predicting the next item a user would prefer, requiring a deep understanding of user behavior sequences. Despite the remarkable success of Transformer-based models across various domains, their full potential in comprehending user behavior remains untapped. In this paper, we re-examine Transformer-like architectures aiming to advance state-of-the-art performance. We start by revisiting the core building blocks of Transformer-based methods, analyzing the effectiveness of the item-to-item mechanism within the context of sequential user modeling. After conducting a thorough experimental analysis, we identify three essential criteria for devising efficient sequential user models, which we hope will serve as practical guidelines to inspire and shape future designs. Following this, we introduce ConvFormer, a simple but powerful modification to the Transformer architecture that meets these criteria, yielding state-of-the-art results. Additionally, we present an acceleration technique to minimize the complexity associated with processing extremely long sequences. Experiments on four public datasets showcase ConvFormer's superiority and confirm the validity of our proposed criteria.

large language model, machine learning, natural language, (16 more...)

2308.02925

Country: North America > United States > Texas (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, May-9-2023

Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection

Fan, Jiajun, Zhuang, Yuzheng, Liu, Yuecheng, Hao, Jianye, Wang, Bin, Zhu, Jiangcheng, Wang, Hao, Xia, Shu-Tao

The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which further limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address the limitation, which a) enables a significantly enlarged behavior selection space via formulating a hybrid behavior mapping from all policies; b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control via optimizing the selection of the behavior mappings with bandit-based meta-controllers. Our agents have achieved 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, which demonstrates our significant state-of-the-art (SOTA) performance without degrading the sample efficiency.
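The abstract mentions bandit-based meta-controllers that choose among behavior mappings. A minimal sketch of that general idea, using the standard UCB1 rule, is given below. The class name `UCBBehaviorSelector`, the choice of UCB1, and the use of episode return as the bandit reward are assumptions for illustration; the abstract does not say which bandit algorithm or reward signal LBC uses.

```python
import math


class UCBBehaviorSelector:
    """UCB1 bandit over a fixed set of candidate behaviors (hypothetical)."""

    def __init__(self, n_behaviors, exploration=2.0):
        self.counts = [0] * n_behaviors     # times each behavior was chosen
        self.means = [0.0] * n_behaviors    # running mean return per behavior
        self.exploration = exploration

    def select(self):
        # Try every behavior once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, episode_return):
        # Incremental update of the running mean for the chosen behavior.
        self.counts[arm] += 1
        self.means[arm] += (episode_return - self.means[arm]) / self.counts[arm]
```

A typical loop would call `select()` at the start of each episode, run the chosen behavior, and pass the resulting return to `update()`.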

behavior space, machine learning, reinforcement learning, (16 more...)

2305.05239

Country:

Europe (1.00)
North America > United States > New York (0.28)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial Intelligence, Feb-20-2023

Entire Space Counterfactual Learning: Tuning, Analytical Properties and Industrial Applications

Wang, Hao, Chen, Zhichao, Fan, Jiajun, Huang, Yuxin, Liu, Weiming, Liu, Xinggao

As a basic research problem for building effective recommender systems, post-click conversion rate (CVR) estimation has long been plagued by sample selection bias and data sparsity issues. To address the data sparsity issue, prevalent methods based on entire space multi-task models leverage the sequential pattern of user actions, i.e., exposure $\rightarrow$ click $\rightarrow$ conversion, to construct auxiliary learning tasks. However, they still fall short of guaranteeing the unbiasedness of CVR estimates. This paper theoretically demonstrates two defects of these entire space multi-task models: (1) inherent estimation bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) potential independence priority (PIP) for CTCVR estimation, where the causality from click to conversion might be overlooked. This paper further proposes a principled method named entire space counterfactual multi-task model (ESCM$^2$), which employs a counterfactual risk minimizer to handle both IEB and PIP issues at once. To demonstrate the effectiveness of the proposed method, this paper explores its parameter tuning in practice, derives its analytic properties, and showcases its effectiveness in industrial CVR estimation, where ESCM$^2$ can effectively alleviate the intrinsic IEB and PIP issues and outperform baseline models.
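The exposure, click, conversion chain and the idea of a counterfactual risk can be sketched with inverse propensity scoring (IPS), one common way to debias a loss that is only observed on clicked impressions. The abstract does not say which estimator the counterfactual risk minimizer uses, so the IPS form, the function names (`ips_cvr_risk`, `ctcvr`), and the clipping constants below are assumptions for illustration.

```python
import math


def log_loss(prob, label):
    """Binary cross-entropy for one prediction, with clipping for stability."""
    prob = min(max(prob, 1e-7), 1.0 - 1e-7)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def ips_cvr_risk(clicks, conversions, p_ctr, p_cvr):
    """IPS-weighted CVR risk averaged over all exposed impressions.

    Conversion labels are only meaningful where a click happened, so each
    clicked sample is reweighted by 1 / predicted click propensity.
    """
    total = 0.0
    for clicked, converted, ctr, cvr in zip(clicks, conversions, p_ctr, p_cvr):
        if clicked:
            total += log_loss(cvr, converted) / max(ctr, 1e-7)
    return total / len(clicks)


def ctcvr(p_ctr, p_cvr):
    """Entire-space decomposition: P(click and convert) = pCTR * pCVR."""
    return [ctr * cvr for ctr, cvr in zip(p_ctr, p_cvr)]
```

Dividing by the click propensity up-weights clicked impressions that the model considered unlikely to be clicked, which is what lets a loss computed on clicks stand in for a loss over the whole exposure space.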

artificial intelligence, escm 2, machine learning, (17 more...)

2210.11039

Country: Asia > China (0.28)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.88)

arXiv.org Artificial Intelligence, Dec-10-2021

A Review for Deep Reinforcement Learning in Atari: Benchmarks, Challenges, and Solutions

Fan, Jiajun

The Arcade Learning Environment (ALE) was proposed as an evaluation platform for empirically assessing the generality of agents across dozens of Atari 2600 games. ALE offers various challenging problems and has drawn significant attention from the deep reinforcement learning (RL) community. From Deep Q-Networks (DQN) to Agent57, RL agents seem to achieve superhuman performance in ALE. However, is this the case? In this paper, to explore this question, we first review the current evaluation metrics in the Atari benchmarks and then reveal that the current criteria for claiming superhuman performance are inappropriate, because they underestimate human performance relative to what is possible. To handle those problems and promote the development of RL research, we propose a novel Atari benchmark based on human world records (HWR), which puts forward higher requirements for RL agents on both final performance and learning efficiency. Furthermore, we summarize the state-of-the-art (SOTA) methods in Atari benchmarks and provide benchmark results over new evaluation metrics based on human world records. From those new benchmark results, we conclude that at least four open challenges hinder RL agents from achieving superhuman performance. Finally, we also discuss some promising ways to handle those problems.
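The difference between normalizing against an average human and against a human world record can be seen with the usual normalized-score formula, (agent - random) / (reference - random). The sketch below applies that formula with two different references. The function name `normalized_score` and all the scores in the usage lines are made up for illustration; they are not taken from the paper or from any Atari game.

```python
def normalized_score(agent_score, random_score, reference_score):
    """Score of an agent relative to a reference, with random play as zero.

    A value of 1.0 means the agent matches the reference (e.g. an average
    human or a human world record); values above 1.0 exceed it.
    """
    return (agent_score - random_score) / (reference_score - random_score)


# Hypothetical numbers for a single game.
agent, random_play = 150.0, 50.0
average_human, world_record = 100.0, 1050.0

vs_human = normalized_score(agent, random_play, average_human)    # 2.0
vs_record = normalized_score(agent, random_play, world_record)    # 0.1
```

With these invented numbers the same agent is at 200% of the average-human baseline but only 10% of the world-record baseline, which is the kind of gap the abstract argues current metrics hide.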

artificial intelligence, machine learning, reinforcement learning, (21 more...)

2112.04145

Country: Asia > China (0.14)

Genre:

Research Report (1.00)
Overview (0.67)

Industry:

Leisure & Entertainment > Sports (0.94)
Leisure & Entertainment > Games > Computer Games (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial Intelligence, Jun-15-2021

GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning

Fan, Jiajun, Xiao, Changnan, Huang, Yue

Deep Q Network (DQN) first opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL), and observed that the distribution of the acquired data changes during the training process. DQN found that this property might cause training instability, so it proposed effective methods to handle its downside. Instead of focusing on the unfavourable aspects, we find it critical for RL to ease the gap between the estimated data distribution and the ground truth data distribution, while supervised learning (SL) fails to do so. From this new perspective, we extend the basic paradigm of RL, called Generalized Policy Iteration (GPI), into a more general version, which is called Generalized Data Distribution Iteration (GDI). We show that many RL algorithms and techniques can be unified into the GDI paradigm and considered as special cases of GDI. We provide theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI have been proposed to verify its effectiveness and generality. Empirical experiments demonstrate our state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.98% mean human normalized score (HNS), 1146.39% median HNS and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research toward conquering the human world records and seeking real superhuman agents in both performance and efficiency.

algorithm, computer game, deep learning, (19 more...)

2106.06232

Genre: Research Report (0.81)

Industry:

Leisure & Entertainment > Sports (0.69)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

arXiv.org Artificial Intelligence, Jun-1-2021

An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Xiao, Changnan, Shi, Haosen, Fan, Jiajun, Deng, Shihong

Policy-based reinforcement learning methods suffer from the policy collapse problem. We find that value-based reinforcement learning methods with an $\epsilon$-greedy mechanism enjoy three characteristics, Closed-form Diversity, Objective-invariant Exploration and Adaptive Trade-off, which help value-based methods avoid the policy collapse problem. However, there does not exist a parallel mechanism for policy-based methods that achieves all three characteristics. In this paper, we propose an entropy regularization free mechanism that is designed for policy-based methods, which achieves Closed-form Diversity, Objective-invariant Exploration and Adaptive Trade-off. Our experiments show that our mechanism is highly sample-efficient for policy-based methods and boosts a policy-based baseline to a new state of the art on the Arcade Learning Environment.
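For readers unfamiliar with the $\epsilon$-greedy mechanism that the abstract takes as its starting point, the sketch below writes out the action distribution it induces for a value-based agent. The function names are hypothetical, and reading the guaranteed floor of `epsilon / n` per action as an instance of what the authors call Closed-form Diversity is an interpretation, not something the abstract spells out. The paper's own mechanism for policy-based methods is not shown here.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy for one state.

    Every action receives at least epsilon / n probability; the greedy
    action receives the remaining 1 - epsilon on top of that.
    """
    n = len(q_values)
    greedy = max(range(n), key=q_values.__getitem__)
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs)[0]
```

Because the floor depends only on `epsilon` and the number of actions, the amount of exploration stays fixed no matter how the value estimates change, which is one way to see why such a policy cannot collapse onto a single action.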

artificial intelligence, behavior policy, reinforcement learning, (15 more...)

2106.00707

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Sports (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)