
Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

Zhao, Kangwen, Cai, Jianfeng, Zhu, Jinhua, Sun, Ruopei, Xue, Dongyun, Zhou, Wengang, Li, Li, Li, Houqiang

arXiv.org Artificial Intelligence

Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations: these approaches either mitigate bias without characterizing its form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects underlying bias patterns. Our approach consists of three stages: first, we train a standard reward model, which inherently contains length bias; next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward; finally, we incorporate this learned relation into the reward model to debias it. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves the length-controlled win rate and reduces verbosity without compromising performance.
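
The fit-then-subtract idea behind this three-stage recipe can be illustrated with a toy numerical sketch: fit a lightweight nonlinear model to the length-reward relation on synthetic rewards, then subtract the fitted bias. The synthetic data, the logarithmic bias shape, and the choice of a cubic polynomial as the fitting model are all illustrative assumptions here, not the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: observed reward = true quality + a nonlinear length bias
# (assumption: logarithmic bias, purely for illustration).
lengths = rng.integers(20, 500, size=1000).astype(float)
quality = rng.normal(0.0, 1.0, size=1000)
rewards = quality + 0.8 * np.log(lengths)

# Stage-2 analogue: fit a lightweight nonlinear model (a cubic polynomial
# standing in for the paper's fitting model) mapping length to reward.
coeffs = np.polyfit(lengths, rewards, deg=3)
bias_estimate = np.polyval(coeffs, lengths)

# Stage-3 analogue: subtract the fitted bias to debias the reward scores.
debiased = rewards - bias_estimate

corr_before = np.corrcoef(lengths, rewards)[0, 1]
corr_after = np.corrcoef(lengths, debiased)[0, 1]
print(f"length-reward correlation before: {corr_before:.2f}, after: {corr_after:.2f}")
```

After subtraction, the length-reward correlation collapses toward zero while the quality signal is untouched, which is the "more balanced length-reward distribution" the abstract refers to.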


A Semantic-Loss Function Modeling Framework With Task-Oriented Machine Learning Perspectives

Nguyen, Ti Ti, Le, Thanh-Dung, Ha, Vu Nguyen, Chou, Hong-fu, Eappen, Geoffrey, Tran, Duc-Dung, Nguyen-Kha, Hung, Thiruvasagam, Prabhu, Garces-Socarras, Luis M., Gonzalez-Rios, Jorge L., Merlano-Duncan, Juan C., Chatzinotas, Symeon

arXiv.org Artificial Intelligence

The integration of machine learning (ML) has significantly enhanced the capabilities of Earth Observation (EO) systems by enabling the extraction of actionable insights from complex datasets. However, the performance of data-driven EO applications is heavily influenced by the data collection and transmission processes, where limited satellite bandwidth and latency constraints can hinder the full transmission of original data to the receivers. To address this issue, adopting the concepts of Semantic Communication (SC) offers a promising solution by prioritizing the transmission of essential data semantics over raw information. Implementing SC for EO systems requires a thorough understanding of the impact of data processing and communication channel conditions on semantic loss at the processing center. This work proposes a novel data-fitting framework to empirically model the semantic loss using real-world EO datasets and domain-specific insights. The framework quantifies two primary types of semantic loss: (1) source coding loss, assessed via a data quality indicator measuring the impact of processing on raw source data, and (2) transmission loss, evaluated by comparing practical transmission performance against the Shannon limit. Semantic losses are estimated by evaluating the accuracy of EO applications using four task-oriented ML models, EfficientViT, MobileViT, ResNet50-DINO, and ResNet8-KD, on lossy image datasets under varying channel conditions and compression ratios. These results underpin a framework for efficient semantic-loss modeling in bandwidth-constrained EO scenarios, enabling more reliable and effective operations.
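
The transmission-loss component compares practical link performance against the Shannon limit, C = B log2(1 + SNR). A minimal sketch of that comparison (the gap indicator, the bandwidth, and the rate figures are hypothetical illustrations, not the paper's exact quality metric):

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon limit for an AWGN channel: C = B * log2(1 + SNR)."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

def transmission_gap(practical_rate_bps: float, bandwidth_hz: float, snr_db: float) -> float:
    """Hypothetical transmission-loss indicator: the fraction by which the
    practical rate falls short of the Shannon limit (0 means at the limit)."""
    return 1.0 - practical_rate_bps / shannon_capacity(bandwidth_hz, snr_db)

# Example: a 1 MHz satellite downlink at 10 dB SNR delivering 2.5 Mbit/s.
gap = transmission_gap(2.5e6, 1e6, 10.0)
print(f"gap to Shannon limit: {gap:.1%}")
```

A larger gap indicates more room for semantic loss to accumulate during transmission, which is what the framework evaluates across varying channel conditions.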


A Model-based Multi-Agent Personalized Short-Video Recommender System

Zhou, Peilun, Xu, Xiaoxiao, Hu, Lantao, Li, Han, Jiang, Peng

arXiv.org Artificial Intelligence

A recommender selects and presents top-K items to the user at each online request, and a recommendation session consists of several sequential requests. Formulating a recommendation session as a Markov decision process and solving it within a reinforcement learning (RL) framework has attracted increasing attention from both the academic and industry communities. In this paper, we propose an RL-based industrial short-video recommender ranking framework, which models and maximizes user watch time in an environment of multi-aspect user preferences via a collaborative multi-agent formulation. Moreover, our proposed framework adopts a model-based learning approach to alleviate sample selection bias, a crucial but intractable problem in industrial recommender systems. Extensive offline evaluations and live experiments confirm the effectiveness of our proposed method over alternatives. Our approach has been deployed on our real large-scale short-video sharing platform, successfully serving hundreds of millions of users.
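
As a loose illustration of combining multiple preference aspects into one top-K ranking, here is a minimal sketch in which each "agent" scores one aspect and a weighted sum decides the ranking. The aspect names, weights, and scores are hypothetical, and the paper's collaborative multi-agent formulation is considerably richer than a fixed weighted sum.

```python
import heapq

def rank_top_k(items, agent_scores, weights, k=3):
    """Combine per-aspect agent scores into a single ranking score per item
    and return the top-K items (a simplified stand-in for collaborative
    multi-agent ranking; the aspect weights are assumed, not learned)."""
    def combined(item):
        return sum(w * agent_scores[aspect][item] for aspect, w in weights.items())
    return heapq.nlargest(k, items, key=combined)

# Hypothetical example: four candidate videos scored by two aspect agents.
items = ["v1", "v2", "v3", "v4"]
agent_scores = {
    "watch_time": {"v1": 0.9, "v2": 0.4, "v3": 0.7, "v4": 0.2},
    "engagement": {"v1": 0.1, "v2": 0.8, "v3": 0.6, "v4": 0.3},
}
weights = {"watch_time": 0.7, "engagement": 0.3}
top = rank_top_k(items, agent_scores, weights, k=3)
print(top)
```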


Two-Stage Hybrid Day-Ahead Solar Forecasting

Alanazi, Mohana, Mahoor, Mohsen, Khodaei, Amin

arXiv.org Machine Learning

Power supply from renewable resources is on a global rise, and renewable generation is forecast to surpass other types of generation in the foreseeable future. Increased generation from renewable resources, mainly solar and wind, exposes the power grid to more vulnerabilities, conceivably due to their variable generation, thus highlighting the importance of accurate forecasting methods. This paper proposes a two-stage day-ahead solar forecasting method that breaks down the forecasting into linear and nonlinear parts, determines subsequent forecasts, and accordingly improves the accuracy of the obtained results. To further reduce the error resulting from nonstationarity of the historical solar radiation data, a data processing approach, including pre-process and post-process levels, is integrated with the proposed method. Numerical simulations on three test days with different weather conditions exhibit the effectiveness of the proposed two-stage model.

Figure 1: New added U.S. electric generation from 2010 to Q1 2016 [2].
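
The linear/nonlinear decomposition can be sketched on synthetic data: fit the linear part (here an ordinary least-squares trend), then model the residual's nonlinear structure (here an hourly-mean daily profile), and sum the two for the day-ahead forecast. The trend-plus-sinusoid series and both component models are simplifying assumptions; the paper's stages are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hourly solar-radiation history over 30 days (assumption: a daily
# half-sinusoid with a small linear trend plus noise, purely for illustration).
t = np.arange(24 * 30, dtype=float)
series = 0.05 * t + 50 * np.maximum(0.0, np.sin(2 * np.pi * t / 24)) + rng.normal(0, 2, t.size)

# Stage 1: linear part -- an ordinary least-squares trend fit.
A = np.column_stack([t, np.ones_like(t)])
slope, intercept = np.linalg.lstsq(A, series, rcond=None)[0]
linear_fit = slope * t + intercept

# Stage 2: nonlinear part -- capture the residual's daily shape by averaging
# residuals per hour of day (a simple stand-in for a nonlinear model).
residual = series - linear_fit
hourly_profile = residual.reshape(-1, 24).mean(axis=0)

# Day-ahead forecast = extrapolated linear trend + learned nonlinear profile.
t_next = np.arange(t[-1] + 1, t[-1] + 25)
forecast = slope * t_next + intercept + hourly_profile

# Compare against the known synthetic ground truth for the next day.
truth = 0.05 * t_next + 50 * np.maximum(0.0, np.sin(2 * np.pi * t_next / 24))
rmse = float(np.sqrt(np.mean((forecast - truth) ** 2)))
print(f"day-ahead RMSE on synthetic data: {rmse:.2f}")
```

Because the linear stage removes the trend before the nonlinear stage learns the daily shape, neither component has to explain the other's structure, which is the rationale for the two-stage split.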