Appendix for Softmax Deep Double Deterministic Policy Gradients
Pan, Ling
We demonstrate the smoothing effect of SD3 on the optimization landscape in this section. The experimental setup is the same as in the comparative study in Section 4.1 of the main text; experimental details can be found in Section B.2. The performance comparison of SD3 and TD3 is shown in Figure 1(a), where SD3 significantly outperforms TD3, demonstrating the smoothing effect of SD3 over TD3. Hyperparameters of DDPG and SD2 are summarized in Table 1. We assume that the actor is a local maximizer with respect to the critic.
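The smoothing comes from replacing the hard max over action values with a Boltzmann-weighted average. The following NumPy sketch of that softmax operator is illustrative only, not the paper's code; the inverse-temperature parameter name `beta` is our own choice.

```python
import numpy as np

def softmax_value(q_values, beta):
    """Boltzmann-weighted average of action values:
    softmax_beta(Q) = sum_a exp(beta*Q(a)) * Q(a) / sum_a exp(beta*Q(a))."""
    w = np.exp(beta * (q_values - q_values.max()))  # shift for numerical stability
    return float(np.sum(w * q_values) / np.sum(w))
```

As beta approaches 0 this recovers the uniform mean of the action values; as beta grows it approaches the hard max, which is why the operator smoothly interpolates toward (and smooths) the greedy value used by TD3.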
ShiQ: Bringing back Bellman to LLMs
Clavier, Pierre, Grinsztajn, Nathan, Avalos, Raphael, Flet-Berliac, Yannis, Ergun, Irem, Domingues, Omar D., Tarassov, Eugene, Pietquin, Olivier, Richemond, Pierre H., Strub, Florian, Geist, Matthieu
The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning, has received far less attention in the LLM community despite demonstrating major success in various non-LLM RL tasks. In particular, Q-learning's effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ for Shifted-Q, that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.
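Interpreting logits as Q-values under a softmax policy leads to a soft Bellman backup, which can be sketched as follows. This is a generic soft Q-learning target, not ShiQ's actual loss; the function names and the temperature parameter `tau` are our own illustrative choices.

```python
import numpy as np

def soft_value(logits, tau=1.0):
    # V(s) = tau * logsumexp(logits / tau): the soft-max state value implied
    # by reading the model's logits as Q-values under a softmax policy.
    z = logits / tau
    m = z.max()  # shift for numerical stability
    return tau * (m + np.log(np.sum(np.exp(z - m))))

def soft_bellman_target(reward, next_logits, gamma=1.0, tau=1.0):
    # One-step regression target: Q(s, a) should match r + gamma * V(s').
    return reward + gamma * soft_value(next_logits, tau)
```

A token-wise loss in this spirit would regress each token's logit toward such a target, which is what allows off-policy learning from logged completions.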
Safety and optimality in learning-based control at low computational cost
Baumann, Dominik, Kowalczyk, Krzysztof, Rojas, Cristian R., Tiels, Koen, Wachel, Pawel
Applying machine learning methods to physical systems that are supposed to act in the real world requires providing safety guarantees. However, methods that include such guarantees often come at a high computational cost, making them inapplicable to large datasets and embedded devices with low computational power. In this paper, we propose CoLSafe, a computationally lightweight safe learning algorithm whose computational complexity grows sublinearly with the number of data points. We derive both safety and optimality guarantees and showcase the effectiveness of our algorithm on a seven-degree-of-freedom robot arm.
Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits
We study the noise-free Gaussian process (GP) bandits problem, in which the learner seeks to minimize regret through noise-free observations of a black-box objective function lying in a known reproducing kernel Hilbert space (RKHS). Gaussian process upper confidence bound (GP-UCB) is a well-known GP-bandits algorithm whose query points are adaptively chosen based on a GP-based upper confidence bound score. Although several existing works have reported the practical success of GP-UCB, current theoretical results indicate suboptimal performance; empirically, however, GP-UCB tends to perform well compared with other nearly optimal noise-free algorithms that rely on non-adaptive sampling schemes for query points. This paper resolves this gap between theoretical and empirical performance by showing a nearly optimal regret upper bound for noise-free GP-UCB. Specifically, our analysis shows the first constant cumulative regret in the noise-free setting for the squared exponential kernel and the Matérn kernel with some degree of smoothness.
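The GP-UCB selection rule picks the candidate maximizing the posterior mean plus a scaled posterior standard deviation. A minimal noise-free sketch with an RBF kernel follows; the kernel length-scale `ell`, the exploration weight `beta`, and the tiny `jitter` for numerical conditioning are our illustrative choices, not values from the paper.

```python
import numpy as np

def rbf_kernel(a, b, ell=0.3):
    # Squared exponential kernel on 1-D inputs; k(x, x) = 1.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(candidates, x_obs, y_obs, ell=0.3, jitter=1e-10):
    # Noise-free GP posterior mean and standard deviation on a candidate grid.
    K = rbf_kernel(x_obs, x_obs, ell) + jitter * np.eye(len(x_obs))
    k = rbf_kernel(candidates, x_obs, ell)
    mu = k @ np.linalg.solve(K, y_obs)
    v = np.linalg.solve(K, k.T)
    var = np.clip(1.0 - np.sum(k * v.T, axis=1), 0.0, None)
    return mu, np.sqrt(var)

def gp_ucb_select(candidates, x_obs, y_obs, beta=2.0):
    # UCB score: mu(x) + sqrt(beta) * sigma(x); query its argmax.
    mu, sigma = gp_posterior(candidates, x_obs, y_obs)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]
```

In the noise-free setting the posterior variance vanishes at observed points, so the UCB score naturally steers queries away from locations that are already known exactly.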
Mitigating Popularity Bias in Collaborative Filtering through Fair Sampling
Liu, Jiahao, Li, Dongsheng, Gu, Hansu, Zhang, Peng, Lu, Tun, Shang, Li, Gu, Ning
Recommender systems often suffer from popularity bias, where frequently interacted items are overrepresented in recommendations. This bias stems from propensity factors influencing training data, leading to imbalanced exposure. In this paper, we introduce a Fair Sampling (FS) approach to address this issue by ensuring that both users and items are selected with equal probability as positive and negative instances. Unlike traditional inverse propensity score (IPS) methods, FS does not require propensity estimation, eliminating errors associated with inaccurate calculations. Our theoretical analysis demonstrates that FS effectively neutralizes the influence of propensity factors, achieving unbiased learning. Experimental results validate that FS outperforms state-of-the-art methods in both point-wise and pair-wise recommendation tasks, enhancing recommendation fairness without sacrificing accuracy. The implementation is available at https://anonymous.4open.science/r/Fair-Sampling.
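One plausible reading of the equal-probability selection described above is to sample the positive pair item-first, so that every item is equally likely to appear as a positive regardless of its popularity. The sketch below follows that reading only as an assumption; the `interactions` structure and function name are our own, not the paper's implementation.

```python
import random

def fair_sample(interactions, all_items, rng=random):
    # interactions: dict mapping user -> set of items the user interacted with.
    # Sampling the positive item uniformly first (then a user who interacted
    # with it) removes the popularity skew of user-first sampling.
    item_to_users = {}
    for u, items in interactions.items():
        for i in items:
            item_to_users.setdefault(i, []).append(u)
    pos_item = rng.choice(sorted(item_to_users))
    user = rng.choice(sorted(item_to_users[pos_item]))
    # Negative item drawn uniformly from the items the user has not seen
    # (assumes at least one such item exists).
    neg_item = rng.choice([i for i in all_items if i not in interactions[user]])
    return user, pos_item, neg_item
```

Under user-first sampling, an item's chance of being a positive scales with its interaction count; item-first sampling makes that chance uniform, which is the intuition behind neutralizing the propensity factors.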
Variational Bayesian inference of hidden stochastic processes with unknown parameters
Atitey, Komlan, Loskot, Pavel, Mihaylova, Lyudmila
Estimating hidden processes from non-linear noisy observations is particularly difficult when the parameters of these processes are not known. This paper adopts a machine learning approach to devise variational Bayesian inference for such scenarios. In particular, a random process generated by the autoregressive moving average (ARMA) linear model is inferred from non-linear noisy observations. The posterior distribution of the hidden states is approximated by a set of weighted particles generated by a sequential Monte Carlo (SMC) algorithm involving sequential importance sampling with resampling (SISR). The numerical efficiency and estimation accuracy of the proposed inference method are evaluated by computer simulations. Furthermore, the proposed method is demonstrated on the practical problem of estimating missing values in gene expression time series, assuming a vector autoregressive (VAR) data model.
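The SISR loop (propagate, weight by the observation likelihood, resample) can be sketched for a hidden AR(1) state observed through a non-linear map. Everything here is illustrative: the quadratic observation model, the noise levels, and the AR coefficient are our assumptions, not the paper's setup.

```python
import numpy as np

def sisr_filter(y, n_particles=500, phi=0.9, q=1.0, r=0.5, seed=0):
    """Minimal SISR particle filter for x_t = phi*x_{t-1} + v_t, v_t ~ N(0, q),
    observed through y_t = x_t**2 / 2 + w_t, w_t ~ N(0, r**2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)          # initial particle cloud
    means = []
    for yt in y:
        x = phi * x + rng.normal(0.0, np.sqrt(q), n_particles)  # propagate
        w = np.exp(-0.5 * ((yt - x**2 / 2) / r) ** 2)           # importance weights
        w /= w.sum()
        means.append(float(np.sum(w * x)))                       # posterior mean
        x = rng.choice(x, size=n_particles, p=w)                 # resample
    return np.array(means)
```

Resampling at every step keeps the particle cloud from degenerating to a few dominant weights; note that with a quadratic observation map the posterior is bimodal in the sign of the state, so the posterior mean alone can be ambiguous.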