Goto

Collaborating Authors

 Reinforcement Learning


Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

arXiv.org Artificial Intelligence

While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.


AlphaX: An AI-Based Value Investing Strategy for the Brazilian Stock Market

arXiv.org Artificial Intelligence

Autonomous trading strategies have been a subject of research within the field of artificial intelligence (AI) for aconsiderable period. Various AI techniques have been explored to develop autonomous agents capable of trading financial assets. These approaches encompass traditional methods such as neural networks, fuzzy logic, and reinforcement learning, as well as more recent advancements, including deep neural networks and deep reinforcement learning. Many developers report success in creating strategies that exhibit strong performance during simulations using historical price data, a process commonly referred to as backtesting. However, when these strategies are deployed in real markets, their performance often deteriorates, particularly in terms of risk-adjusted returns. In this study, we propose an AI-based strategy inspired by a classical investment paradigm: Value Investing. Financial AI models are highly susceptible to lookahead bias and other forms of bias that can significantly inflate performance in backtesting compared to live trading conditions. To address this issue, we conducted a series of computational simulations while controlling for these biases, thereby reducing the risk of overfitting. Our results indicate that the proposed approach outperforms major Brazilian market benchmarks. Moreover, the strategy, named AlphaX, demonstrated superior performance relative to widely used technical indicators such as the Relative Strength Index (RSI) and Money Flow Index (MFI), with statistically significant results. Finally, we discuss several open challenges and highlight emerging technologies in qualitative analysis that may contribute to the development of a comprehensive AI-based Value Investing framework in the future


Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

arXiv.org Artificial Intelligence

Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.


Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots

arXiv.org Artificial Intelligence

Hierarchical reinforcement learning (HRL) is hypothesized to be able to leverage the inherent hierarchy in learning tasks where traditional reinforcement learning (RL) often fails. In this research, HRL is evaluated and contrasted with traditional RL in complex robotic navigation tasks. We evaluate unique characteristics of HRL, including its ability to create sub-goals and the termination functions. We constructed a number of experiments to test: 1) the differences between RL proximal policy optimization (PPO) and HRL, 2) different ways of creating sub-goals in HRL, 3) manual vs automatic sub-goal creation in HRL, and 4) the effects of the frequency of termination on performance in HRL. These experiments highlight the advantages of HRL over RL and how it achieves these advantages.


Reinforcement Learning for Solving the Pricing Problem in Column Generation: Applications to Vehicle Routing

arXiv.org Artificial Intelligence

In this paper, we address the problem of Column Generation (CG) using Reinforcement Learning (RL). Specifically, we use a RL model based on the attention-mechanism architecture to find the columns with most negative reduced cost in the Pricing Problem (PP). Unlike previous Machine Learning (ML) applications for CG, our model deploys an end-to-end mechanism as it independently solves the pricing problem without the help of any heuristic. We consider a variant of Vehicle Routing Problem (VRP) as a case study for our method. Through a set of experiments where our method is compared against a Dynamic Programming (DP)-based heuristic for solving the PP, we show that our method solves the linear relaxation up to a reasonable objective gap in significantly shorter running times.


Adaptive Important Region Selection with Reinforced Hierarchical Search for Dense Object Detection

Neural Information Processing Systems

Existing state-of-the-art dense object detection techniques tend to produce a large number of false positive detections on difficult images with complex scenes because they focus on ensuring a high recall. To improve the detection accuracy, we propose an Adaptive Important Region Selection ( AIRS) framework guided by Evidential Q-learning coupled with a uniquely designed reward function. Inspired by human visual attention, our detection model conducts object search in a top-down, hierarchical fashion. It starts from the top of the hierarchy with the coarsest granularity and then identifies the potential patches likely to contain objects of interest. It then discards non-informative patches and progressively moves downward on the selected ones for a fine-grained search. The proposed evidential Q-learning systematically encodes epistemic uncertainty in its evidential-Q value to encourage the exploration of unknown patches, especially in the early phase of model training. In this way, the proposed model dynamically balances exploration-exploitation to cover both highly valuable and informative patches. Theoretical analysis and extensive experiments on multiple datasets demonstrate that our proposed framework outperforms the SOT A models.



Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

Neural Information Processing Systems

A recent line of work has focused on the Linear Quadratic Regulator (LQR) as a testbed to understand the behavior and trade-offs of various RL algorithms in the continuous state and action space setting.