Qu, Yuxiao
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Qu, Yuxiao, Yang, Matthew Y. R., Setlur, Amrith, Tunstall, Lewis, Beeching, Edward Emanuel, Salakhutdinov, Ruslan, Kumar, Aviral
Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with a 0/1 outcome reward, but do these approaches use test-time compute efficiently, and would they continue to scale as the budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome reward in RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
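The "progress" bonus described in the abstract can be made concrete with a minimal sketch: the bonus for each block is the change in the estimated probability of eventually producing a correct answer. The function name and the use of a plain list of probability estimates are illustrative assumptions, not the paper's implementation; in practice the probabilities would be estimated, e.g., by sampling rollouts from each prefix.

```python
# Hedged sketch of the dense "progress" reward described above.
# success_prob[j] = estimated P(correct final answer | blocks 0..j emitted);
# the first entry is the estimate before any block is emitted.
def progress_bonuses(success_prob):
    """Per-block bonus = change in estimated likelihood of eventual success."""
    return [curr - prev for prev, curr in zip(success_prob, success_prob[1:])]

# Toy trajectory: each reasoning block raises the chance of a correct answer.
probs = [0.0, 0.25, 0.5, 1.0]
print(progress_bonuses(probs))  # [0.25, 0.25, 0.5]
```

A block that wastes tokens (no change in success likelihood) earns zero bonus, which is what ties this reward to the cumulative-regret view of spending test-time compute.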
Harnessing Webpage UIs for Text-Rich Visual Understanding
Liu, Junpeng, Ou, Tianyue, Song, Yifan, Qu, Yuxiao, Lam, Wai, Xiong, Chenyan, Chen, Wenhu, Neubig, Graham, Yue, Xiang
Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.
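The synthesis pipeline the abstract describes, a text-only LLM reading an accessibility tree and the resulting instruction being paired with a screenshot, can be sketched as follows. `call_text_llm` and the sample schema are placeholder assumptions, not MultiUI's actual interface.

```python
# Hedged sketch: synthesize a multimodal training sample from a webpage's
# accessibility tree (text) plus its screenshot (image), per the idea above.
def synthesize_sample(accessibility_tree, screenshot_path, call_text_llm):
    # The text LLM never sees pixels; it reads only the structured tree.
    qa = call_text_llm(
        "Write one instruction and its answer about this UI:\n" + accessibility_tree
    )
    # Pairing the text-derived QA with the screenshot yields a multimodal sample.
    return {"image": screenshot_path, "conversation": qa}

# Stubbed LLM call keeps the example self-contained.
stub_llm = lambda prompt: {"instruction": "What does the button say?", "answer": "Submit"}
sample = synthesize_sample("button 'Submit'", "page_001.png", stub_llm)
print(sample["conversation"]["answer"])  # Submit
```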
Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning
Corrado, Nicholas E., Qu, Yuxiao, Balis, John U., Labiosa, Adam, Hanna, Josiah P.
Learning from demonstration (LfD) is a popular technique that uses expert demonstrations to learn robot control policies. However, the difficulty in acquiring expert-quality demonstrations limits the applicability of LfD methods: real-world data collection is often costly, and the quality of the demonstrations depends greatly on the demonstrator's abilities and safety concerns. A number of works have leveraged data augmentation (DA) to inexpensively generate additional demonstration data, but most DA works generate augmented data in a random fashion and ultimately produce highly suboptimal data. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight of GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily identify when an augmented trajectory segment represents task progress. Thus, the user can impose a series of simple rules on the DA process to automatically generate augmented samples that approximate expert behavior. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, we find that GuDA enables learning from a small set of potentially suboptimal demonstrations and substantially outperforms a DA strategy that samples augmented data randomly.
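The key insight above, that a user can easily judge whether an augmented segment represents task progress even when demonstrating expert behavior is hard, can be sketched as a filter over random augmentations. The toy 1-D task, the function names, and the translation augmentation are all illustrative assumptions, not GuDA's implementation.

```python
import random

# Hedged sketch of guided data augmentation: randomly augment trajectory
# segments, then keep only those a user-supplied rule judges as task progress.
def random_augment(segment, rng):
    # Toy augmentation: translate a 1-D trajectory segment by a random offset.
    offset = rng.uniform(-1.0, 1.0)
    return [(s + offset, a, s2 + offset) for (s, a, s2) in segment]

def guided_augment(segments, makes_progress, n_samples, seed=0):
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_samples:
        aug = random_augment(rng.choice(segments), rng)
        if makes_progress(aug):  # the user's simple rule replaces expert labeling
            kept.append(aug)
    return kept

# Toy task: the goal is state 0, so "progress" means moving closer to it.
segments = [[(2.0, -0.5, 1.5)], [(1.0, 0.5, 1.5)]]
progress_rule = lambda seg: abs(seg[-1][2]) < abs(seg[0][0])
data = guided_augment(segments, progress_rule, n_samples=5)
assert all(progress_rule(seg) for seg in data)
```

The resulting dataset approximates expert behavior by construction and can be handed directly to an off-the-shelf offline RL or behavior cloning algorithm, as the abstract describes.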
FLEE-GNN: A Federated Learning System for Edge-Enhanced Graph Neural Network in Analyzing Geospatial Resilience of Multicommodity Food Flows
Qu, Yuxiao, Rao, Jinmeng, Gao, Song, Zhang, Qianheng, Chao, Wei-Lun, Su, Yu, Miller, Michelle, Morales, Alfonso, Huber, Patrick
Tackling increasing food insecurity is a global imperative. However, the complexity of food supply networks, with their multidimensional interactions and decisions, presents significant challenges. This paper proposes FLEE-GNN, a novel Federated Learning System for Edge-Enhanced Graph Neural Network, designed to overcome these challenges and enhance the analysis of the geospatial resilience of multicommodity food flow networks, which are one type of spatial network. FLEE-GNN addresses the limitations of current methodologies, such as entropy-based methods, in terms of generalizability, scalability, and data privacy. It combines the robustness and adaptability of graph neural networks with the privacy-conscious and decentralized aspects of federated learning for food supply network resilience analysis across geographical regions.
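The federated aspect described in the abstract, regions training locally and sharing only model parameters rather than private flow data, can be sketched as plain federated averaging over parameter vectors. This is a generic FedAvg illustration under that assumption, not the paper's actual training code.

```python
# Hedged sketch: each region trains a GNN on its own food-flow graph locally;
# a coordinator averages only the parameters, so raw flow data stays private.
def fed_avg(region_params):
    """Average equally-shaped parameter vectors from each region's local model."""
    n = len(region_params)
    dim = len(region_params[0])
    return [sum(p[i] for p in region_params) / n for i in range(dim)]

# Three regions' (toy) parameter vectors after one round of local training.
regions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(fed_avg(regions))  # [3.0, 4.0]
```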