Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs
Kong, Lecheng, Wang, Xiyuan, Chen, Yixin, Zhang, Muhan
arXiv.org Artificial Intelligence
Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry, handling bidirectional tasks like reaction prediction and retrosynthesis. However, these models often lack round-trip consistency. For instance, a state-of-the-art chemical LLM may successfully caption a molecule, yet be unable to accurately reconstruct the original structure from its own generated text. This inconsistency suggests that models are learning unidirectional memorization rather than flexible mastery. Indeed, recent work has demonstrated a strong correlation between a model's round-trip consistency and its performance on the primary tasks. We therefore introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency by using the success of a round-trip transformation as a reward signal. We further propose an iterative variant where forward and reverse mappings alternately train each other in a self-improvement loop, a process that is highly data-efficient and notably effective with the massive amounts of unlabelled data common in chemistry. Experiments demonstrate that RTRL significantly boosts performance and consistency over strong baselines across supervised, self-supervised, and synthetic data regimes. This work shows that round-trip consistency is not just a desirable property but a trainable objective, offering a new path toward more robust and reliable foundation models.

Large Language Models (LLMs) are emerging as a powerful class of versatile and generalizable foundation models for computational chemistry (Fang et al.; Cao et al., 2025; Wang et al., 2025). A key advantage is their ability to provide a more flexible and intuitive interface for scientists, enabling interaction with complex data through natural language.
This is powered by their unique capability to process and generate information across diverse chemical modalities, from structured SMILES strings to unstructured experimental procedures (Zhao et al., 2025; Pei et al., 2024a). By unifying these disparate tasks and data types within a single framework, LLMs promise to significantly accelerate the pace of scientific discovery. A prominent theme enabled by this new generation of models is the modeling of bidirectional chemical tasks, such as translating between textual descriptions and molecular structures or mapping between forward synthesis and retrosynthesis (Pei et al., 2024b; Edwards et al., 2024; Liu et al., 2024). More than just a capability, however, this bidirectionality allows us to examine a model's depth of understanding through the crucial property of round-trip consistency (RTC).
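The round-trip reward described above can be made concrete with a minimal sketch: score the forward mapping by whether the reverse mapping recovers the original input. The function and variable names below (`round_trip_reward`, `forward_fn`, `reverse_fn`) and the exact-match check are illustrative assumptions, not the paper's actual implementation, which would use trained chemical LLMs and a more tolerant structure-matching criterion.

```python
# Sketch of a round-trip consistency (RTC) reward signal: the forward model
# maps a molecule to text, the reverse model maps text back to a molecule,
# and the reward is 1.0 only if the original input is reconstructed.
from typing import Callable


def round_trip_reward(
    x: str,
    forward_fn: Callable[[str], str],   # e.g. SMILES -> caption
    reverse_fn: Callable[[str], str],   # e.g. caption -> SMILES
) -> float:
    """Return 1.0 if the reverse mapping reconstructs the input, else 0.0."""
    y = forward_fn(x)        # forward pass: molecule to description
    x_hat = reverse_fn(y)    # reverse pass: description back to molecule
    return 1.0 if x_hat == x else 0.0


# Toy demonstration with invertible stand-in "models":
caption = {"CCO": "ethanol"}
uncaption = {"ethanol": "CCO"}

reward = round_trip_reward(
    "CCO",
    lambda s: caption.get(s, ""),
    lambda t: uncaption.get(t, ""),
)
# reward == 1.0, since the round trip CCO -> "ethanol" -> CCO succeeds
```

In the iterative variant the paper proposes, this scalar reward would alternately update the forward and reverse models; the sketch only shows how the signal itself is computed, which is what makes it usable on unlabelled molecules.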
Oct-3-2025