FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Neural Information Processing Systems

Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. Benchmarking state-of-the-art LLMs on 7K curated examples indicates that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships.


OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

Neural Information Processing Systems

The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity of integrating high-level decision making with low-level robot control based on both global scene understanding and current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is hallucination caused by domain shift. To enhance the agent's performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models, including GPT-4o, and strong zero-shot generalization in the real world. The project page is at https://hhyhrhy.github.io/owmm-agent-project.


ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Neural Information Processing Systems

Recent Vision-Language Models (VLMs) have shown strong performance in general-purpose visual understanding and reasoning, but their ability to comprehend the visual grammar of movie shots remains underexplored and insufficiently evaluated. To bridge this gap, we present ShotBench, a dedicated benchmark for assessing VLMs' understanding of cinematic language. ShotBench includes 3,049 still images and 500 video clips drawn from more than 200 films, with each sample annotated by trained annotators or curated from professional cinematography resources, resulting in 3,608 high-quality question-answer pairs. We conduct a comprehensive evaluation of over 20 state-of-the-art VLMs across eight core cinematography dimensions. Our analysis reveals clear limitations in fine-grained perception and cinematic reasoning of current VLMs. To improve VLMs' capabilities in cinematography understanding, we construct a large-scale multimodal dataset, named ShotQA, which contains about 70k question-answer pairs derived from movie shots. In addition, we propose ShotVL and train this VLM with a two-stage training strategy, integrating both supervised fine-tuning and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our model achieves substantial improvements, surpassing the strongest existing open-source and proprietary models evaluated on ShotBench and establishing a new state of the art.
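
For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step only, not the ShotVL training code. It assumes the commonly described form in which each sampled answer's reward is standardized against the other answers sampled for the same prompt. The function name, the `eps` term, the use of the population standard deviation, and the reward values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled answers.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. `eps` is a hypothetical guard against
    division by zero when every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one question
# (1.0 = correct shot-type answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group's average receive a positive advantage and the rest a negative one, so no separately trained value model is needed to provide a baseline.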



FACT: Mitigating Inconsistent Hallucinations in LLMs via Fact-Driven Alternating Code-Text Training

Neural Information Processing Systems

Inconsistent hallucinations remain a major challenge for large language models (LLMs), undermining the accuracy and reliability of fact-based reasoning in real-world applications. Existing approaches often rely on task-specific training or adaptation, such as hand-crafted synthetic datasets for domain tasks or solutions mainly focused on numerical reasoning, thereby limiting generalizability to broader, unseen NLP tasks. Inspired by the structural rigor and logical consistency of programming languages, we observe that fact-based texts can be mapped to programming structures due to their inherent patterns. We further propose FACT, a novel Fact-driven Alternating Code-text Training framework that alternates between text-to-code and code-to-text prediction. FACT is the first task-agnostic paradigm that embeds code and natural language in a shared semantic space, thereby transferring the logical consistency of code to LLM outputs in NLP tasks. Experiments show that with only a small subset of Wiki-40B-en for training, FACT reduces inconsistent hallucinations by 2.7%-8.0%


AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Neural Information Processing Systems

Theory of Mind (ToM), the ability to understand people's minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce AutoToM, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, AutoToM first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend.
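
As an illustration of Bayesian inverse planning in general, and not of the method in the abstract above, the toy sketch below infers a hidden goal from one observed action. It assumes a hand-written agent model with a Boltzmann-rational (softmax) policy; the goals, actions, utilities, and the `beta` rationality parameter are hypothetical, and no LLM backend is involved.

```python
import math


def softmax_policy(utilities, beta=2.0):
    """P(action | goal) for a Boltzmann-rational agent (assumed agent model)."""
    weights = {a: math.exp(beta * u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def infer_goal(prior, utilities_by_goal, observed_actions, beta=2.0):
    """Posterior P(goal | actions), proportional to prior times likelihood."""
    unnormalized = {}
    for goal, p_goal in prior.items():
        policy = softmax_policy(utilities_by_goal[goal], beta)
        likelihood = 1.0
        for action in observed_actions:
            likelihood *= policy[action]
        unnormalized[goal] = p_goal * likelihood
    z = sum(unnormalized.values())
    return {goal: v / z for goal, v in unnormalized.items()}


# Hypothetical scenario: does the person want the apple (kitchen)
# or the book (study)?
prior = {"apple": 0.5, "book": 0.5}
utilities_by_goal = {
    "apple": {"walk_to_kitchen": 1.0, "walk_to_study": 0.0},
    "book": {"walk_to_kitchen": 0.0, "walk_to_study": 1.0},
}
posterior = infer_goal(prior, utilities_by_goal, ["walk_to_kitchen"])
```

Seeing the person walk to the kitchen shifts probability mass toward the "apple" goal, because that action is more likely under a policy that values the kitchen.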


Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Neural Information Processing Systems

LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate a high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with a high-quality human-annotated benchmark), while also exhibiting strong generalizability.
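
The Pearson correlation mentioned above measures how closely model scores on a generated benchmark track their scores on a human-annotated one. The sketch below shows only that computation. The score lists are made-up placeholders, not results from the paper, and the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of four models on each benchmark (placeholders).
generated_scores = [0.62, 0.71, 0.55, 0.80]
human_scores = [0.58, 0.69, 0.50, 0.77]
r = pearson(generated_scores, human_scores)
```

A value near 1 means the generated benchmark scores models in close linear agreement with the human-annotated reference.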


Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization

Neural Information Processing Systems

The rapid development of LLMs has raised concerns about their potential misuse, leading to various watermarking schemes that typically offer high detectability. However, existing watermarking techniques often face a trade-off between watermark detectability and generated text quality. In this paper, we introduce Learning to Watermark (LTW), a novel selective watermarking framework that leverages multi-objective optimization to effectively balance these competing goals. LTW features a lightweight network that adaptively decides when to apply the watermark by analyzing sentence embeddings, token entropy, and the current watermarking ratio. Training of the network involves two specifically constructed loss functions that guide the model toward Pareto-optimal solutions, thereby harmonizing watermark detectability and text quality. By integrating LTW with two baseline watermarking methods, our experimental evaluations demonstrate that LTW significantly enhances text quality without compromising detectability. Our selective watermarking approach offers a new perspective for designing watermarks for LLMs and a way to preserve high text quality in watermarked text.


Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion

Neural Information Processing Systems

Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines.


Bridging the Gap Between Cross-Domain Theory and Practical Application: A Case Study on Molecular Dissolution

Neural Information Processing Systems

Artificial intelligence (AI) has played a transformative role in chemical research, greatly facilitating the prediction of small molecule properties, simulation of catalytic processes, and material design. These advances are driven by increases in computing power, open source machine learning frameworks, and extensive chemical datasets. However, a persistent challenge is the limited amount of high-quality real-world data, while models calculated based on large amounts of theoretical data are often costly and difficult to deploy, which hinders the applicability of AI models in real-world scenarios. In this study, we enhance the prediction of solute-solvent properties by proposing a novel sample selection method: the iterative core subset extraction (CSIE) framework. CSIE iteratively updates the core sample subset based on information gain to remove redundant features in theoretical data and optimize the performance of the model on real chemical datasets. Furthermore, we introduce an asymmetric molecular interaction graph neural network (AMGNN) that combines positional information and bidirectional edge connections to simulate real-world chemical reaction scenarios to better capture solute-solvent interactions. Experimental results show that our method can accurately extract the core subset and improve the prediction accuracy.