Goto

Collaborating Authors

 guidance model


Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation

Xie, Guofu, Zhang, Chen, Zhang, Xiao, Shi, Yunsheng, Yao, Ting, Xu, Jun

arXiv.org Artificial Intelligence

Adapting to diverse user needs at test time is a key challenge in controllable multi-objective generation. Existing methods are insufficient: merging-based approaches provide indirect, suboptimal control at the parameter level, often disregarding the impacts of multiple objectives. While decoding-based guidance is more direct, it typically requires aggregating logits from multiple expert models, incurring significant space overhead and relying heavily on individual model capacity. To address these issues, we introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. We first identify a critical compatibility problem between the guidance and base models. In Stage 1, MAGE resolves this by dynamically constructing a more robust base model, merging a series of backbone models that account for multiple objectives. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1. Our analysis empirically validates Linear Mode Connectivity (LMC) in value models, explores the relationship between model merging and prediction ensembling, and demonstrates the enhanced controllability afforded by our approach. Extensive experiments show that our method outperforms existing approaches, achieving superior controllability, Pareto-optimal performance, and enhanced adaptability.


Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation

Choi, YoungChan, Wang, HengFei, Cheng, YiHua, Kim, Boeun, Chang, Hyung Jin, Choi, YoungGeun, Choi, Sang-Il

arXiv.org Artificial Intelligence

We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.


TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Wang, Guoxin, Wang, Qingyuan, Huang, Binhua, Chen, Shaowu, John, Deepu

arXiv.org Artificial Intelligence

ABSTRACT Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose Tiny-Drop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of tokens while performing inference, thereby selectively discarding low-importance tokens if large vit models need to perform attention calculations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT -based classification.

  Country:
  Genre: Research Report (0.64)

DoPI: Doctor-like Proactive Interrogation LLM for Traditional Chinese Medicine

Sun, Zewen, Huang, Ruoxiang, Feng, Jiahe, Kong, Rundong, Wang, Yuqian, Liu, Hengyu, Gong, Ziqi, Qin, Yuyuan, Wang, Yingxue, Wang, Yu

arXiv.org Artificial Intelligence

Enhancing interrogation capabilities in Traditional Chinese Medicine (TCM) diagnosis through multi-turn dialogues and knowledge graphs presents a significant challenge for modern AI systems. Current large language models (LLMs), despite their advancements, exhibit notable limitations in medical applications, particularly in conducting effective multi-turn dialogues and proactive questioning. These shortcomings hinder their practical application and effectiveness in simulating real-world diagnostic scenarios. To address these limitations, we propose DoPI, a novel LLM system specifically designed for the TCM domain. The DoPI system introduces a collaborative architecture comprising a guidance model and an expert model. The guidance model conducts multi-turn dialogues with patients and dynamically generates questions based on a knowledge graph to efficiently extract critical symptom information. Simultaneously, the expert model leverages deep TCM expertise to provide final diagnoses and treatment plans. Furthermore, this study constructs a multi-turn doctor-patient dialogue dataset to simulate realistic consultation scenarios and proposes a novel evaluation methodology that does not rely on manually collected real-world consultation data. Experimental results show that the DoPI system achieves an accuracy rate of 84.68 percent in interrogation outcomes, significantly enhancing the model's communication ability during diagnosis while maintaining professional expertise.


Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

Chen, Zhaoyang, Fleming, Cody

arXiv.org Artificial Intelligence

Classifier free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data: observations, actions, and rewards, and is independent of the policy module's behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings: Guidance Necessity: We explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion. A lack of good guidance in the early stage presents an opportunity for optimization. Guidance-First Diffusion Training: We introduce a method where the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. Cross-Module Transferability: Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL), with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.


AIR: Complex Instruction Generation via Automatic Iterative Refinement

Liu, Wei, He, Yancheng, Huang, Hui, Hu, Chengwei, Liu, Jiaheng, Li, Shilong, Su, Wenbo, Zheng, Bo

arXiv.org Artificial Intelligence

With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich contents and structures in large web corpora. In this paper, we propose a novel automatic iterative refinement framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs' ability to follow complex instructions. The AIR framework consists of two stages: (1)Generate an initial instruction from a document; (2)Iteratively refine instructions with LLM-as-judge guidance by comparing the model's output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model's ability to follow complex instructions, outperforming existing methods for instruction generation.


Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Wu, Huiwen, Li, Xiaohan, Xu, Xiaogang, Wu, Jiafei, Zhang, Deyi, Liu, Zhe

arXiv.org Artificial Intelligence

The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive `positive' and `negative' models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) finetuning with a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.


Training-Free Guidance for Discrete Diffusion Models for Molecular Generation

Kerby, Thomas J., Moon, Kevin R.

arXiv.org Machine Learning

Training-free guidance methods for continuous data have seen an explosion of interest due to the fact that they enable foundation diffusion models to be paired with interchangable guidance models. Currently, equivalent guidance methods for discrete diffusion models are unknown. We present a framework for applying training-free guidance to discrete data and demonstrate its utility on molecular graph generation tasks using the discrete diffusion model architecture of DiGress. We pair this model with guidance functions that return the proportion of heavy atoms that are a specific atom type and the molecular weight of the heavy atoms and demonstrate our method's ability to guide the data generation.


Localizing Moments in Long Video Via Multimodal Guidance

Barrios, Wayner, Soldan, Mattia, Ceballos-Arroyo, Alberto Mario, Heilbron, Fabian Caba, Ghanem, Bernard

arXiv.org Artificial Intelligence

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.


FedDWA: Personalized Federated Learning with Dynamic Weight Adjustment

Liu, Jiahao, Wu, Jiang, Chen, Jinyu, Hu, Miao, Zhou, Yipeng, Wu, Di

arXiv.org Artificial Intelligence

Different from conventional federated learning, personalized federated learning (PFL) is able to train a customized model for each individual client according to its unique requirement. The mainstream approach is to adopt a kind of weighted aggregation method to generate personalized models, in which weights are determined by the loss value or model parameters among different clients. However, such kinds of methods require clients to download others' models. It not only sheer increases communication traffic but also potentially infringes data privacy. In this paper, we propose a new PFL algorithm called \emph{FedDWA (Federated Learning with Dynamic Weight Adjustment)} to address the above problem, which leverages the parameter server (PS) to compute personalized aggregation weights based on collected models from clients. In this way, FedDWA can capture similarities between clients with much less communication overhead. More specifically, we formulate the PFL problem as an optimization problem by minimizing the distance between personalized models and guidance models, so as to customize aggregation weights for each client. Guidance models are obtained by the local one-step ahead adaptation on individual clients. Finally, we conduct extensive experiments using five real datasets and the results demonstrate that FedDWA can significantly reduce the communication traffic and achieve much higher model accuracy than the state-of-the-art approaches.