Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

Neural Information Processing Systems

Despite recent advances, Large Video Language Models (LVLMs) still struggle with fine-grained temporal understanding, hallucinate, and often make basic mistakes even on simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO.
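
For context, the DPO baseline that the abstract compares against can be sketched as below. This is a minimal illustration of the standard sequence-level DPO loss for a single preferred/non-preferred pair, not RRPO itself: the abstract does not specify RRPO's sub-sequence-level rewards or token-wise KL term, so neither is reproduced here. The function names, the example log-probabilities, and the default beta = 0.1 are assumptions made for illustration only.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, non-preferred) response pair.

    Each argument is a sequence log-probability under the policy or the
    frozen reference model. beta (illustrative default) scales the
    implicit reward.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-token log-probabilities for one response pair.
policy_chosen = sequence_logprob([-0.2, -0.3, -0.1])
policy_rejected = sequence_logprob([-0.9, -1.1, -0.8])
ref_chosen = sequence_logprob([-0.4, -0.5, -0.3])
ref_rejected = sequence_logprob([-0.6, -0.7, -0.5])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```

Because this loss assigns a single reward to each whole response, it cannot tell which part of a non-preferred response is wrong. That response-level granularity is the kind of limitation the abstract says RRPO targets with sub-sequence-level rewards.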







Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings Appendix

Neural Information Processing Systems

We provide the hyper-parameters of our models in Table A.1 (hyper-parameters used for training our VisualCSE and AudioCSE). [...] we use Dropout augmentation (the same strategy as in SimCSE) for AudioCSE. We compare unsup-SimCSE and unsup-VisualCSE on a small-scale retrieval test. As shown in Table C.1, VisualCSE generally retrieves sentences that are qualitatively different from those retrieved by SimCSE.


Supplementary for Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning

Neural Information Processing Systems

Xiaoqian Wu, Shanghai Jiao Tong University, enlighten@sjtu.edu.cn

In Tab. 1, we summarize the notations used in this work for clarity.

Notation: Definition
r: A rule.
The size of the premise symbols set M.
S is the symbol set, and R is the rule set.
A \ B: The set difference of A and B.
D: A very large-scale activity images database.