Problem Solving
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Melo, Luckeciano C., Tigas, Panagiotis, Abate, Alessandro, Gal, Yarin
Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.
Cognitively Inspired Energy-Based World Models
Gladstone, Alexi, Nanduru, Ganesh, Islam, Md Mofijul, Chadha, Aman, Li, Jundong, Iqbal, Tariq
One of the predominant methods for training world models is autoregressive prediction in the output space of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), this takes the form of autoregressive models predicting the next frame/token/pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions regarding future states. Based on this capability, and third, by assessing when predictions are sufficient, humans allocate a dynamic amount of time to make a prediction. This adaptive process is analogous to System 2 thinking in psychology. All these capabilities are fundamental to the success of humans at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models lacking these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described. Moreover, we developed a variant of the traditional autoregressive transformer tailored for Energy-Based models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU Hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and intelligently searching across state spaces.
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, Wang, Xin Eric
Multimodal Language Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3\% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
Only people with eagle eyes can solve a new Rubik's cube brainteaser in under 30 seconds
The Rubik's cube is a classic mind game for all ages, challenging players to align a single color on each side. The popular 3D puzzle has been recreated into a brainteaser that shows dozens of cubes that appear identical - but there is an odd one in the bunch. The objective is to spot the cube that does not match in under 30 seconds - but only those with eagle eyes can spot it. The popular 3D puzzle has been recreated into a brainteaser that shows dozens of cubes that appear identical - but there is an odd one in the bunch. The new brainteaser was created by online gaming experts at MrQ who said the puzzle will'leave even the most eagle-eyed viewers scratching their heads in anguish.' 'It takes the average person 30 seconds to find the odd Rubik's cube out and a whopping one in three admitting to giving up finding the colorful cube completely,' the company shared.
Multi-objective Reinforcement learning from AI Feedback
This paper presents Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF), a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF). In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into multiple simpler principles, such as toxicity, factuality, and sycophancy. Separate preference models are trained for each principle using feedback from GPT-3.5-Turbo. These preference model scores are then combined using different scalarization functions to provide a reward signal for Proximal Policy Optimization (PPO) training of the target language model. Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones. Surprisingly, the choice of scalarization function does not appear to significantly impact the results. Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of natural language tasks (Bommasani et al. (2021); Brown et al. (2020)). However, ensuring that these models behave in alignment with human values and preferences remains a significant challenge (Kenton et al. (2021)). Reinforcement learning from human feedback (RLHF) has emerged as a promising approach to address this issue by training models to optimize for humanspecified reward functions (Christiano et al. (2017); Stiennon et al. (2020)), but many issues with RLHF have been identified, such as the limited ability of humans to evaluate responses and reward hacking of the preference model (Casper et al. (2023)). In a standard RLHF setup, a preference model is trained on human comparisons of model outputs to represent human preferences.
FormulaReasoning: A Dataset for Formula-Based Numerical Reasoning
Li, Xiao, Zhu, Bolin, Liu, Sichen, Zhu, Yin, Liu, Yiwei, Cheng, Gong
The application of formulas is a fundamental ability of humans when addressing numerical reasoning problems. However, existing numerical reasoning datasets seldom explicitly indicate the formulas employed during the reasoning steps. To bridge this gap, we construct a dataset for formula-based numerical reasoning called FormulaReasoning, which consists of 5,420 reasoning-based questions. We employ it to conduct evaluations of LLMs with size ranging from 7B to over 100B parameters utilizing zero-shot and few-shot chain-of-thought methods, and we further explore using retrieval-augmented LLMs provided with an external formula database associated with our dataset. We also experiment with supervised methods where we divide the reasoning process into formula generation, parameter extraction, and numerical calculation, and perform data augmentation. Our empirical findings underscore the significant potential for improvement in existing models when applied to our complex, formula-driven FormulaReasoning.
Pandora: Towards General World Model with Natural Language Actions and Video States
Xiang, Jiannan, Liu, Guangyi, Gu, Yi, Gao, Qiyue, Ning, Yuting, Zha, Yuheng, Feng, Zeyu, Tao, Tianhua, Hao, Shibo, Shi, Yemin, Liu, Zhengzhong, Xing, Eric P., Hu, Zhiting
World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provides a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper makes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training-from-scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential of building stronger general world models with larger-scale training.
From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition
Wu, Shiwei, Zhang, Chao, Chen, Joya, Xu, Tong, Wu, Likang, Hu, Yao, Chen, Enhong
People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This brings unique challenges to recognizing social relationships, requiring understanding and capturing the essence of these contexts from visual appearances. However, current methods of social relationship understanding rely on the basic classification paradigm of detected persons and objects, which fails to understand the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes \textbf{Con}textual \textbf{So}cial \textbf{R}elationships (\textbf{ConSoR}) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter upon the frozen CLIP to learn social concepts via our novel multi-modal side adapter tuning mechanism. Further, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, emotions) with social relationships for each image, and then compel ConSoR to concentrate more intensively on the decisive visual social factors via visual-linguistic contrasting. Impressively, ConSoR outperforms previous methods with a 12.2\% gain on the People-in-Social-Context (PISC) dataset and a 9.8\% increase on the People-in-Photo-Album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships.
LOP-Field: Brain-inspired Layout-Object-Position Fields for Robotic Scene Understanding
Hou, Jiawei, Guan, Wenhao, Xue, Xiangyang, Zeng, Taiping
Spatial cognition empowers animals with remarkably efficient navigation abilities, largely depending on the scene-level understanding of spatial environments. Recently, it has been found that a neural population in the postrhinal cortex of rat brains is more strongly tuned to the spatial layout rather than objects in a scene. Inspired by the representations of spatial layout in local scenes to encode different regions separately, we proposed LOP-Field that realizes the Layout-Object-Position(LOP) association to model the hierarchical representations for robotic scene understanding. Powered by foundation models and implicit scene representation, a neural field is implemented as a scene memory for robots, storing a queryable representation of scenes with position-wise, object-wise, and layout-wise information. To validate the built LOP association, the model is tested to infer region information from 3D positions with quantitative metrics, achieving an average accuracy of more than 88\%. It is also shown that the proposed method using region information can achieve improved object and view localization results with text and RGB input compared to state-of-the-art localization methods.
How Useful is Intermittent, Asynchronous Expert Feedback for Bayesian Optimization?
Kristiadi, Agustinus, Strieth-Kalthoff, Felix, Subramanian, Sriram Ganapathi, Fortuin, Vincent, Poupart, Pascal, Pleiss, Geoff
Bayesian optimization (BO) is an integral part of automated scientific discovery -- the so-called self-driving lab -- where human inputs are ideally minimal or at least non-blocking. However, scientists often have strong intuition, and thus human feedback is still useful. Nevertheless, prior works in enhancing BO with expert feedback, such as by incorporating it in an offline or online but blocking (arrives at each BO iteration) manner, are incompatible with the spirit of self-driving labs. In this work, we study whether a small amount of randomly arriving expert feedback that is being incorporated in a non-blocking manner can improve a BO campaign. To this end, we run an additional, independent computing thread on top of the BO loop to handle the feedback-gathering process. The gathered feedback is used to learn a Bayesian preference model that can readily be incorporated into the BO thread, to steer its exploration-exploitation process. Experiments on toy and chemistry datasets suggest that even just a few intermittent, asynchronous expert feedback can be useful for improving or constraining BO. This can especially be useful for its implication in improving self-driving labs, e.g. making them more data-efficient and less costly.