wombat
- Asia > Middle East > Jordan (0.04)
- Oceania > New Zealand (0.04)
- Oceania > Australia > Tasmania (0.04)
RRHF (1)
RRHF can align language models not only with human preferences but with arbitrary preferences. As a large language model, Wombat may still generate unsafe responses. We also conduct experiments on the IMDB dataset to assess positive movie review generation. The task requires the model to produce positive and fluent movie review completions from given partial review inputs. RRHF-OP-128 follows the bottommost workflow in Figure 2 of the main text.
- Oceania > New Zealand (0.05)
- Oceania > Australia > Tasmania (0.05)
- Media > Film (0.56)
- Leisure & Entertainment (0.56)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
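The IMDB experiment described above relies on a preference signal for how positive a completion is. Below is a minimal sketch of one plausible setup, assuming an off-the-shelf IMDB sentiment classifier serves as the reward model that ranks sampled completions; the checkpoint id `lvwerra/distilbert-imdb` is an assumption for illustration, not necessarily what the authors used.

```python
# Hedged sketch: a sentiment classifier as the preference signal for the IMDB
# positive-review task described above. The model id is an assumption; any IMDB
# sentiment classifier would do. Ranking sampled completions by this scalar reward
# mirrors the preference ordering an RRHF-style ranking loss is trained against.
from transformers import pipeline

# Sentiment classifier acting as a stand-in reward model.
reward_model = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

prompt = "This movie was"
completions = [
    " a complete waste of two hours.",
    " surprisingly touching, with a career-best performance from the lead.",
    " fine, I guess, though the pacing dragged in the middle.",
]

def positivity(text: str) -> float:
    """Return a scalar reward: probability mass assigned to the POSITIVE label."""
    out = reward_model(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]

# Rank sampled completions by reward, most positive first.
ranked = sorted(completions, key=lambda c: positivity(prompt + c), reverse=True)
for c in ranked:
    print(f"{positivity(prompt + c):.3f}  {prompt + c}")
```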
Intriguing Equivalence Structures of the Embedding Space of Vision Transformers
Salman, Shaeke; Shams, Md Montasir Bin; Liu, Xiuwen
Pre-trained large foundation models play a central role in the recent surge of artificial intelligence, resulting in fine-tuned models with remarkable abilities when measured on benchmark datasets, standard exams, and applications. Due to their inherent complexity, these models are not well understood. While small adversarial inputs to such models are well known, the structures of the representation space are not well characterized despite their fundamental importance. In this paper, using vision transformers as an example due to the continuous nature of their input space, we show via analyses and systematic experiments that the representation space consists of large piecewise linear subspaces where very different inputs share the same representations, and, at the same time, local normal spaces where visually indistinguishable inputs have very different representations. The empirical results are further verified using local directional estimates of the Lipschitz constants of the underlying models. Consequently, the resulting representations change the outputs of downstream models, and such models are subject to overgeneralization and have limited semantically meaningful generalization capability.
- Health & Medicine (0.46)
- Law (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
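The abstract above mentions verifying the empirical results with local directional estimations of the Lipschitz constants of the underlying models. The following is a minimal sketch of such an estimate, assuming a standard pre-trained ViT from `transformers`; the checkpoint `google/vit-base-patch16-224`, the step size, and the finite-difference scheme are illustrative assumptions, not the authors' exact protocol.

```python
# Hedged sketch of a local directional Lipschitz estimate: the finite-difference ratio
#   ||f(x + eps * d) - f(x)|| / eps
# along a unit direction d in input (pixel) space, where f is the ViT embedding.
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

def embed(pixel_values: torch.Tensor) -> torch.Tensor:
    """Pooled [CLS] embedding of an image batch (the representation studied above)."""
    with torch.no_grad():
        return model(pixel_values=pixel_values).last_hidden_state[:, 0, :]

def directional_lipschitz(x: torch.Tensor, d: torch.Tensor, eps: float = 1e-3) -> float:
    """Finite-difference estimate of the directional Lipschitz constant at x along d."""
    d = d / d.norm()                      # unit perturbation direction
    fx, fx_eps = embed(x), embed(x + eps * d)
    return ((fx_eps - fx).norm() / eps).item()

# Example: a random image and a random perturbation direction.
x = torch.rand(1, 3, 224, 224)            # ViT-Base expects 224x224 RGB input
d = torch.randn_like(x)
print(f"estimated directional Lipschitz constant: {directional_lipschitz(x, d):.2f}")
```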
RRHF: Rank Responses to Align Language Models with Human Feedback without tears
Yuan, Zheng; Yuan, Hongyi; Tan, Chuanqi; Wang, Wei; Huang, Songfang; Huang, Fei
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and to scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via the logarithm of conditional probabilities and learns to align these probabilities with human preferences through a ranking loss. RRHF can leverage sampled responses from various sources, including the model's own responses, responses from other large language models, and human expert responses, and learns to rank them. RRHF needs only 1 to 2 models during tuning and can efficiently and robustly align language models with human preferences without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating alignment performance comparable to PPO as measured by reward model scores and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests that RRHF is a best-of-n learner.
- Asia > Middle East > Jordan (0.04)
- Oceania > New Zealand (0.04)
- Oceania > Australia > Tasmania (0.04)
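The abstract above describes scoring sampled responses by the logarithm of their conditional probabilities and aligning these scores with preferences through a ranking loss. Below is a minimal sketch of such an objective, assuming length-normalized log-probabilities and a pairwise hinge on every misordered pair; the exact normalization and the added SFT term reflect one reading of the description, not the authors' code.

```python
# Hedged sketch of an RRHF-style objective: each sampled response is scored by its
# length-normalized conditional log-probability under the model, and a ranking loss
# penalizes any pair whose score order disagrees with the reward (preference) order.
import torch

def rrhf_loss(logprobs: torch.Tensor, lengths: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    logprobs: (k,) summed token log-probabilities of k sampled responses under the model
    lengths:  (k,) token counts of those responses
    rewards:  (k,) preference scores (e.g. from a reward model or human ranking)
    """
    p = logprobs / lengths                       # length-normalized response scores
    # For every pair (i, j) with reward_i < reward_j, penalize p_i exceeding p_j.
    diff = p.unsqueeze(1) - p.unsqueeze(0)       # diff[i, j] = p_i - p_j
    worse = rewards.unsqueeze(1) < rewards.unsqueeze(0)
    rank_loss = torch.clamp(diff[worse], min=0).sum()
    # SFT-style term: keep maximizing the likelihood of the best-rewarded response.
    best = rewards.argmax()
    sft_loss = -logprobs[best]
    return rank_loss + sft_loss

# Toy usage with made-up numbers for three sampled responses.
lp = torch.tensor([-42.0, -35.0, -50.0])         # summed log-probs
ln = torch.tensor([20.0, 18.0, 25.0])            # token counts
rw = torch.tensor([0.1, 0.9, 0.4])               # reward model scores
print(rrhf_loss(lp, ln, rw))
```

Because the ranking term only needs log-probabilities from the model being tuned (plus frozen scores for the sampled responses), this objective requires far fewer live models than a standard PPO setup, which is the simplification the abstract emphasizes.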