Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment
Such a reward model serves as a proxy for human preference, and it is critical for guiding the RL step toward improving model quality. In this work, we argue that the SFT stage also benefits significantly from learning a reward model. Instead of using the human demonstration data directly via supervised learning, we propose leveraging an Inverse Reinforcement Learning (IRL) technique to build a reward model and a policy model simultaneously. This approach leads to new SFT algorithms that are not only efficient to implement but also robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL-based approach and a recent line of work called Self-Play Fine-Tuning (SPIN; Chen et al. [2024]).
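The alternating structure the abstract describes can be illustrated in a toy setting: fit a reward that separates human demonstrations from the current policy's own samples, then softly improve the policy against that reward. Everything below (the tabular single-token setup, the logistic discriminator loss, the step sizes) is an illustrative assumption, not the paper's algorithm:

```python
# Hypothetical toy sketch of the IRL-style alternation: (1) fit a reward that
# separates human demonstrations from the current policy's samples, (2) softly
# improve the policy against that reward. The tabular single-token setting and
# all hyperparameters are illustrative assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
V = 8                                          # toy "vocabulary" of responses
tau = 1.0                                      # entropy-regularization strength
demos = rng.choice([2, 3], size=200)           # human demos favor tokens 2, 3
r = np.zeros(V)                                # learned reward per token
policy_logits = np.zeros(V)                    # policy over tokens

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(300):
    pi = softmax(policy_logits)
    samples = rng.choice(V, size=200, p=pi)    # "negatives" from the policy
    # (1) reward step: logistic loss, demos labeled 1, policy samples labeled 0
    for x, y in ((demos, 1.0), (samples, 0.0)):
        p = 1.0 / (1.0 + np.exp(-r[x]))
        grad = np.zeros(V)
        np.add.at(grad, x, y - p)              # gradient of the log-likelihood
        r += 0.1 * grad / len(x)
    # (2) policy step: the entropy-regularized optimum is pi*(a) ~ exp(r(a)/tau),
    # so move the logits partway toward it
    policy_logits += 0.3 * (r / tau - policy_logits)

print("final policy:", np.round(softmax(policy_logits), 3))  # mass on 2 and 3
```

Read this only as a caricature of why the abstract's IRL-SPIN connection is plausible: in both cases the learned reward acts as a discriminator between human demonstrations and the model's own generations.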
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human votes on more than 2.2 million captions, collected via crowdsourced ratings for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, using both GPT-4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT-4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.
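To make the "ranking-based evaluation strategies" concrete, here is a minimal sketch of one standard way to turn pairwise votes into a caption ranking: fitting Bradley-Terry scores by gradient ascent. The (winner, loser) vote format and all names are hypothetical, not the released dataset's schema:

```python
# Hypothetical sketch of a ranking-based evaluation: fit Bradley-Terry scores
# to pairwise caption votes, then rank captions by score. The (winner, loser)
# vote format and all names here are assumptions, not the dataset's schema.
import numpy as np

def bradley_terry_scores(n_items, votes, iters=300, lr=0.5):
    """votes: list of (winner_idx, loser_idx) pairs; returns one score per item."""
    s = np.zeros(n_items)
    w = np.array([v[0] for v in votes])
    l = np.array([v[1] for v in votes])
    for _ in range(iters):
        p_win = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))  # P(winner beats loser)
        grad = np.zeros(n_items)
        np.add.at(grad, w, 1.0 - p_win)               # push winners up
        np.add.at(grad, l, p_win - 1.0)               # push losers down
        s += lr * grad / len(votes)
        s -= s.mean()                                  # scores are shift-invariant
    return s

# Toy usage: caption 0 usually beats 1, and 1 usually beats 2.
votes = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 9 + [(2, 1)] * 1
print("ranking:", np.argsort(-bradley_terry_scores(3, votes)))  # -> [0 1 2]
```

The same fit could in principle double as a model benchmark: add model-generated captions to the vote pool and read off where their scores land relative to top human contestants.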