dpo
DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO
Zhang, Tiantian, Zuo, Jierui, Wang, Wenping
This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback\_binarized, evaluate on the held-out test\_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- South America > Argentina (0.04)
- North America > United States > Virginia (0.04)
- South America > Uruguay (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.40)
- North America > United States > California (0.14)
- North America > United States > Alaska (0.04)
- (10 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Media (0.92)
- Information Technology (0.67)
- Leisure & Entertainment (0.67)
- Asia > Japan (0.05)
- North America > United States > Virginia (0.04)
- North America > United States > Hawaii (0.04)
- (5 more...)
- Government (1.00)
- Education (0.93)
- Law (0.69)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.96)