dvo
Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values
Zhang, Hongbo, Cui, Han, Bao, Guangsheng, Yang, Linyi, Wang, Jun, Zhang, Yue
We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Virginia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
How Many Vote Operations Are Needed to Manipulate A Voting System?
In this paper, we propose a framework to study a general class of strategic behavior in voting, which we call vote operations. We prove the following theorem: if we fix the number of alternatives, generate $n$ votes i.i.d. according to a distribution $\pi$, and let $n$ go to infinity, then for any $\epsilon >0$, with probability at least $1-\epsilon$, the minimum number of operations that are needed for the strategic individual to achieve her goal falls into one of the following four categories: (1) 0, (2) $\Theta(\sqrt n)$, (3) $\Theta(n)$, and (4) $\infty$. This theorem holds for any set of vote operations, any individual vote distribution $\pi$, and any integer generalized scoring rule, which includes (but is not limited to) almost all commonly studied voting rules, e.g., approval voting, all positional scoring rules (including Borda, plurality, and veto), plurality with runoff, Bucklin, Copeland, maximin, STV, and ranked pairs. We also show that many well-studied types of strategic behavior fall under our framework, including (but not limited to) constructive/destructive manipulation, bribery, and control by adding/deleting votes, margin of victory, and minimum manipulation coalition size. Therefore, our main theorem naturally applies to these problems.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (13 more...)
Variable and Value Ordering for MPE Search
Siddiqi, Sajjad Ahmed (Australian National University and National ICT Australia) | Huang, Jinbo (Australian National University and National ICT Australia)
In Bayesian networks, a most probable explanation (MPE) is a most likely instantiation of all network variables given a piece of evidence. Solving (the decision version of) an MPE query is NP-hard. Recent work proposed a branch-and-bound search algorithm that finds exact solutions to MPE queries, where bounds are computed on a relaxed network obtained by a technique known as node splitting. In this work we study the impact of variable and value ordering on such a search algorithm. We study several heuristics based on the entropies of variables and on the notion of nogoods, and propose a new meta-heuristic that combines their strengths. Experiments indicate that search efficiency is significantly improved, allowing many hard problems to be solved for the first time.