AI Safety via Debate
We're proposing an AI safety technique which trains agents to debate topics with one another, using a human to judge who wins. We believe that this or a similar approach could eventually help us train AI systems to perform far more cognitively advanced tasks than humans are capable of, while remaining in line with human preferences. We're going to outline this method together with preliminary proof-of-concept experiments and are also releasing a web interface so people can experiment with the technique. The debate method visualized as a game tree, similar to a game like Go but with sentences between debaters for moves and human judgements at leaf nodes. In both debate and Go, the true answer depends on the entire tree, but a single path through the tree chosen by strong agents is evidence for the whole.
May-3-2018, 21:06:34 GMT