lizzie
True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, as reflected in their performance on current test tasks. This calls for a more challenging benchmark that requires highly advanced reasoning ability to solve. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve a success rate of over 80%. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy), while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap between the deep reasoning abilities of LLMs and those of humans, and highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
How do we ensure humanity stays ahead of technology?
Azeem Azhar: Ultimately, we're living beings who've lived in a world that hasn't moved at exponential rates, and so we get caught out by the speed with which these technologies improve.
Annie Veillet: Is it too late to start, and to start putting in the right frameworks and controls?
Azeem: Society was really disengaged. It looked at technology as manna from heaven that bright and brilliant people produced as gifts from the gods--and far be it for us to ever ask a critical question of it. And we need to stop doing that, right? We need to be there and ask those questions.
Lizzie O'Leary: From PwC's management publication strategy and business, this is Take on Tomorrow, the podcast that brings together experts from around the globe to figure out what business could and should be doing to tackle some of the biggest issues we face. Developments such as AI are changing the way we live. But what happens when those changes happen too quickly for business to deal with?
- North America > Canada (0.05)
- North America > United States > New York (0.04)
- Europe (0.04)
- Professional Services (0.62)
- Information Technology (0.47)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.85)
How do Influencers Impact the Media Agenda in AI? - Onalytica
Artificial Intelligence: what is it? How will it impact our lives? How will it be used? Will it actually take over? As the field emerges and becomes more widely discussed, social media's role in shaping the public debate around these questions has never been more apparent, as more of us turn to social media for news.