
Collaborating Authors

roscoe


Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

Sheng, Huanxin, Liu, Xinyi, He, Hangfeng, Zhao, Jieyu, Kang, Jian

arXiv.org Artificial Intelligence

LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluations remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze that uncertainty by offering a prediction interval for LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and of judge reprompting for better judgments.
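The abstract's core recipe can be illustrated with split conformal prediction: calibrate an absolute-residual quantile on held-out (judge score, reference score) pairs, then widen each new judge score into an interval. This is a minimal sketch, not the paper's implementation; the snap-to-scale "ordinal adjustment" and the example numbers are assumptions of this sketch.

```python
import math

def split_conformal_interval(cal_model, cal_human, new_score,
                             alpha=0.1, scale=(1, 5)):
    """Split conformal interval for an LLM judge score.

    cal_model / cal_human: calibration-set judge scores and reference
    scores. Nonconformity is the absolute residual |human - model|.
    Returns (lower, upper, midpoint).
    """
    n = len(cal_model)
    residuals = sorted(abs(h - m) for m, h in zip(cal_model, cal_human))
    # Finite-sample-corrected quantile index: ceil((n+1)(1-alpha)) - 1.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    lo, hi = new_score - q, new_score + q
    # Ordinal boundary adjustment (an assumption of this sketch):
    # snap the continuous bounds onto the discrete rating scale.
    lo = max(scale[0], math.floor(lo))
    hi = min(scale[1], math.ceil(hi))
    return lo, hi, (lo + hi) / 2  # midpoint as a low-bias point score

cal_m = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
cal_h = [3, 5, 2, 4, 3, 3, 1, 3, 4, 4]
print(split_conformal_interval(cal_m, cal_h, 4, alpha=0.2))  # (3, 5, 4.0)
```

The midpoint in the last position is the interval-based point score the abstract proposes as an alternative to the raw judge score.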


This Talking Pet Collar Is Like a Chatbot for Your Dog

WIRED

Humans have been trying to talk to animals ever since we figured out how to form words. In modern times, we turn to technology for the solution: giving our dogs talking buttons to paw at, or trying to use artificial intelligence to help us understand whales. The latest and perhaps most direct approach to human-animal communication is a voice-activated collar that gives your pet the power to talk back to you. John McHale, a self-described "tech guy" based in Austin, Texas, has a company called Personifi AI. The startup's goal, as the name implies, is to create tech that will "personify everything," as McHale puts it.


Suspect shoots robotic police dog in Massachusetts standoff; manufacturer says it's a first

FOX News

A robotic dog is being thanked by state police in Massachusetts for helping avert a tragedy involving a person barricaded in a home. The dog, named Roscoe, was part of the Massachusetts State Police Bomb Squad and was deployed on March 6 at a Barnstable house after police were fired upon. Police sent two other robots, often used for bomb disposal, into the house along with Roscoe to find the suspect.


ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness

Prasad, Archiki, Saha, Swarnadeep, Zhou, Xiang, Bansal, Mohit

arXiv.org Artificial Intelligence

Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate one. Most existing methods focus solely on whether the reasoning chain leads to the correct conclusion, but this answer-oriented view may confound reasoning quality with other spurious shortcuts to predict the answer. To bridge this gap, we evaluate reasoning chains by viewing them as informal proofs that derive the final answer. Specifically, we propose ReCEval (Reasoning Chain Evaluation), a framework that evaluates reasoning chains via two key properties: (1) correctness, i.e., each step makes a valid inference based on information contained within the step, preceding steps, and input context, and (2) informativeness, i.e., each step provides new information that is helpful towards deriving the generated answer. We evaluate these properties by developing metrics using natural language inference models and V-Information. On multiple datasets, we show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods. We analyze the impact of step boundaries and previous steps on evaluating correctness, and demonstrate that our informativeness metric captures the expected flow of information in high-quality reasoning chains. Finally, we show that scoring reasoning chains based on ReCEval improves downstream task performance. Our code is publicly available at: https://github.com/archiki/ReCEval
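The step-level framing above lends itself to a weakest-link aggregation: a chain is only as good as its least correct or least informative step. The sketch below assumes per-step scores have already been produced (in the paper these come from NLI and V-Information models); the min-based combination is an illustrative assumption, not ReCEval's exact formula.

```python
def chain_score(correctness, informativeness):
    """Aggregate per-step correctness and informativeness into one
    chain-level score by taking the weakest step on either property.

    A single invalid or uninformative step breaks the informal proof,
    so it caps the whole chain's score.
    """
    assert len(correctness) == len(informativeness)
    return min(min(correctness), min(informativeness))

# Step 2 makes an unsupported inference (correctness 0.2),
# so it determines the chain score.
print(chain_score([0.9, 0.2, 0.8], [0.7, 0.6, 0.9]))  # 0.2
```

This captures why answer-only evaluation can be misleading: a chain could end at the right answer even though one step (here, step 2) is invalid.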


Can Language Models Laugh at YouTube Short-form Videos?

Ko, Dayoon, Lee, Sangho, Kim, Gunhee

arXiv.org Artificial Intelligence

As short-form funny videos on social networks gain popularity, there is growing demand for AI models that can understand them for better communication with humans. Unfortunately, previous video humor datasets target specific domains, such as speeches or sitcoms, and mostly focus on verbal cues. We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor. After filtering, we annotate each video with timestamps and text explanations for funny moments. ExFunTube is unique among existing datasets in that its videos cover a wide range of domains with various types of humor that necessitate a multimodal understanding of the content. Also, we develop a zero-shot video-to-text prompting method to maximize video humor understanding by large language models (LLMs). With three different evaluation methods using automatic scores, rationale quality experiments, and human evaluations, we show that our prompting significantly improves LLMs' ability to explain humor.


[2212.07919] ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

#artificialintelligence

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.
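One family of ROSCOE-style scores measures semantic alignment between each reasoning step and the source context, flagging steps that introduce unsupported content. ROSCOE computes alignment with sentence embeddings; to stay self-contained, this sketch substitutes plain word-count vectors with cosine similarity, so the example strings and threshold behavior are illustrative assumptions, not the paper's metric.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def faithfulness(steps, source):
    """Score each reasoning step by its alignment with the source
    context. Low-scoring steps are candidates for hallucinated or
    unsupported reasoning."""
    src = Counter(source.lower().split())
    return [cosine(Counter(s.lower().split()), src) for s in steps]

scores = faithfulness(
    ["Tom has 3 apples", "He buys 2 more", "The moon is made of cheese"],
    "Tom has 3 apples and buys 2 more apples")
print(min(scores))  # → 0.0: the off-topic third step scores lowest
```

The per-step scores make the metric interpretable: rather than a single chain-level number, each step carries its own alignment evidence, which is the property that lets such scores localize reasoning errors.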


Donkey: Building Self Driving Cars with Will Roscoe - Episode 132

#artificialintelligence

Do you wish that you had a self-driving car of your own? With Donkey you can make that dream a reality. This week Will Roscoe shares the story of how he got involved in the arena of self-driving car hobbyists and ended up building a Python library to act as his pilot. We talked about the hardware involved, how he has evolved the code to meet unexpected challenges, and how he plans to improve it in the future. So go build your own self-driving car and take it for a spin!


The near-futurism of Disney Channel original movies -- does it hold up?

#artificialintelligence

Does It Hold Up is a chance to re-experience childhood favorites of books, movies, TV shows, video games, and other cultural phenomena decades later. Have they gotten better like a fine wine, or are we drinking cork? A cornerstone of any pre-teen's life between 1998 and 2007 was the Disney Channel original movie. If you grew up during that time you do not need a refresher on why movies like Halloweentown or Zenon: Girl of the 21st Century were popular -- they were your main option for entertainment because you were constantly at home! (That is what it is like to not have a driver's license.) But you may need a refresher on their content, because I just revisited a bunch of them and they are not what I thought.


The Wager

Cherniak, Christopher

AI Magazine

The Portrait Programs Project grew out of hyperinterdisciplinarianism of the famed Gigabase Sculpture Group, in turn stimulated by recent cutbacks in government support for the arts. The National Endowment for the Humanities and the National Science Foundation had jointly funded the Gigabase Sculpture Project to foster the literary/musical genre of composing genetic codes for novel organisms. Later, artists trained in recombinant DNA technology designed massive Brancusi-esque statues of living cytoplasmic jelly. However, Art For Art's Sake objectives of these giblet sculptors were compromised by precautions necessary after discovery of the "Gogol's-Theorem Bomb" that threatened to get loose and jam all DNA replication in the biosphere; not even viruses would have survived.