
Collaborating Authors

 tara


LLM Stability: A detailed analysis with some surprises

Atil, Berk; Chittams, Alexa; Fu, Liseng; Ture, Ferhan; Xu, Lixinyu; Baldwin, Breck

arXiv.org Artificial Intelligence

LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs, but we have been unable to find work that evaluates LLM stability as the main objective. In our study of 6 deterministically configured LLMs across 8 common tasks with 5 identical runs, we see accuracy variations of up to 10%. In addition, no LLM consistently delivers repeatable accuracy across all tasks. We also show examples of variation that are not normally distributed and compare configurations with zero-shot/few-shot prompting and fine-tuned examples. To better quantify what is going on, we introduce metrics focused on stability: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement over parsed-out answers. We suggest that stability metrics be integrated into leaderboards and research results going forward.
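The two agreement metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes, based only on the abstract's wording, that "total agreement" means all N runs produce the same result for a given question, and that the rate is the fraction of questions where this holds. The function names, the `Answer: X` output convention, and the toy data are hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Hashable, Optional, Sequence


def total_agreement_rate(
    runs: Sequence[Sequence[str]],
    key: Callable[[str], Hashable] = lambda text: text,
) -> float:
    """Fraction of questions on which all N runs agree after applying `key`.

    runs[i][q] is the output of run i on question q (assumed layout).
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one question")
    n_questions = len(runs[0])
    if any(len(run) != n_questions for run in runs):
        raise ValueError("every run must cover the same questions")
    # zip(*runs) groups the N outputs that belong to the same question.
    agreeing = sum(
        1 for outputs in zip(*runs) if len({key(o) for o in outputs}) == 1
    )
    return agreeing / n_questions


def parse_answer(text: str) -> Optional[str]:
    """Hypothetical parser: pull the token after 'Answer:' (assumed format)."""
    match = re.search(r"Answer:\s*(\w+)", text)
    return match.group(1) if match else None


def tar_r(runs: Sequence[Sequence[str]]) -> float:
    """Sketch of TARr@N: agreement over the raw output strings."""
    return total_agreement_rate(runs)


def tar_a(runs: Sequence[Sequence[str]]) -> float:
    """Sketch of TARa@N: agreement over the parsed-out answers."""
    return total_agreement_rate(runs, key=parse_answer)


# Toy data: 3 runs (N = 3) over 3 questions.
example_runs = [
    ["Answer: B", "I think so. Answer: A", "Answer: C"],
    ["Answer: B", "Answer: A", "Answer: D"],
    ["Answer: B", "Answer: A", "Answer: C"],
]
```

On this toy data, raw agreement holds for one of the three questions, while parsed-answer agreement holds for two of the three, because the second question differs only in surrounding wording. Under these assumptions raw agreement can never exceed parsed-answer agreement, since identical raw strings always parse to the same answer.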


What if Every Decision You Made Came With a Risk Score?

Slate

This story is part of Future Tense Fiction, a monthly series of short stories from Future Tense and Arizona State University's Center for Science and the Imagination about how technology and science will change our lives. By the time Tara returned from the protest, SafeT gauged her Wellness at 60% and Chase felt sick. For the last two hours he'd watched the number on his phone's app tick down, from safe green to warning yellow: 87%, 74%, 60%. On his newsfeed, masked chanters waved signs before the wire cage shielding the five megapipes that breached the marshy shore of Lake Michigan. Each pipe was owned by a consortium of Lakes United companies. Their great steel veins wormed the city, bearing water from LU to the drought-scarred West and South, whose nations paid more per acre-foot than Milwaukee's citizens ever could. On the feed Chase hadn't been able to see Tara or the sign she'd painted that morning: Our Lake, Our Water. What he had seen were the security corps of at least three consortia, clumped beneath their ever-circling camera-drones, bull-horning the chanters that they were risking corporate slander. If arrested, they'd be hauled off to one of the consortia's private prisons. There they could be coerced into confessing they were linebreakers, guerillas who spliced pipes to siphon off clean water to Milwaukee neighborhoods that couldn't afford consortia prices. Protestors sometimes returned from these prisons. Fingers numb, Chase had tapped SafeT to view the breakdown of Tara's Wellness aggregate into its individual components: risk of arrest (15%), risk of indictment (20%), risk of job loss (27%), risk of injury (31%). Even when she had texted "home in 30" and he'd cleared her route in the SafeT map--low smoke risk, low contagion risk, 93% chance of safe arrival--his jaw only eased when she stepped through the door. Tara's thin face was ferocious, cheeks red against her yellow hair. Black grease spotted her strong hands.
Over the decade they'd shared, he'd watched age sharpen her into herself. Now, impassioned, she was fiercely beautiful. He almost forgot her yellow number, until she saw him, and her smile sagged.