A Task-Level Case Study

Neural Information Processing Systems 

This section illustrates how a model's performance can vary across different tasks associated with the same new term. We analyzed the performance of Llama-3-Instruct-70B on the new term "wokely," defined as an adjective meaning "Of little worth; poor, mean, paltry." Under the zero-shot Base setting, the model's performance differed across three tasks:

Task: COMA
Question: The book's cover was described as wokely by several reviewers. I am hesitating among these options. Help me choose the more likely effect:
  A. it struggled to attract attention on the bookstore displays despite a compelling narrative inside.
  B. many readers were enticed to buy it, strengthening its presence on the bestseller list.
  C. readers were intrigued and the book's sales experienced an unexpected surge worldwide.
  D. the publisher decided to release a limited edition with a special hardback velvet cover.
Response: A (✓)

Task: COST
Question: The goods at the flea market appeared distinctly _, making it hard to find a satisfying purchase.
Response: D (X)

Task: CSJ
Question: His contributions to the project were considered wokely, barely making any impact. Is this example in line with commonsense and grammatically correct?
Response: Incorrect (X)

As observed, the model answered correctly only in the COMA task and failed in the other two. In the COMA task, the model successfully inferred that "wokely" carries a negative connotation, allowing it to select option A. This demonstrates its ability to comprehend the new term within a helpful context.

However, in the COST task, where the model needed to use the new term and distinguish it from similar choices, it struggled. Although the phrase "hard to find a satisfying purchase" called for a negative term, the model chose "Worthy," which is grammatically correct but semantically incorrect.

In the CSJ task, the model was required to process and interpret the new term in the absence of helpful context. The usage in the sentence matched the definition of "wokely" perfectly, yet the model erroneously judged the sentence as incorrect because the task was a judgment-based evaluation.
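The per-task outcomes reported above can be recorded and tallied with a few lines of Python. This is only an illustrative sketch: the names `TaskOutcome`, `WOKELY_OUTCOMES`, and `summarize` are hypothetical and not part of any benchmark code, and the records contain nothing beyond the three responses and correctness marks listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """One row of the case-study table (hypothetical helper type)."""
    task: str       # task name, e.g. "COMA"
    response: str   # the model's response as reported in the table
    correct: bool   # whether the table marks the response as correct


# Outcomes for the term "wokely" under the zero-shot Base setting,
# copied from the table above; nothing else is assumed.
WOKELY_OUTCOMES = [
    TaskOutcome("COMA", "A", True),
    TaskOutcome("COST", "D", False),
    TaskOutcome("CSJ", "Incorrect", False),
]


def summarize(outcomes):
    """Return (tasks answered correctly, tasks failed, fraction correct)."""
    passed = [o.task for o in outcomes if o.correct]
    failed = [o.task for o in outcomes if not o.correct]
    fraction = len(passed) / len(outcomes) if outcomes else 0.0
    return passed, failed, fraction


if __name__ == "__main__":
    passed, failed, fraction = summarize(WOKELY_OUTCOMES)
    print(f"correct: {passed}, failed: {failed}, fraction correct: {fraction:.2f}")
```

Keeping one record per (term, task) pair, rather than a single aggregate score per term, is what makes this kind of task-level discrepancy visible.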
