negative reference
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References
Gabburo, Matteo, Garg, Siddhant, Koncel-Kedziorski, Rik, Moschitti, Alessandro
Evaluation of QA systems is very challenging and expensive, with the most reliable approach being human annotation of answer correctness. Recent works (AVA, BEM) have shown that transformer LM encoder-based similarity metrics transfer well to QA evaluation, but they are limited by their use of a single correct reference answer. We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation), using multiple reference answers (combining multiple correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets, and show that it outperforms previous baselines and obtains the highest correlation with human annotations.
Rating Friends Without Making Enemies
Adamic, Lada A. (University of Michigan) | Lauterbach, Debra (University of Michigan) | Teng, Chun-Yuen (University of Michigan) | Ackerman, Mark (University of Michigan)
As online social networks expand their role beyond maintaining existing relationships, they may look to more faceted ratings to support the formation of new connections between their users. Our study focuses on one community employing faceted ratings, CouchSurfing.org, and combines data analysis of ratings, a large-scale survey, and in-depth interviews. In order to understand the ratings, we revisit the notions of friendship and trust and uncover an asymmetry: close friendship includes trust, but high levels of trust can be achieved without close friendship. To users, providing faceted ratings presents challenges, including differentiating and quantifying inherently subjective feelings such as friendship and trust, concern over a friend's reaction to a rating, and knowledge of how ratings can affect others' reputations. One consequence of these issues is the near absence of negative feedback, even though a small portion of actual experiences and privately held ratings are negative. We show how users take this into account when formulating and interpreting ratings, and discuss designs that could encourage more balanced feedback.