Trivia


A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Balepur, Nishant, Shu, Matthew, Sung, Yoo Yeon, Goldfarb-Tarrant, Seraphina, Feng, Shi, Yang, Fumeng, Rudinger, Rachel, Boyd-Graber, Jordan Lee

arXiv.org Artificial Intelligence

To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.


Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation

Lin, Zhen, Trivedi, Shubhendu, Sun, Jimeng

arXiv.org Artificial Intelligence

The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
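The core idea, reweighting per-token log-probabilities with attention values from selected heads, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and obtaining token log-probabilities and attention values from the base LLM is assumed to happen elsewhere.

```python
import math

def contextualized_sequence_likelihood(token_logprobs, attention_weights):
    """Sketch of an attention-weighted sequence confidence score.

    token_logprobs: per-token log-probabilities of the generated sequence.
    attention_weights: per-token weights (e.g., attention values from heads
    selected on a validation set); they need not be normalized.
    """
    total = sum(attention_weights)
    weights = [w / total for w in attention_weights]
    # Weighted average log-likelihood; exponentiate for a probability-like score.
    weighted_loglik = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return math.exp(weighted_loglik)
```

With uniform weights this reduces to the length-normalized sequence likelihood; the gain comes from letting validation-selected attention heads down-weight tokens that merely reflect phrasing rather than content.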


Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Lin, Zhen, Trivedi, Shubhendu, Sun, Jimeng

arXiv.org Machine Learning

Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for black-box LLMs. We first differentiate uncertainty vs confidence: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence on a particular prediction/generation. We then propose and compare several confidence/uncertainty metrics, applying them to selective NLG where unreliable results could either be ignored or yielded for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple metric for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
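A dispersion-style metric of this kind can be sketched without any white-box access: sample several responses to the same prompt, group them into semantic-equivalence clusters, and treat greater spread as higher uncertainty. The clustering predicate below is an assumption standing in for a real semantic-equivalence check (e.g., bidirectional entailment); the normalization is one simple choice among several.

```python
def semantic_dispersion(responses, equivalent):
    """Sketch of a simple dispersion-style uncertainty metric for black-box LLMs.

    responses: sampled generations for the same prompt.
    equivalent: pairwise predicate deciding semantic equivalence
    (an assumption here; in practice an entailment-based check).
    """
    clusters = []
    for r in responses:
        for c in clusters:
            if equivalent(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    # More semantically distinct clusters -> higher uncertainty, scaled to [0, 1].
    return (len(clusters) - 1) / max(len(responses) - 1, 1)
```

Under this sketch, identical responses yield 0 (confident) and fully distinct responses yield 1 (uncertain), which is the behavior a selective-generation pipeline would threshold on.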


Traffic-Domain Video Question Answering with Automatic Captioning

Qasemi, Ehsan, Francis, Jonathan M., Oltramari, Alessandro

arXiv.org Artificial Intelligence

Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities within the domains of Intelligent Traffic Monitoring and Intelligent Transportation Systems. Nevertheless, the integration of urban traffic scene knowledge into VidQA systems has received limited attention in previous research endeavors. In this work, we present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models. Empirical findings obtained from the SUTD-TrafficQA task highlight the substantial enhancements achieved by TRIVIA, elevating the accuracy of representative video-language models by a remarkable 6.5 points (19.88%) compared to baseline settings. This pioneering methodology holds great promise for driving advancements in the field, inspiring researchers and practitioners alike to unlock the full potential of emerging video-language models in traffic-related applications.


15 questions to ask your Amazon Echo that will leave you laughing out loud


Daily Mail - Science & tech

If you have an Amazon Echo device in your home, you most likely use it for everyday tasks, such as listening to music or checking the news. However, there is a range of interesting things Alexa can also do. Once you know what to ask, you can put your Alexa to the test and ask her to supply you with hilarious jokes, pop culture references, trivia, and much more. These hidden features will certainly not leave users disappointed and are worth giving a go. Here is a list of 15 questions you can ask Alexa to lighten the mood or to tackle your boredom.


Improving Question Answering with Generation of NQ-like Questions

Bandyopadhyay, Saptarashmi, Pal, Shraman, Zou, Hao, Chandra, Abhranil, Boyd-Graber, Jordan

arXiv.org Artificial Intelligence

Question Answering (QA) systems require a large amount of annotated data, which is costly and time-consuming to gather. Converting existing QA benchmark datasets is challenging due to their differing formats and complexities. To address these issues, we propose an algorithm to automatically generate shorter questions resembling day-to-day human communication, as in the Natural Questions (NQ) dataset, from longer trivia questions in the Quizbowl (QB) dataset by leveraging stylistic conversion between the datasets. This provides an automated way to generate more data for our QA systems. To ensure quality as well as quantity of data, we detect and remove ill-formed questions using a neural classifier. We demonstrate that in a low-resource setting, using the generated data improves QA performance over the baseline system on both NQ and QB data. Our algorithm improves the scalability of training data while maintaining data quality for QA systems.
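The generate-then-filter pipeline described above can be sketched as follows. Everything here is a hypothetical simplification: the real system uses a learned style-conversion step and a trained neural classifier, whereas this sketch takes the final clause of a Quizbowl-style question as a crude NQ-like candidate and accepts a classifier predicate as a stand-in filter.

```python
def generate_nq_like(qb_question, well_formed):
    """Hypothetical sketch of a generate-then-filter question pipeline.

    qb_question: a long, multi-clause Quizbowl-style question.
    well_formed: a predicate standing in for the neural classifier
    that removes ill-formed generations (an assumption here).
    """
    # Quizbowl questions often end with a short "for 10 points, name this X"
    # clause; take the final clause as a crude NQ-style candidate.
    candidate = qb_question.split(",")[-1].strip().rstrip(".") + "?"
    return candidate if well_formed(candidate) else None
```

The filtering step is what keeps the generated data usable: candidates rejected by the classifier are simply dropped rather than added to the training pool.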


AI Papers to Read in 2022

#artificialintelligence

Further Reading: Regarding this discussion, reading the original paper and the authors' subsequent reply can be interesting. Fast forward to 2022: although the authors rectified most of the concerns, the original lesson should not be forgotten, namely that transparency and reproducibility are paramount.


Prakash

AAAI Conferences

Trivia is any fact about an entity which is interesting due to its unusualness, uniqueness, unexpectedness or weirdness. In this paper, we propose a novel approach for mining entity trivia from their Wikipedia pages. Given an entity, our system extracts relevant sentences from its Wikipedia page and produces a list of sentences ranked based on their interestingness as trivia. At the heart of our system lies an interestingness ranker which learns the notion of interestingness through a rich set of domain-independent linguistic and entity-based features. Our ranking model is trained by leveraging existing user-generated trivia data available on the Web instead of creating new labeled data. We evaluated our system on the movies domain and observed that it performs significantly better than the defined baselines. A thorough qualitative analysis of the results revealed that our rich set of features indeed helps in surfacing interesting trivia in the top ranks.
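A ranker of this kind can be sketched as scoring each candidate sentence with learned feature weights and sorting. This is an assumption-laden simplification: the feature set and the learned model in the paper are richer, and `featurize` and `weights` below are hypothetical placeholders.

```python
def rank_trivia(sentences, featurize, weights):
    """Sketch of a pointwise interestingness ranker.

    featurize: maps a sentence to a feature vector
    (e.g., linguistic and entity-based features; hypothetical here).
    weights: learned feature weights (hypothetical here).
    """
    def score(sentence):
        # Linear score over the sentence's feature vector.
        return sum(w * f for w, f in zip(weights, featurize(sentence)))
    # Most interesting candidates first.
    return sorted(sentences, key=score, reverse=True)
```

Training the weights from existing web trivia, as the abstract describes, avoids the cost of labeling new data while still grounding "interestingness" in human judgments.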


Today on Technology: Your Online Guidebook on Digital Transformation

#artificialintelligence

"Today each organization must know how to build its digital capability. Because now every company is a software company, every organization is a digital organization." Recently, an article published by the Harvard Business Review gave holistic advice on how in terms of a technology renaissance, we ought to not forget our humanistic side. A very unconventional beginning to a write-up which will solely speak about the whole nine yards of tech, but since digital transformation services are about bringing change to the existing reality, it'll cease to exist sans a touch of humanism. The latter half of the 20th century was the genesis of the'Age of Information' where progression was made from orthodox industrial techniques to the forever evolving Information and Technology. From analogue, everything turned digital. Let's understand it layer by layer. In simple terms, Digital transformation is the impact and influence of technology into each and every business vertical. And when we say technology, we mean digital. But it doesn't restrict itself to that. It's equally a colossal cultural change that thrives on experimentation, brainstorming, challenging metacognitive skills and coping with failure.


'Chipotle IQ' test offers prizes for first 250,000 winners, finds them by lunchtime

FOX News

Despite only announcing the promotion on Wednesday morning, Chipotle Mexican Grill has already awarded all of the allotted prizes to the first 250,000 winning participants of an online "Chipotle IQ" quiz. The quiz, which promised digital buy-one-get-one coupons for winners, tasked participants with successfully answering 10 trivia questions about the chain's "sourcing, ingredients, recipes, and sustainability efforts." Some of the questions, such as, "What percentage of Chipotle bowls are made of compostable fiber?", seem quite obviously designed to tout the brand's sustainability measures (spoiler alert: it's 100%), while others, like "When is Chipotle's birthday," are a little more obscure. "There were 250K Chipotle brainiacs who came before you, so unfortunately we're fresh out of BOGO prizes," reads a message greeting visitors to the Chipotle IQ test. "Chipotle IQ allows our customers to discover Chipotle in a whole new way and rewards our most devoted brand experts," said Chris Brandt, Chipotle's chief marketing officer, in a press release. "We're introducing a test our fans will actually be excited to take."