field goal
What Has Been Lost with Synthetic Evaluation?
Gill, Alexander, Ravichander, Abhilasha, Marasović, Ana
Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Sweden (0.14)
- Europe > Denmark (0.14)
- (33 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Leisure & Entertainment > Sports > Football (1.00)
- Government > Regional Government > Europe Government > United Kingdom Government (1.00)
- Education (1.00)
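The generation setup described in the abstract can be pictured as a prompt that asks an LLM to produce a guideline-conforming variant of an existing instance. A minimal sketch, assuming a hypothetical `make_edit_prompt` helper and illustrative template wording (the paper's actual prompts are not reproduced here):

```python
# Hypothetical sketch: assembling a CondaQA-style generation prompt.
# The template wording and `make_edit_prompt` are assumptions, not the
# authors' actual prompts; the result would be sent to an LLM.

def make_edit_prompt(passage: str, negation_cue: str) -> str:
    """Ask for a paraphrase edit of the sentence containing the cue,
    mirroring the kind of edit CondaQA annotators were instructed to make."""
    return (
        "Below is a passage containing the negation cue "
        f"'{negation_cue}'.\n\n"
        f"Passage: {passage}\n\n"
        "Rewrite the passage so that the sentence containing the cue is "
        "paraphrased without changing its meaning. Return only the edited "
        "passage."
    )

prompt = make_edit_prompt(
    passage="The drug did not reduce symptoms in the trial.",
    negation_cue="not",
)
print(prompt)
```

Validity would then be checked against the original annotation guidelines, and difficulty measured by evaluating models on the generated variant.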
Learning to Attribute with Attention
Cohen-Wang, Benjamin, Chuang, Yung-Sung, Madry, Aleksander
Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2.
- Europe > United Kingdom (0.27)
- Asia > Japan (0.04)
- Europe > France (0.04)
- (7 more...)
- Leisure & Entertainment > Sports (0.94)
- Government > Regional Government > North America Government > United States Government (0.93)
- Media (0.68)
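The core idea — per-head attention weights as features, fit against ablation-measured effects — can be sketched with a toy least-squares fit. The shapes, the synthetic data, and the linear model are illustrative assumptions; the released code at github.com/MadryLab/AT2 is the reference implementation:

```python
# Illustrative sketch of AT2's core idea: treat per-head attention weights
# as features and learn to combine them using ablation signal.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_heads = 50, 12

# attn[i, h]: attention from the generated span to source token i in head h
attn = rng.random((n_tokens, n_heads))

# ablation_effect[i]: drop in target log-probability when token i is ablated
# (simulated here as a noisy linear function of the head weights)
true_head_weights = rng.random(n_heads)
ablation_effect = attn @ true_head_weights + 0.01 * rng.normal(size=n_tokens)

# Learn one combination of heads shared across tokens
learned, *_ = np.linalg.lstsq(attn, ablation_effect, rcond=None)

# Attribution score for each source token, from attention alone
scores = attn @ learned
print(int(np.argmax(scores)))  # most influential source token
```

Once the head combination is learned, new attributions need only a forward pass, avoiding the many ablation runs of the direct approach.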
The Empirical Impact of Data Sanitization on Language Models
Pal, Anwesan, Bhargava, Radhika, Hinsz, Kyle, Esterhuizen, Jacques, Bhattacharya, Sudipta
Data sanitization in the context of language modeling involves identifying sensitive content, such as personally identifiable information (PII), and redacting it from a dataset corpus. It is a common practice used in natural language processing (NLP) to maintain privacy. Nevertheless, the impact of data sanitization on the language understanding capability of a language model remains less studied. This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks, including comprehension question answering (Q&A), entailment, sentiment analysis, and text classification. Our experiments cover a wide spectrum, from finetuning small-scale language models to prompting large language models (LLMs), on both original and sanitized datasets, comparing their performance across the tasks. Interestingly, our results suggest that for some tasks, such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%, while for tasks such as comprehension Q&A a large drop of >25% in performance is observed on redacted queries compared to the original. For tasks with a higher impact, we take a deeper dive to inspect the presence of task-critical entities. Finally, we investigate the correlation between performance and the number of redacted entities, and suggest a strategy to repair an already redacted dataset by means of content-based subsampling. Additional details are available at https://sites.google.com/view/datasan.
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia (0.04)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Sports > Football (0.70)
- Law (0.68)
- Health & Medicine (0.67)
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Kim, Joongwon, Paranjape, Bhargavi, Khot, Tushar, Hajishirzi, Hannaneh
Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to address a diverse set of complex tasks involving numerical, tabular, and knowledge-based reasoning. Husky iterates between two stages: 1) generating the next action to take towards solving a given task and 2) executing the action using expert models and updating the current solution state. We identify a thorough ontology of actions for addressing complex tasks and curate high-quality data to train expert models for executing these actions. Our experiments show that Husky outperforms prior language agents across 14 evaluation datasets. Moreover, we introduce HuskyQA, a new evaluation set which stress tests language agents for mixed-tool reasoning, with a focus on retrieving missing knowledge and performing numerical reasoning. Despite using 7B models, Husky matches or even exceeds frontier LMs such as GPT-4 on these tasks, showcasing the efficacy of our holistic approach in addressing complex reasoning problems. Our code and models are available at https://github.com/agent-husky/Husky-v1.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (50 more...)
- Workflow (0.95)
- Research Report > New Finding (0.67)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Transportation > Ground > Rail (1.00)
- (12 more...)
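Husky's two-stage loop — generate the next action, then execute it with an expert model and update the solution state — can be sketched schematically. The toy tools, the stopping convention, and the hard-coded generator below are all assumptions standing in for the learned 7B components:

```python
# Schematic sketch of the iterate-until-done agent loop: an action
# generator proposes (tool, input) pairs; expert executors update state.
def action_generator(state: dict) -> tuple[str, str]:
    """Stand-in for Husky's learned action generator."""
    if "population" not in state:
        return ("search", "population of city X")
    if "answer" not in state:
        return ("math", f"{state['population']} * 2")
    return ("finish", "")

EXPERTS = {
    "search": lambda q, s: s.update(population=500_000),   # retrieval stub
    "math": lambda expr, s: s.update(answer=eval(expr)),   # toy calculator
}

state: dict = {}
while True:
    action, arg = action_generator(state)
    if action == "finish":
        break
    EXPERTS[action](arg, state)
print(state["answer"])
```

The unified action space is what lets one agent cover numerical, tabular, and knowledge-based tasks: the generator only ever chooses among a fixed ontology of actions, and each action has a dedicated expert executor.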
Estimating the age-conditioned average treatment effects curves: An application for assessing load-management strategies in the NBA
Nakamura-Sakai, Shinpei, Forastiere, Laura, Macdonald, Brian
In the realm of competitive sports, understanding the performance dynamics of athletes, represented by the age curve (showing progression, peak, and decline), is vital. Our research introduces a novel framework for quantifying age-specific treatment effects, enhancing the granularity of performance trajectory analysis. Firstly, we propose a methodology for estimating the age curve using game-level data, diverging from traditional season-level data approaches, and tackling its inherent complexities with a meta-learner framework that leverages advanced machine learning models. This approach uncovers intricate non-linear patterns missed by existing methods. Secondly, our framework enables the identification of causal effects, allowing for a detailed examination of age curves under various conditions. By defining the Age-Conditioned Treatment Effect (ACTE), we facilitate the exploration of causal relationships regarding treatment impacts at specific ages. Finally, applying this methodology to study the effects of rest days on performance metrics, particularly across different ages, offers valuable insights into load management strategies' effectiveness. Our findings underscore the importance of tailored rest periods, highlighting their positive impact on athlete performance and suggesting a reevaluation of current management practices for optimizing athlete performance.
- North America > United States > Connecticut > New Haven County > New Haven (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.86)
- Leisure & Entertainment > Sports > Basketball (1.00)
- Health & Medicine > Consumer Health (1.00)
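The ACTE idea can be illustrated with a toy T-learner: fit one outcome curve for rested games and one for non-rested games as functions of age, then take their difference at each age. The simulated data, the quadratic fit, and the T-learner choice are illustrative assumptions (the paper uses a meta-learner framework with more flexible ML models):

```python
# Toy Age-Conditioned Treatment Effect (ACTE) estimate via a T-learner.
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(19, 38, size=2000)
rest = rng.integers(0, 2, size=2000)  # 1 = game played after a rest day
# Simulated performance: inverted-U age curve plus a rest benefit
# that grows with age.
perf = (-0.05 * (age - 27) ** 2
        + rest * 0.1 * (age - 19)
        + rng.normal(0, 0.5, 2000))

def fit_curve(x, y):
    return np.poly1d(np.polyfit(x, y, deg=2))

mu_rested = fit_curve(age[rest == 1], perf[rest == 1])
mu_not = fit_curve(age[rest == 0], perf[rest == 0])

def acte(a: float) -> float:
    """Estimated effect of a rest day on performance at age a."""
    return float(mu_rested(a) - mu_not(a))

print(round(acte(35), 2), round(acte(21), 2))
```

In this simulation the estimated rest benefit is larger for older players, the qualitative pattern the abstract's load-management findings point to.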
Chiefs' Harrison Butker drills longest field goal in Super Bowl history, breaking record set in 1st half
The first score of the Super Bowl turned out to be historic, but the record didn't stand for long. Kansas City Chiefs kicker Harrison Butker drilled his second field goal of Super Bowl LVIII against the San Francisco 49ers, a 57-yarder that rewrote the record for the longest kick in the "Big Game." The previous record had been set in the first half by 49ers rookie kicker Jake Moody, who connected from 56 yards.
- North America > United States > Missouri > Jackson County > Kansas City (0.64)
- North America > United States > California > San Francisco County > San Francisco (0.33)
- North America > United States > Nevada > Clark County > Las Vegas (0.06)
- North America > United States > Michigan (0.06)
EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning
Mekala, Rajasekhar Reddy, Razeghi, Yasaman, Singh, Sameer
Language models are achieving impressive performance on various tasks by aggressively adopting inference-time prompting techniques, such as zero-shot and few-shot prompting. In this work, we introduce EchoPrompt, a simple yet effective approach that prompts the model to rephrase its queries before answering them. EchoPrompt is adapted for both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting. Experimental results show that EchoPrompt yields substantial improvements across all these settings for four families of causal language models. These improvements are observed across various numerical reasoning (e.g. GSM8K, SVAMP), reading comprehension (e.g. DROP), and logical reasoning (e.g. Coin Flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. We investigate the factors contributing to EchoPrompt's effectiveness through ablation studies, which reveal that both the original query and the model-generated rephrased version are instrumental in its performance gains. Our empirical results indicate that EchoPrompt is an effective technique that enhances in-context learning performance. We recommend incorporating EchoPrompt into various baseline prompting strategies to achieve performance boosts.
- Europe (0.14)
- North America > Dominican Republic (0.04)
- North America > Central America (0.04)
- (4 more...)
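The zero-shot EchoPrompt template amounts to prepending a rephrase instruction to the query. The exact wording below is an assumption; the paper evaluates several rephrasing variants:

```python
# Minimal sketch of an EchoPrompt-style zero-shot template: instruct the
# model to restate the query before answering it.
def echo_prompt(query: str) -> str:
    return (
        f"Q: {query}\n"
        "First, repeat the question in your own words. "
        "Then answer it step by step.\n"
        "A: Let's repeat the question."
    )

p = echo_prompt("A coin is heads up. Amy flips it. Is it still heads up?")
print(p)
```

The ablations in the paper suggest that keeping both the original query and the model-generated restatement in context is what drives the gains, which this template preserves by construction.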
Successive Prompting for Decomposing Complex Questions
Dua, Dheeru, Gupta, Shivanshu, Singh, Sameer, Gardner, Matt
Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step, (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate a synthetic dataset which can be used to bootstrap a model's ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement of ~5% absolute F1 on a few-shot version of the DROP dataset when compared with a state-of-the-art model with the same supervision.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Mexico > Veracruz (0.05)
- Asia > China > Hong Kong (0.04)
- (6 more...)
- Research Report (0.70)
- Workflow (0.66)
A Neural-Symbolic Approach to Natural Language Understanding
Liu, Zhixuan, Wang, Zihao, Lin, Yuan, Li, Hang
Deep neural networks, empowered by pre-trained language models, have achieved remarkable results in natural language understanding (NLU) tasks. However, their performance can deteriorate drastically when logical reasoning is needed. This is because NLU in principle depends not only on analogical reasoning, which deep neural networks are good at, but also on logical reasoning. According to the dual-process theory, analogical reasoning and logical reasoning are respectively carried out by System 1 and System 2 in the human brain. Inspired by the theory, we present a novel framework for NLU called Neural-Symbolic Processor (NSP), which performs analogical reasoning based on neural processing and logical reasoning based on both neural and symbolic processing. As a case study, we conduct experiments on two NLU tasks, question answering (QA) and natural language inference (NLI), where numerical reasoning (a type of logical reasoning) is necessary. The experimental results show that our method significantly outperforms state-of-the-art methods on both tasks.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (3 more...)
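The System-1/System-2 split can be illustrated as routing: inputs that need exact numerical reasoning go to a symbolic processor, everything else to a neural component. The routing rule, the regex-based number extraction, and both processors below are toy assumptions, not the NSP architecture itself:

```python
# Illustrative neural-symbolic routing for a toy NLI task: compare two
# numbers exactly (System 2) when the input calls for it, otherwise fall
# back to an analogical stub (System 1).
import re

def symbolic_processor(text: str) -> str:
    """System 2 stand-in: extract the two numbers and compare exactly."""
    a, b = map(int, re.findall(r"\d+", text))
    return "entailment" if a >= b else "contradiction"

def neural_processor(text: str) -> str:
    """System 1 stand-in: a fixed analogical guess."""
    return "neutral"

def nsp(text: str) -> str:
    needs_logic = bool(re.search(r"\d+.*(more than|at least).*\d+", text))
    return symbolic_processor(text) if needs_logic else neural_processor(text)

print(nsp("The team scored 57 points, more than the 56 needed."))
print(nsp("The crowd seemed happy."))
```

The point of the hybrid design is exactly this division of labor: the neural side handles fuzzy similarity, while the symbolic side guarantees exact arithmetic and comparison.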
Teaching Neural Module Networks to Do Arithmetic
Chen, Jiayi, Guo, Xiao-Yu, Li, Yuan-Fang, Haffari, Gholamreza
Answering complex questions that require multi-step, multi-type reasoning over raw text is challenging, especially when conducting numerical reasoning. Neural Module Networks (NMNs) follow the programmer-interpreter framework and design trainable modules to learn different reasoning skills. However, NMNs have only limited reasoning abilities and lack numerical reasoning capability. We upgrade NMNs by: (a) bridging the gap between the interpreter and the complex questions; (b) introducing addition and subtraction modules that perform numerical reasoning over numbers. On a subset of DROP, experimental results show that our proposed methods enhance NMNs' numerical reasoning skills, yielding a 17.7% improvement in F1 score and significantly outperforming previous state-of-the-art models.
- Europe > Austria > Vienna (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (13 more...)
- Government (0.95)
- Leisure & Entertainment > Sports > Football (0.74)
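The proposed addition and subtraction modules can be pictured as operations over soft attention on the numbers extracted from a passage. The two-attention-map interface below is an assumption modeled loosely on how NMN modules pass attention between one another:

```python
# Toy addition/subtraction modules: each takes two attention distributions
# over the passage's numbers and combines the softly selected values.
import numpy as np

numbers = np.array([57.0, 56.0, 3.0])  # numbers extracted from a passage

def add_module(p: np.ndarray, q: np.ndarray) -> float:
    """Sum of the two softly selected numbers."""
    return float(p @ numbers + q @ numbers)

def sub_module(p: np.ndarray, q: np.ndarray) -> float:
    """Difference of the two softly selected numbers."""
    return float(p @ numbers - q @ numbers)

# With one-hot attention, the modules recover exact arithmetic:
p = np.array([1.0, 0.0, 0.0])  # attends to 57
q = np.array([0.0, 1.0, 0.0])  # attends to 56
print(sub_module(p, q))  # 57 - 56
```

Because the modules are differentiable in the attention maps, they can be trained end-to-end with the rest of the network, which is what makes them compatible with the NMN programmer-interpreter framework.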