Koncel-Kedziorski, Rik
Are Language Model Logits Calibrated?
Lovering, Charles, Krumdick, Michael, Lai, Viet Dac, Kumar, Nilesh, Reddy, Varshini, Koncel-Kedziorski, Rik, Tanner, Chris
Some information is factual (e.g., "Paris is in France"), whereas other information is probabilistic (e.g., "the coin flip will be a [Heads/Tails]"). We believe that good Language Models (LMs) should understand and reflect this nuance. Our work investigates this by testing whether LMs' output probabilities are calibrated to their textual contexts. We define model "calibration" as the degree to which the output probabilities of candidate tokens are aligned with the relative likelihood that should be inferred from the given context. For example, if the context concerns two equally likely options (e.g., heads or tails for a fair coin), the output probabilities should reflect this. Likewise, context that concerns non-uniformly likely events (e.g., rolling a six with a die) should also be captured with proportionate output probabilities. We find that even in simple settings the best LMs (1) are poorly calibrated, and (2) have systematic biases (e.g., preferred colors and sensitivities to word orderings). For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options' implied likelihood, whereas Llama-3.1-8B often picks the second. Our other consistent finding is mode-collapse: instruction-tuned models often over-allocate probability mass to a single option. These systematic biases introduce non-intuitive model behavior, making models harder for users to understand.
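The calibration probe described here is simple to reproduce. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name, prompt, and option tokens are illustrative choices, not the paper's exact setup. It compares the model's next-token probabilities for two candidate options against the probabilities the context implies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper evaluates several LMs.
name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "A fair coin is flipped. It lands on"
options = [" Heads", " Tails"]  # the context implies a 50/50 split
implied = [0.5, 0.5]

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Probability mass on each option's first token, renormalized over
# just the two candidates.
ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
probs = torch.softmax(logits[ids], dim=0)

for opt, p, q in zip(options, probs.tolist(), implied):
    print(f"{opt.strip():>5}: model={p:.2f}  context-implied={q:.2f}")
# A well-calibrated model would print ~0.50 for both options.
```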
DocFinQA: A Long-Context Financial Reasoning Dataset
Reddy, Varshini, Koncel-Kedziorski, Rik, Lai, Viet Dac, Tanner, Chris
Quantitative reasoning research in the financial domain requires realistic tasks and data, because decisions made in business and finance carry significant consequences. Financial professionals often interact with documents hundreds of pages long, but most research datasets drastically reduce this context length. To address this, we introduce a long-document financial QA task. We augment 7,621 questions from the existing FinQA dataset with full-document context, extending the average context length per question from under 700 words in FinQA to 123k words in DocFinQA. We conduct extensive experiments with retrieval-based QA pipelines and long-context language models on the augmented data. Our results show that DocFinQA poses challenges even for the strongest state-of-the-art systems.
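As a rough illustration of the retrieval-based pipelines evaluated here, the sketch below chunks a long filing, ranks chunks against the question with an off-the-shelf sentence encoder, and keeps only the top-k chunks as context for a downstream QA model. The chunk size, encoder, and k are illustrative assumptions, not the paper's configuration.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder

def retrieve_context(document: str, question: str,
                     chunk_words: int = 200, k: int = 5) -> str:
    """Return the k chunks of `document` most similar to `question`."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    chunk_emb = encoder.encode(chunks, convert_to_tensor=True)
    q_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, chunk_emb)[0]          # (num_chunks,)
    top = scores.topk(min(k, len(chunks))).indices.tolist()
    # Keep original document order so the QA model sees coherent context.
    return "\n\n".join(chunks[i] for i in sorted(top))
```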
BizBench: A Quantitative Reasoning Benchmark for Business and Finance
Koncel-Kedziorski, Rik, Krumdick, Michael, Lai, Viet, Reddy, Varshini, Lovering, Charles, Tanner, Chris
As large language models (LLMs) impact a growing number of complex domains, it is increasingly important to have fair, accurate, and rigorous evaluation benchmarks. Evaluating the reasoning skills required for business and financial NLP is a particularly difficult challenge. We introduce BizBench, a new benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises 8 quantitative reasoning tasks. Notably, BizBench targets the complex task of question answering (QA) over structured and unstructured financial data via program synthesis (i.e., code generation). We introduce three diverse, financially themed code-generation tasks built from newly collected and augmented QA data. Additionally, we isolate the distinct financial reasoning capabilities required to solve these QA tasks: reading comprehension of financial text and tables, needed to extract correct intermediate values, and domain knowledge (e.g., financial formulas), needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to extract numeric entities from financial documents, and capacity to solve problems with code. We conduct an in-depth evaluation of open-source and commercial LLMs, illustrating that BizBench is a challenging benchmark for quantitative reasoning in the finance and business domain.
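To make the program-synthesis QA format concrete, here is a toy example of the kind of input/output these tasks involve; the question, extracted values, and "model-generated" program below are invented for illustration. The final answer comes from executing the generated code rather than from the model's text output.

```python
question = ("What was the gross margin in 2022, given revenue of $120M "
            "and cost of goods sold of $84M?")

# A model-generated program for this question might look like:
generated_program = """
revenue = 120.0   # $M, extracted from the filing
cogs = 84.0       # $M, extracted from the filing
answer = (revenue - cogs) / revenue   # gross margin formula
"""

scope: dict = {}
exec(generated_program, scope)     # execute the candidate program
print(f"{scope['answer']:.0%}")    # -> 30%
```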
Learning Answer Generation using Supervision from Automatic Question Answering Evaluators
Gabburo, Matteo, Garg, Siddhant, Koncel-Kedziorski, Rik, Moschitti, Alessandro
Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (in a retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting the training data with answers generated by the GenQA model and labeled by GAVA statically, before training; (ii) the same, but dynamically, at every training epoch; and (iii) using the GAVA score to weight the generator loss during training of the GenQA model. We evaluate our proposed methods on two academic datasets and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.
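Strategy (iii) is straightforward to express in code. The sketch below is a minimal, hypothetical version of evaluator-weighted training for a GenQA model: each example's cross-entropy loss is scaled by a score from the QA evaluation model, here represented simply as a tensor of values in [0, 1]. The paper's exact weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def weighted_genqa_loss(logits: torch.Tensor,       # (batch, seq, vocab)
                        targets: torch.Tensor,      # (batch, seq)
                        gava_scores: torch.Tensor,  # (batch,) in [0, 1]
                        ) -> torch.Tensor:
    # Per-token cross entropy, kept per example rather than averaged.
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    per_example = token_loss.mean(dim=1)                    # (batch,)
    # Scale each example by the evaluator's score for its answer, so
    # answers the evaluator rates as correct contribute more.
    return (gava_scores * per_example).mean()
```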
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
Muller, Benjamin, Soldaini, Luca, Koncel-Kedziorski, Rik, Lind, Eric, Moschitti, Alessandro
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from that of the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all five languages and monolingual generative pipelines for three of the five languages studied.
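A hypothetical sketch of the cross-lingual input format: the generator conditions on top-ranked passages regardless of their language and produces a full-sentence answer in the question's language. The field names and separator token below are illustrative assumptions, not the paper's exact format.

```python
def build_genqa_input(question: str, passages: list[tuple[str, str]]) -> str:
    """Concatenate a question with ranked (language, text) passages."""
    parts = [f"question: {question}"]
    for lang, text in passages:  # passages may be in any language
        parts.append(f"passage ({lang}): {text}")
    return " </s> ".join(parts)

example = build_genqa_input(
    "What is the population of Tokyo?",
    [("en", "Tokyo has a population of roughly 14 million."),
     ("ja", "東京都の人口は約1400万人である。")],
)
print(example)
```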
Data-Driven Methods for Solving Algebra Word Problems
Robaidek, Benjamin, Koncel-Kedziorski, Rik, Hajishirzi, Hannaneh
We explore contemporary, data-driven techniques for solving math word problems over recent large-scale datasets. We show that well-tuned neural equation classifiers can outperform more sophisticated models such as sequence-to-sequence and self-attention models across these datasets. Our error analysis indicates that, while fully data-driven models show some promise, semantic and world knowledge is necessary for further advances.
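The equation-classifier approach can be illustrated with a toy sketch: treat solving as choosing one equation template from a fixed inventory, then instantiating it with the numbers extracted from the problem. The template inventory and mock classifier below are stand-ins, not the paper's model.

```python
import re

# A toy inventory of equation templates; real systems use many more,
# mined from the training data.
TEMPLATES = ["n0 + n1", "n0 - n1", "n0 * n1", "n0 / n1"]

def solve(problem: str, predict_template) -> float:
    numbers = re.findall(r"\d+(?:\.\d+)?", problem)
    template = TEMPLATES[predict_template(problem)]  # classifier's choice
    expr = template.replace("n0", numbers[0]).replace("n1", numbers[1])
    return float(eval(expr))

# With a mock classifier that picks the multiplication template:
print(solve("Sam buys 3 packs of 12 pens. How many pens in total?",
            lambda p: 2))  # -> 36.0
```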