Aharoni, Roee
Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Yona, Gal, Honovich, Or, Levy, Omer, Aharoni, Roee
Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains -- mathematical reasoning and factual knowledge -- reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains $k$ answers by using only $10$ model samples and similarly guessing the remaining $k-10$ attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
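To make the comparison concrete, here is a minimal Python sketch of coverage@k for repeated sampling versus the prevalence-based enumeration baseline and the mixture strategy described above. This is an illustration under our own assumptions (one gold answer per problem, exact-match correctness); the function names are ours, not the paper's.

```python
# Illustrative sketch (not the authors' code) of coverage@k with a
# prompt-agnostic enumeration baseline and a 10-sample mixture strategy.
from collections import Counter

def coverage_at_k(guesses_per_problem, gold_answers, k):
    """Fraction of problems for which any of the first k guesses matches the gold answer."""
    solved = sum(
        any(g == gold for g in guesses[:k])
        for guesses, gold in zip(guesses_per_problem, gold_answers)
    )
    return solved / len(gold_answers)

def enumeration_guesses(train_answers, k):
    """Prompt-agnostic baseline: enumerate the k most common answers in the training set."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]

def mixture_guesses(model_samples, train_answers, k, num_model=10):
    """Mixture strategy: 10 model samples, then fill the remaining k-10 attempts by enumeration."""
    guesses = list(model_samples[:num_model])
    for ans in enumeration_guesses(train_answers, k):
        if len(guesses) >= k:
            break
        if ans not in guesses:
            guesses.append(ans)
    return guesses
```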
Localizing Factual Inconsistencies in Attributable Text Generation
Cattan, Arie, Roit, Paul, Zhang, Shiyue, Wan, David, Aharoni, Roee, Szpektor, Idan, Bansal, Mohit, Dagan, Ido
There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement ($\kappa > 0.7$). Then, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and open-source LLMs.
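A rough sketch of the pipeline implied by the abstract: decompose the generated text into QA pairs, then check each pair against the reference with an entailment model. The code below is illustrative only; `decompose_to_qa_pairs` and `nli_entails` are hypothetical placeholders for the decomposition and entailment components, not artifacts released with the paper.

```python
# Illustrative sketch of the QASemConsistency idea: verify each
# predicate-argument QA pair against a trusted reference text.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str   # e.g., "Who visited somewhere?"
    answer: str     # e.g., "the president"
    statement: str  # declarative form of the QA pair, used as the NLI hypothesis

def localize_inconsistencies(generated_text, reference_text,
                             decompose_to_qa_pairs, nli_entails):
    """Return the QA pairs (semantic relations) that the reference does not support."""
    unsupported = []
    for qa in decompose_to_qa_pairs(generated_text):
        # Each QA pair covers a single predicate-argument relation, so an
        # unsupported pair pinpoints which relation is factually inconsistent.
        if not nli_entails(premise=reference_text, hypothesis=qa.statement):
            unsupported.append(qa)
    return unsupported
```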
Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations
Cattan, Arie, Jacovi, Alon, Fabrikant, Alex, Herzig, Jonathan, Aharoni, Roee, Rashkin, Hannah, Marcus, Dror, Hassidim, Avinatan, Matias, Yossi, Szpektor, Idan, Caciularu, Avi
Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; however, naively adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long-context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations leverage the same context as the target query while adding only a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method to multiple LLMs and obtain substantial improvements (+23\% on average across models) on various QA datasets with long context, especially when the answer lies in the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.
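The prompt structure described here can be sketched as follows: the long context appears once, then the generated demonstrations over that same context, then the target question. This is a minimal illustration under our assumptions; `generate_demo_pairs` is a hypothetical stand-in for the LLM call that produces query-output pairs from the context.

```python
# Illustrative sketch of "recycling the context" for few-shot long-context QA.
def build_recycled_prompt(context, target_question, generate_demo_pairs, n_demos=3):
    """Build a prompt where the long context appears once, followed by generated
    demonstrations over that same context and, finally, the target question."""
    demos = generate_demo_pairs(context, n_demos)  # [(question, relevant_paragraphs, answer), ...]
    parts = [f"Context:\n{context}\n"]
    for question, paragraphs, answer in demos:
        # Each demonstration first cites the relevant paragraphs, then answers,
        # which also provides fine-grained attribution to the answer source.
        parts.append(f"Question: {question}\n"
                     f"Relevant paragraphs: {paragraphs}\n"
                     f"Answer: {answer}\n")
    parts.append(f"Question: {target_question}\nRelevant paragraphs:")
    return "\n".join(parts)
```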
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Gekhman, Zorik, Yona, Gal, Aharoni, Roee, Eyal, Matan, Feder, Amir, Reichart, Roi, Herzig, Jonathan
When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk of introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
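The controlled setup can be pictured as follows: keep the fine-tuning set size fixed and vary the fraction of examples whose facts the model does not already know. The sketch below is ours, not the paper's code; `model_knows` is a hypothetical placeholder for the knowledge-categorization step (e.g., closed-book probing of the pre-trained model).

```python
# Illustrative sketch of building fine-tuning sets with a controlled
# proportion of new-knowledge ("unknown") examples.
import random

def build_finetuning_set(qa_examples, model_knows, unknown_fraction, size, seed=0):
    """Fix the fine-tuning set size and vary the fraction of examples that
    introduce knowledge the model does not already have."""
    known = [ex for ex in qa_examples if model_knows(ex)]
    unknown = [ex for ex in qa_examples if not model_knows(ex)]
    n_unknown = round(unknown_fraction * size)
    rng = random.Random(seed)
    subset = rng.sample(unknown, n_unknown) + rng.sample(known, size - n_unknown)
    rng.shuffle(subset)
    return subset
```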
MiMiC: Minimally Modified Counterfactuals in the Representation Space
Singh, Shashwat, Ravfogel, Shauli, Herzig, Jonathan, Aharoni, Roee, Cotterell, Ryan, Kumaraguru, Ponnurangam
Language models often exhibit undesirable behaviors, such as gender bias or toxic language. Interventions in the representation space have been shown to be effective in mitigating such issues by altering the LM's behavior. We first show that two prominent intervention techniques, Linear Erasure and Steering Vectors, do not enable a high degree of control and are limited in expressivity. We then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., "toxic") resemble those of a target class (e.g., "non-toxic"). This approach, generalizing previous linear intervention techniques, utilizes a closed-form solution for the Earth Mover's problem under Gaussian assumptions and provides theoretical guarantees on the representation space's geometric organization. We further build on this technique and derive a nonlinear intervention that enables controlled generation. We demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
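The closed-form Earth Mover's solution under Gaussian assumptions that the abstract alludes to is the standard optimal transport map between Gaussians, $T(x) = \mu_t + A(x - \mu_s)$ with $A = \Sigma_s^{-1/2}(\Sigma_s^{1/2}\Sigma_t\Sigma_s^{1/2})^{1/2}\Sigma_s^{-1/2}$. The sketch below implements this textbook map as an illustration; it is not the paper's released code, and variable names are ours.

```python
# Illustrative sketch: closed-form Gaussian optimal transport map that moves
# source-class representations so their mean and covariance match the target class.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(source_reps, target_reps):
    """source_reps, target_reps: (n, d) arrays of hidden representations."""
    mu_s, mu_t = source_reps.mean(0), target_reps.mean(0)
    S_s = np.cov(source_reps, rowvar=False)
    S_t = np.cov(target_reps, rowvar=False)
    S_s_half = np.real(sqrtm(S_s))
    S_s_inv_half = np.linalg.inv(S_s_half)
    A = S_s_inv_half @ np.real(sqrtm(S_s_half @ S_t @ S_s_half)) @ S_s_inv_half

    def transport(x):
        # Map a source-class representation toward the target-class distribution.
        return mu_t + (x - mu_s) @ A.T

    return transport
```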
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Jacovi, Alon, Bitton, Yonatan, Bohnet, Bernd, Herzig, Jonathan, Honovich, Or, Tseng, Michael, Collins, Michael, Aharoni, Roee, Geva, Mor
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning steps to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce Reveal: Reasoning Verification Evaluation, a new dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question answering settings. Reveal includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a wide variety of datasets and state-of-the-art language models.
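To illustrate what step-level verification labels might look like, here is a hypothetical record for one reasoning step covering the three label types named above (relevance, attribution to an evidence passage, and logical correctness). The field names are ours for illustration, not the dataset's actual schema.

```python
# Illustrative sketch of a step-level verification record; field names are
# hypothetical and do not reflect the released dataset format.
from dataclasses import dataclass

@dataclass
class ReasoningStepLabel:
    question: str               # the open-domain question being answered
    step_text: str              # one step of the model's Chain-of-Thought answer
    is_relevant: bool           # does the step contribute to answering the question?
    attribution: str            # e.g., "supported" / "contradicted" / "not enough info"
    evidence_passage: str       # passage the attribution judgment refers to
    is_logically_correct: bool  # does the step follow from the preceding steps?
```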
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Yona, Gal, Aharoni, Roee, Geva, Mor
Factual questions can typically be answered correctly at different levels of granularity. For example, both ``August 4, 1961'' and ``1961'' are correct answers to the question ``When was Barack Obama born?''. Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields nearly a 20-point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.
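One way to picture multi-granularity evaluation is to score a prediction against an ordered answer set, from most specific to coarsest, crediting both accuracy and informativeness. The sketch below is purely illustrative: the specific weighting is a hypothetical choice of ours and not the GRANOLA QA metric itself.

```python
# Illustrative sketch of scoring against multi-granularity answers.
def granola_style_score(prediction, multi_granularity_answers, match):
    """
    multi_granularity_answers: answers ordered from most to least specific,
      e.g. ["August 4, 1961", "August 1961", "1961"].
    match: an answer-matching predicate (e.g., normalized exact match).
    Returns (accuracy, informativeness).
    """
    n = len(multi_granularity_answers)
    for depth, gold in enumerate(multi_granularity_answers):
        if match(prediction, gold):
            accuracy = 1.0
            # Coarser matches still count as correct but as less informative
            # (a hypothetical linear decay, for illustration only).
            informativeness = 1.0 - depth / n
            return accuracy, informativeness
    return 0.0, 0.0
```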
Multilingual Instruction Tuning With Just a Pinch of Multilinguality
Shaham, Uri, Herzig, Jonathan, Aharoni, Roee, Szpektor, Idan, Tsarfaty, Reut, Eyal, Matan
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. One promising approach is cross-lingual transfer, where a model acquires a capability in one language by fine-tuning on another. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages. We first show that even monolingual tuning in many languages transfers some instruction-following capabilities to other languages. Furthermore, we find that only 40 multilingual examples in an English tuning set substantially improve multilingual instruction-following, in both languages seen and unseen during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in several languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that increasing the number of languages in the instruction tuning set from 1 to just 2, 3, or 4 increases cross-lingual generalization. Our results suggest that massively multilingual instruction-tuned models can be built with only a very small set of multilingual instruction-response pairs.
mFACE: Multilingual Summarization with Factual Consistency Evaluation
Aharoni, Roee, Narayan, Shashi, Maynez, Joshua, Herzig, Jonathan, Clark, Elizabeth, Lapata, Mirella
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world applications. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine-generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results on the 45 languages of the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
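The data-filtering approach can be sketched as keeping only training pairs whose summary is entailed by its source document according to a multilingual NLI model; the controlled-generation approach is not covered here. This is an illustrative sketch under our assumptions, and `nli_entailment_prob` and the threshold are hypothetical placeholders.

```python
# Illustrative sketch of NLI-based filtering of summarization training data.
def filter_training_data(examples, nli_entailment_prob, threshold=0.5):
    """examples: iterable of (document, summary) pairs.
    Keep only pairs whose summary is entailed by the document."""
    kept = []
    for document, summary in examples:
        # Treat the document as the premise and the summary as the hypothesis;
        # a low entailment probability signals a likely factual inconsistency.
        if nli_entailment_prob(premise=document, hypothesis=summary) >= threshold:
            kept.append((document, summary))
    return kept
```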
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Jacovi, Alon, Caciularu, Avi, Herzig, Jonathan, Aharoni, Roee, Bohnet, Bernd, Geva, Mor
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.