AITopics | Yona, Gal

Yona, Gal

Confidence Improves Self-Consistency in LLMs

Taubenfeld, Amir, Sheffer, Tom, Ofek, Eran, Feder, Amir, Goldstein, Ariel, Gekhman, Zorik, Yona, Gal

arXiv.org Artificial Intelligence, Feb-10-2025

Self-consistency decoding enhances LLMs' performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as sampling many of these (lengthy) paths is required to increase the chances that the correct answer emerges as the most frequent one. To address this, we introduce Confidence-Informed Self-Consistency (CISC). CISC performs a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, it can identify the correct answer with a significantly smaller sample size. When tested on nine models and four datasets, CISC outperforms self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. In addition, we introduce the notion of within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC. Lastly, beyond these practical implications, our results and analyses show that LLMs can effectively judge the correctness of their own outputs, contributing to the ongoing debate on this topic.
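The contrast between a plain majority vote and a confidence-weighted vote can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the confidence values are hard-coded stand-ins for scores that would come from the model, and summing raw confidences is only one plausible weighting scheme.

```python
from collections import defaultdict


def majority_vote(answers):
    """Plain self-consistency: return the most frequent answer."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def confidence_weighted_vote(answers, confidences):
    """Weighted vote: each sampled path votes with its confidence score.

    Assumption: confidences are non-negative numbers, one per sampled path.
    """
    if len(answers) != len(confidences):
        raise ValueError("need exactly one confidence per answer")
    weights = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        weights[answer] += confidence
    return max(weights, key=weights.get)


# Hypothetical final answers from five sampled reasoning paths.
answers = ["12", "15", "15", "12", "12"]
# Hypothetical self-reported confidences for those paths.
confidences = [0.2, 0.9, 0.8, 0.3, 0.1]

plain = majority_vote(answers)                            # "12" (3 of 5 votes)
weighted = confidence_weighted_vote(answers, confidences)  # "15" (1.7 vs 0.6)
```

In this toy case the two low-frequency but high-confidence paths outweigh the three low-confidence ones, which is the kind of situation where a weighted vote could reach the right answer with fewer samples.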

arxiv preprint arxiv, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.06233

Country: Asia > Thailand (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Pathways on the Image Manifold: Image Editing via Video Generation

Rotstein, Noam, Yona, Gal, Silver, Daniel, Velich, Roy, Bensaïd, David, Kimmel, Ron

arXiv.org Artificial Intelligence, Nov-27-2024

Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.16819

Genre: Research Report > New Finding (0.68)

Industry: Media > Photography (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Keep Guessing? When Considering Inference Scaling, Mind the Baselines

Yona, Gal, Honovich, Or, Levy, Omer, Aharoni, Roee

arXiv.org Artificial Intelligence, Oct-20-2024

Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains -- mathematical reasoning and factual knowledge -- reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains $k$ answers by using only $10$ model samples and similarly guessing the remaining $k-10$ attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
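A prompt-agnostic enumeration baseline and the coverage metric described above can be sketched as follows. The function names and the toy answer lists are hypothetical, and exact string match is assumed as the correctness check; the paper's own matching and experimental setup may differ.

```python
from collections import Counter


def enumeration_baseline(train_answers, k):
    """Return the k most frequent training-set answers, most common first."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


def coverage(attempts_per_problem, gold_answers):
    """Fraction of problems whose gold answer appears among the attempts."""
    solved = sum(
        gold in attempts
        for attempts, gold in zip(attempts_per_problem, gold_answers)
    )
    return solved / len(gold_answers)


# Hypothetical training-set answers; "1" and "0" are the most common.
train_answers = ["0", "1", "1", "2", "1", "0", "5"]
# Hypothetical gold answers for four test problems.
gold_answers = ["1", "7", "0", "2"]

# The baseline ignores the prompt, so every problem gets the same guesses.
guesses = enumeration_baseline(train_answers, k=2)
baseline_coverage = coverage([guesses] * len(gold_answers), gold_answers)
```

Here the two guesses `["1", "0"]` solve two of the four problems, so the baseline reaches a coverage of 0.5 without ever reading a question. Comparing a model's coverage at the same `k` against this number shows how much of the gain is more than guessing.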

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.15466

Country: Asia > Middle East (0.46)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Gekhman, Zorik, Yona, Gal, Aharoni, Roee, Eyal, Matan, Feder, Amir, Reichart, Roi, Herzig, Jonathan

arXiv.org Artificial Intelligence, May-13-2024

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2405.05904

Country: North America > United States > Louisiana (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Yona, Gal, Aharoni, Roee, Geva, Mor

arXiv.org Artificial Intelligence, Jan-9-2024

Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.
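The difference between single-granularity and multi-granularity matching can be illustrated with a small sketch. The helper names, the normalization rule, and the ordering of answers from fine to coarse are assumptions made for this example; the sketch does not reproduce the GRANOLA QA accuracy and informativeness metrics themselves.

```python
def normalize(text):
    """Lowercase, drop commas, and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(",", " ").split())


def matched_level(prediction, answers_fine_to_coarse):
    """Return the index of the first matching answer, or None if none match.

    Assumption: index 0 is the most specific answer, larger indices are coarser.
    """
    pred = normalize(prediction)
    for level, answer in enumerate(answers_fine_to_coarse):
        if pred == normalize(answer):
            return level
    return None


# Multi-granularity answers for "When was Barack Obama born?".
answers = ["August 4, 1961", "August 1961", "1961"]

# Standard evaluation compares against a single granularity level only.
strict_correct = matched_level("1961", answers[:1]) is not None
# Multi-granularity evaluation accepts any level and records which one matched.
multi_correct = matched_level("1961", answers) is not None
level = matched_level("1961", answers)
```

The coarse prediction "1961" is rejected under the single-level comparison but accepted at level 2 under the multi-level one. The matched level is the kind of signal an informativeness score could be built on.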

large language model, machine learning, question answering, (19 more...)

arXiv.org Artificial Intelligence

2401.04695

Country:

North America > United States (1.00)
Europe > United Kingdom > England > Greater London > London (0.28)

Genre: Research Report (0.82)

Industry: Government > Regional Government > North America Government > United States Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Surfacing Biases in Large Language Models using Contrastive Input Decoding

Yona, Gal, Honovich, Or, Laish, Itay, Aharoni, Roee

arXiv.org Artificial Intelligence, May-12-2023

Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when introducing a model with an input text and a perturbed, "contrastive" version of it, meaningful differences in the next-token predictions may not be revealed with standard decoding strategies. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm to generate text given two inputs, where the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in how the LM output differs for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and quantify the effect of different input perturbations.
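The idea of picking a continuation that is likely under one input but unlikely under the other can be sketched for a single decoding step. The log-ratio score below is a simple stand-in chosen for illustration and is not claimed to be the CID formulation from the paper. The function names, the `lam` weight, and the toy next-token distributions are all hypothetical.

```python
import math


def greedy_token(probs):
    """Standard greedy decoding step: the most probable next token."""
    return max(probs, key=probs.get)


def contrastive_token(probs_a, probs_b, lam=1.0, eps=1e-9):
    """Pick the token that is likely under input A but unlikely under input B.

    Assumed score: log p_A(token) - lam * log p_B(token).
    """
    scores = {
        token: math.log(p + eps) - lam * math.log(probs_b.get(token, 0.0) + eps)
        for token, p in probs_a.items()
    }
    return max(scores, key=scores.get)


# Hypothetical next-token distributions for an input and its perturbed version.
probs_original = {"doctor": 0.5, "nurse": 0.3, "teacher": 0.2}
probs_perturbed = {"doctor": 0.5, "nurse": 0.05, "teacher": 0.45}

same_top = greedy_token(probs_original) == greedy_token(probs_perturbed)
contrast = contrastive_token(probs_original, probs_perturbed)
```

Greedy decoding returns "doctor" for both inputs, so the perturbation looks harmless. The contrastive step returns "nurse", the token whose probability changed most in favor of the original input, which surfaces the difference that standard decoding hides.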

artificial intelligence, continuation, natural language, (14 more...)

arXiv.org Artificial Intelligence

2305.07378

Genre: Research Report (0.64)

Industry: Law (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Malign Overfitting: Interpolation Can Provably Preclude Invariance

Wald, Yoav, Yona, Gal, Shalit, Uri, Carmon, Yair

arXiv.org Artificial Intelligence, Nov-28-2022

Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of "benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.

artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2211.15724

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.45)

Who's responsible? Jointly quantifying the contribution of the learning algorithm and training data

Yona, Gal, Ghorbani, Amirata, Zou, James

arXiv.org Artificial Intelligence, Oct-9-2019

Affiliations: Gal Yona (Weizmann Institute), Amirata Ghorbani (Stanford University), James Zou (Stanford University)

Abstract: A fancy learning algorithm A outperforms a baseline method B when they are both trained on the same data. Should A get all of the credit for the improved performance or does the training data also deserve some credit? When deployed in a new setting from a different domain, however, A makes more mistakes than B. How much of the blame should go to the learning algorithm or the training data? Such questions are becoming increasingly important and prevalent as we aim to make ML more accountable. Their answers would also help us allocate resources between algorithm design and data collection. In this paper, we formalize these questions and provide a principled Extended Shapley framework to jointly quantify the contribution of the learning algorithm and training data. Extended Shapley uniquely satisfies several natural properties that ensure equitable treatment of data and algorithm. Through experiments and theoretical analysis, we demonstrate that Extended Shapley has several important applications: 1) it provides a new metric of ML performance improvement that disentangles the influence of the data regime and the algorithm; 2) it facilitates ML accountability by properly assigning responsibility for mistakes; 3) it provides more robustness to manipulation by the ML designer.

Introduction: In machine learning (ML), the standard way to evaluate a new learning algorithm A is to compare its performance with the performance of a baseline algorithm B, when A and B are trained on the same dataset D. For example, if A and B achieve 0.9 and 0.7 accuracy, respectively, then papers typically report that A is better than B by 0.2. Implicit in this ubiquitous practice is the assumption that A itself is solely responsible for all of the difference in performance.
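For background, the classical Shapley value that the framework's name refers to can be computed exactly for a handful of players by averaging marginal contributions over all orderings. The sketch below shows only that standard computation, not the Extended Shapley framework. The function name, the player names, and the toy value function are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a number, with value(empty set) = 0.
    Runtime grows factorially, so this only suits very small player sets.
    """
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {player: total / len(orderings) for player, total in totals.items()}


def toy_value(coalition):
    """Hypothetical accuracy gain from training on a subset of data points."""
    gain = 0.0
    if "x1" in coalition:
        gain += 0.2  # x1 helps on its own
    if "x2" in coalition and "x3" in coalition:
        gain += 0.1  # x2 and x3 only help together
    return gain


players = ["x1", "x2", "x3"]
phi = shapley_values(players, toy_value)
```

In this toy game `x1` receives 0.2 and the complementary pair `x2`, `x3` splits its joint gain evenly at 0.05 each. The values sum to the gain of the full set, which is the efficiency property of Shapley values.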

algorithm, health & medicine, oncology, (18 more...)

arXiv.org Artificial Intelligence

1910.04214

Country: Europe (0.14)

Genre: Research Report (0.51)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Preference-Informed Fairness

Kim, Michael P., Korolova, Aleksandra, Rothblum, Guy N., Yona, Gal

arXiv.org Machine Learning, Apr-3-2019

As algorithms are increasingly used to make important decisions pertaining to individuals, algorithmic discrimination is becoming a prominent concern. The seminal work of Dwork et al. [ITCS 2012] introduced the notion of individual fairness (IF): given a task-specific similarity metric, every pair of similar individuals should receive similar outcomes. In this work, we study fairness when individuals have diverse preferences over the possible outcomes. We show that in such settings, individual fairness can be too restrictive: requiring individual fairness can lead to less-preferred outcomes for the very individuals that IF aims to protect (e.g. a protected minority group). We introduce and study a new notion of preference-informed individual fairness (PIIF), a relaxation of individual fairness that allows for outcomes that deviate from IF, provided the deviations are in line with individuals' preferences. We show that PIIF can allow for solutions that are considerably more beneficial to individuals than the best IF solution. We further show how to efficiently optimize any convex objective over the outcomes subject to PIIF, for a rich class of individual preferences. Motivated by fairness concerns in targeted advertising, we apply this new fairness notion to the multiple-task setting introduced by Dwork and Ilvento [ITCS 2019]. We show that, in this setting too, PIIF can allow for considerably more beneficial solutions, and we extend our efficient optimization algorithm to this setting.
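The individual fairness condition mentioned above, that similar individuals should receive similar outcomes, is commonly stated as a Lipschitz constraint: the distance between two individuals' outcome distributions may not exceed their distance under the similarity metric. A minimal checker is sketched below, assuming total variation distance between outcome distributions. The function names, the individuals, and the metric values are hypothetical, and the sketch does not cover the preference-informed relaxation (PIIF).

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two outcome distributions (dicts)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def if_violations(allocation, metric, tol=1e-12):
    """List pairs whose outcomes differ by more than their metric distance."""
    violations = []
    for i, j in combinations(sorted(allocation), 2):
        if total_variation(allocation[i], allocation[j]) > metric(i, j) + tol:
            violations.append((i, j))
    return violations


# Hypothetical distributions over which ad each individual is shown.
allocation = {
    "alice": {"ad_A": 0.9, "ad_B": 0.1},
    "bob": {"ad_A": 0.2, "ad_B": 0.8},
    "carol": {"ad_A": 0.85, "ad_B": 0.15},
}
# Hypothetical task-specific similarity metric (smaller means more similar).
distances = {
    ("alice", "bob"): 0.1,
    ("alice", "carol"): 0.1,
    ("bob", "carol"): 1.0,
}

violations = if_violations(allocation, lambda i, j: distances[(i, j)])
```

Alice and Bob are close under the metric (0.1) but their outcome distributions differ by 0.7, so the pair is flagged. A preference-informed notion would additionally ask whether such a deviation matches what the individuals themselves prefer.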

fairness, game theory, optimization problem, (21 more...)

arXiv.org Machine Learning

1904.01793

Country: North America > United States > California (0.46)

Genre: Research Report (0.64)

Industry:

Marketing (1.00)
Law (1.00)
Information Technology > Services (1.00)
(2 more...)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
