AITopics | Daly, Elizabeth

Daly, Elizabeth

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Marinescu, Radu, Bhattacharjya, Debarun, Lee, Junkyu, Tchrakian, Tigran, Cano, Javier Carnerero, Hou, Yufang, Daly, Elizabeth, Pascale, Alessandra

arXiv.org Artificial Intelligence, Feb-25-2025

Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
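The idea of combining entailment and contradiction signals into a posterior for each atom can be illustrated with a much simpler stand-in. The sketch below is not FactReasoner's joint distribution or inference procedure; it is a naive Bayes-style odds update that treats each retrieved context as independent evidence. The function names, the 0.5 prior, and the reading of a relation's confidence as a likelihood ratio are all assumptions made for illustration.

```python
def posterior_supported(prior, evidence):
    """Posterior that one atom is supported, via an independent odds update.

    prior: assumed prior probability that the atom is true, in (0, 1).
    evidence: list of (relation, confidence) pairs, where relation is
        "entail" or "contradict" and confidence is in (0, 1).
    Assumption: each confidence c contributes a likelihood ratio c / (1 - c),
    multiplied in for entailment and divided out for contradiction.
    """
    odds = prior / (1.0 - prior)
    for relation, confidence in evidence:
        ratio = confidence / (1.0 - confidence)
        if relation == "entail":
            odds *= ratio
        elif relation == "contradict":
            odds /= ratio
        else:
            raise ValueError(f"unknown relation: {relation}")
    return odds / (1.0 + odds)


def factual_precision(posteriors, threshold=0.5):
    """Fraction of atoms whose posterior strictly exceeds the threshold."""
    supported = sum(p > threshold for p in posteriors)
    return supported / len(posteriors)


# One entailing context raises the posterior; an equally confident
# contradicting context cancels it back to the prior.
one_context = posterior_supported(0.5, [("entail", 0.9)])
conflicting = posterior_supported(0.5, [("entail", 0.9), ("contradict", 0.9)])
print(round(one_context, 3), round(conflicting, 3))
```

Treating contexts as independent is the main simplification here; the abstract describes a joint distribution over atoms and contexts, which can also account for relationships among the contexts themselves.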

2502.18573

Country:

North America > United States (0.67)
Europe > Middle East > Malta (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.68)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

BenchmarkCards: Large Language Model and Risk Reporting

Sokol, Anna, Moniz, Nuno, Daly, Elizabeth, Hind, Michael, Chawla, Nitesh

arXiv.org Artificial Intelligence, Oct-16-2024

Large language models (LLMs) offer powerful capabilities but also introduce significant risks. One way to mitigate these risks is through comprehensive pre-deployment evaluations using benchmarks designed to test for specific vulnerabilities. However, the rapidly expanding body of LLM benchmark literature lacks a standardized method for documenting crucial benchmark details, hindering consistent use and informed selection. BenchmarkCards addresses this gap by providing a structured framework specifically for documenting LLM benchmark properties rather than defining the entire evaluation process itself. BenchmarkCards do not prescribe how to measure or interpret benchmark results (e.g., defining "correctness") but instead offer a standardized way to capture and report critical characteristics like targeted risks and evaluation methodologies, including properties such as bias and fairness. This structured metadata facilitates informed benchmark selection, enabling researchers to choose appropriate benchmarks and promoting transparency and reproducibility in LLM evaluation.

2410.12974

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Hou, Yufang, Pascale, Alessandra, Carnerero-Cano, Javier, Tchrakian, Tigran, Marinescu, Radu, Daly, Elizabeth, Padhi, Inkit, Sattigeri, Prasanna

arXiv.org Artificial Intelligence, Jun-19-2024

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.

2406.13805

Country:

Asia (0.28)
North America > United States (0.14)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Government (1.00)
Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Ranking Large Language Models without Ground Truth

Dhurandhar, Amit, Nair, Rahul, Singh, Moninder, Daly, Elizabeth, Ramamurthy, Karthikeyan Natesan

arXiv.org Artificial Intelligence, Jun-10-2024

Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.
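The triplet idea can be sketched as a voting-and-elimination loop. This is a toy illustration only: it is not either of the two ranking methods the paper proposes, and the `judge` callback, the hidden `skill` table, and the rule that a novice judges the other two wrongly are hypothetical stand-ins for real LLM-to-LLM comparisons.

```python
from collections import Counter
from itertools import combinations


def worst_in_triplet(triplet, judge):
    """Each model names the worse of the other two; return the most-named model."""
    votes = Counter()
    for evaluator in triplet:
        a, b = [m for m in triplet if m != evaluator]
        votes[judge(evaluator, a, b)] += 1
    # Ties are broken by position in the triplet (max keeps the first maximum).
    return max(triplet, key=lambda m: votes[m])


def rank_by_elimination(models, judge):
    """Repeatedly drop the model most often voted worst across all triplets.

    Returns (top_two, rest): the last two survivors, which triplets alone
    cannot order, followed by the eliminated models from best to worst.
    """
    remaining = list(models)
    eliminated = []
    while len(remaining) > 2:
        tally = Counter(
            worst_in_triplet(t, judge) for t in combinations(remaining, 3)
        )
        worst = max(remaining, key=lambda m: tally[m])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated[::-1]


# Hypothetical hidden quality, used only to simulate the judgements.
skill = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.2}


def toy_judge(evaluator, a, b):
    """An evaluator stronger than the weaker candidate judges correctly;
    otherwise (a novice) it wrongly names the stronger candidate."""
    low, high = sorted((a, b), key=skill.get)
    return low if skill[evaluator] > skill[low] else high


print(rank_by_elimination(["A", "B", "C", "D"], toy_judge))
# (['A', 'B'], ['C', 'D'])
```

In each toy triplet the two stronger models outvote the novice, which mirrors the intuition in the abstract that both an expert and a knowledgeable person can identify a novice.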

2402.1486

Country:

Europe (0.67)
North America > Canada > Ontario (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Explaining Knock-on Effects of Bias Mitigation

Nizhnichenkov, Svetoslav, Nair, Rahul, Daly, Elizabeth, Mac Namee, Brian

arXiv.org Artificial Intelligence, Dec-1-2023

In machine learning systems, bias mitigation approaches aim to make outcomes fairer across privileged and unprivileged groups. Bias mitigation methods work in different ways and have known "waterfall" effects, e.g., mitigating bias at one place may manifest bias elsewhere. In this paper, we aim to characterise impacted cohorts when mitigation interventions are applied. To do so, we treat intervention effects as a classification task and learn an explainable meta-classifier to identify cohorts that have altered outcomes. We examine a range of bias mitigation strategies that work at various stages of the model life cycle. We empirically demonstrate that our meta-classifier is able to uncover impacted cohorts. Further, we show that all tested mitigation strategies negatively impact a non-trivial fraction of cases, i.e., people who receive unfavourable outcomes solely on account of mitigation efforts. This is despite improvement in fairness metrics. We use these results as a basis to argue for more careful audits of static mitigation interventions that go beyond aggregate metrics.

2312.00765

Country:

Europe (1.00)
North America > United States (0.94)

Genre: Research Report > New Finding (0.47)

Industry: Government > Regional Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification

Gajcin, Jasmina, McCarthy, James, Nair, Rahul, Marinescu, Radu, Daly, Elizabeth, Dusparic, Ivana

arXiv.org Artificial Intelligence, Aug-30-2023

A well-defined reward function is crucial for successful training of a reinforcement learning (RL) agent. However, defining a suitable reward function is a notoriously challenging task, especially in complex, multi-objective environments. Developers often have to resort to starting with an initial, potentially misspecified reward function and iteratively adjusting its parameters based on observed learned behavior. In this work, we aim to automate this process by proposing ITERS, an iterative reward shaping approach using human feedback for mitigating the effects of a misspecified reward function. Our approach allows the user to provide trajectory-level feedback on the agent's behavior during training, which can be integrated as a reward shaping signal in the following training iteration. We also allow the user to provide explanations of their feedback, which are used to augment the feedback and reduce user effort and feedback frequency. We evaluate ITERS in three environments and show that it can successfully correct misspecified reward functions.
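A minimal sketch of how trajectory-level feedback might be folded into the reward for the next training iteration is shown below. It is not the ITERS implementation: the `ShapedReward` class, the `scale` parameter, the +1/-1 signal convention, and the exact-match lookup on (state, action) pairs are assumptions chosen to keep the example small, and the explanation-based augmentation described in the abstract is omitted.

```python
from collections import defaultdict


class ShapedReward:
    """Wraps an environment reward and adds a human-feedback shaping term."""

    def __init__(self, env_reward, scale=1.0):
        self.env_reward = env_reward        # callable: (state, action) -> float
        self.scale = scale                  # assumed weight of the shaping term
        self.shaping = defaultdict(float)   # accumulated feedback per (state, action)

    def add_feedback(self, trajectory, signal):
        """Record feedback for every step of a marked trajectory.

        trajectory: list of (state, action) pairs flagged by the user.
        signal: +1 for desirable behavior, -1 for undesirable (assumed convention).
        """
        for step in trajectory:
            self.shaping[step] += signal

    def __call__(self, state, action):
        # .get avoids creating entries for pairs that never received feedback.
        bonus = self.shaping.get((state, action), 0.0)
        return self.env_reward(state, action) + self.scale * bonus


# Hypothetical misspecified reward: every step earns 1.0 regardless of behavior.
reward = ShapedReward(lambda state, action: 1.0, scale=0.5)
reward.add_feedback([("s1", "left"), ("s2", "left")], signal=-1)

print(reward("s1", "left"))   # 0.5, penalized by the feedback
print(reward("s3", "right"))  # 1.0, untouched
```

Exact-match lookup only affects the steps the user actually marked; generalizing feedback to similar states is one way to reduce how often the user has to be asked.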

2308.15969

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)

Contrastive Explanations for Comparing Preferences of Reinforcement Learning Agents

Gajcin, Jasmina, Nair, Rahul, Pedapati, Tejaswini, Marinescu, Radu, Daly, Elizabeth, Dusparic, Ivana

arXiv.org Artificial Intelligence, Dec-17-2021

In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately but employ different strategies can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategies between policies is necessary to enable users to choose between offered policies, and can help developers understand different behaviors that emerge from various reward functions and training hyperparameters in RL systems. In this work, we compare the behavior of two policies trained on the same task but with different preferences in objectives. We propose a method for distinguishing differences in behavior that stem from different abilities from those that are a consequence of opposing preferences of two RL agents. Furthermore, we use only data on preference-based differences to generate contrasting explanations about agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task and compare the behavior of a safety-oriented policy and one that prefers speed.

2112.09462

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Designing Machine Learning Pipeline Toolkit for AutoML Surrogate Modeling Optimization

Palmes, Paulito P., Kishimoto, Akihiro, Marinescu, Radu, Ram, Parikshit, Daly, Elizabeth

arXiv.org Artificial Intelligence, Jul-13-2021

The pipeline optimization problem in machine learning requires simultaneous optimization of pipeline structures and parameter adaptation of their elements. Having an elegant way to express these structures can help lessen the complexity in the management and analysis of their performance, together with the different choices of optimization strategies. With these issues in mind, we created the AutoMLPipeline (AMLP) toolkit, which facilitates the creation and evaluation of complex machine learning pipeline structures using simple expressions. We use AMLP to find optimal pipeline signatures, datamine them, and use these datamined features to speed up learning and prediction. We formulated a two-stage pipeline optimization with surrogate modeling in AMLP which outperforms other AutoML approaches with a 4-hour time budget in less than 5 minutes of AMLP computation time.

2107.01253

Country:

North America > United States (0.28)
Europe > Austria > Vienna (0.14)

Genre: Research Report (0.40)

Industry: Information Technology (0.31)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)

Computing Multi-Modal Journey Plans under Uncertainty

Botea, Adi, Kishimoto, Akihiro, Nikolova, Evdokia, Braghin, Stefano, Berlingerio, Michele, Daly, Elizabeth

Journal of Artificial Intelligence Research, Aug-16-2019

Multi-modal journey planning, which allows multiple types of transport within a single trip, is becoming increasingly popular, due to a strong practical interest and an increasing availability of data. In real life, transport networks feature uncertainty. Yet, most approaches assume a deterministic environment, making plans more prone to failures such as missed connections and major delays in the arrival. This paper presents an approach to computing optimal contingent plans in multi-modal journey planning. The problem is modeled as a search in an and/or state space. We describe search enhancements used on top of the AO* algorithm. Enhancements include admissible heuristics, multiple types of pruning that preserve the completeness and the optimality, and a hybrid search approach with a deterministic and a nondeterministic search. We demonstrate an NP-hardness result, with the hardness stemming from the dynamically changing distributions of the travel time random variables. We perform a detailed empirical analysis on realistic transport networks from cities such as Montpellier, Rome and Dublin. The results demonstrate the effectiveness of our algorithmic contributions, and the benefits of contingent plans as compared to standard sequential plans, when the arrival and departure times of buses are characterized by uncertainty.

doi: 10.1613/jair.1.11422

AI Access Foundation

11422

Country:

Europe > France > Occitanie > Hérault > Montpellier (0.25)
North America > United States > Texas > Travis County (0.14)
North America > United States > California > San Francisco County (0.14)

Genre: Research Report > New Finding (0.66)

Industry:

Transportation > Infrastructure & Services (1.00)
Leisure & Entertainment (0.92)
Transportation > Ground > Road (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)

Generating Dialogue Agents via Automated Planning

Botea, Adi, Muise, Christian, Agarwal, Shubham, Alkan, Oznur, Bajgar, Ondrej, Daly, Elizabeth, Kishimoto, Akihiro, Lastras, Luis, Marinescu, Radu, Ondrej, Josef, Pedemonte, Pablo, Vodolan, Miroslav

arXiv.org Artificial Intelligence, Feb-2-2019

Dialogue systems have many applications such as customer support or question answering. Typically they have been limited to shallow single-turn interactions. However, more advanced applications such as career coaching or planning a trip require a much more complex multi-turn dialogue. Current limitations of conversational systems have made it difficult to support applications that require personalization, customization and context-dependent interactions. We tackle this challenging problem by using domain-independent AI planning to automatically create dialogue plans, customized to guide a dialogue towards achieving a given goal. The input includes a library of atomic dialogue actions, an initial state of the dialogue, and a goal. Dialogue plans are plugged into a dialogue system capable of orchestrating their execution. Use cases demonstrate the viability of the approach. Our work on dialogue planning has been integrated into a product, and it is in the process of being deployed into another.

1902.00771

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry: Consumer Products & Services > Travel (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.93)
