AITopics | Agarwal, Divyansh

Collaborating Authors

Agarwal, Divyansh

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BingoGuard: LLM Content Moderation Tools with Risk Levels

Yin, Fan, Laban, Philippe, Peng, Xiangyu, Zhou, Yilun, Mao, Yixin, Vats, Vaibhav, Ross, Linnea, Agarwal, Divyansh, Xiong, Caiming, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceMar-9-2025

Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity, styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis on model behaviors on different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves the state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming best public models, WildGuard, by 4.3\%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.

large language model, machine learning, severity level, (20 more...)

arXiv.org Artificial Intelligence

2503.0655

Country: North America > United States > California (0.14)

Genre:

Instructional Material (1.00)
Research Report (0.82)

Industry:

Media > News (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluating Cultural and Social Awareness of LLM Web Agents

Qiu, Haoyi, Fabbri, Alexander R., Agarwal, Divyansh, Huang, Kung-Hsiang, Tan, Sarah, Peng, Nanyun, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceOct-30-2024

As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods: prompting and fine-tuning, and find that combining both methods can offer complementary advantages -- fine-tuning on culture-specific datasets significantly enhances the agents' ability to generalize across different regions, while prompting boosts the agents' ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents' cultural and social awareness during the development cycle.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.23252

Country:

Europe (0.93)
Asia > Middle East (0.46)
Africa > Middle East > Egypt (0.14)
North America > United States > California (0.14)

Genre:

Research Report (0.64)
Workflow (0.46)

Industry:

Law (1.00)
Government > Immigration & Customs (0.67)
Information Technology > Services > e-Commerce Services (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions

Agarwal, Divyansh, Fabbri, Alexander R., Laban, Philippe, Risher, Ben, Joty, Shafiq, Xiong, Caiming, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceApr-26-2024

Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.16251

Country:

North America > United States (0.67)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Art or Artifice? Large Language Models and the False Promise of Creativity

Chakrabarty, Tuhin, Laban, Philippe, Agarwal, Divyansh, Muresan, Smaranda, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceSep-25-2023

Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, evaluating objectively the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X less TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2309.14556

Country:

North America > United States > Hawaii (0.17)
North America > United States > New York (0.15)
North America > United States > Louisiana (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine > Therapeutic Area (0.67)
Education > Curriculum > Subject-Specific Education (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Laban, Philippe, Kryściński, Wojciech, Agarwal, Divyansh, Fabbri, Alexander R., Xiong, Caiming, Joty, Shafiq, Wu, Chien-Sheng

arXiv.org Artificial IntelligenceMay-23-2023

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8\% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.

benchmark, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2305.1454

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.46)
Media > News (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation

Meng, Rui, Liu, Ye, Yavuz, Semih, Agarwal, Divyansh, Tu, Lifu, Yu, Ning, Zhang, Jianguo, Bhat, Meghana, Zhou, Yingbo

arXiv.org Artificial IntelligenceMar-7-2023

Dense retrievers have made significant strides in text retrieval and open-domain question answering, even though most achievements were made possible only with large amounts of human supervision. In this work, we aim to develop unsupervised methods by proposing two methods that create pseudo query-document pairs and train dense retrieval models in an annotation-free and scalable manner: query extraction and transferred query generation. The former method produces pseudo queries by selecting salient spans from the original document. The latter utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that models trained with the proposed augmentation methods can perform comparably well (or better) to multiple strong baselines. Combining those strategies leads to further improvements, achieving the state-of-the-art performance of unsupervised dense retrieval on both BEIR and ODQA datasets.

machine learning, natural language, question answering, (19 more...)

arXiv.org Artificial Intelligence

2212.08841

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.68)
Materials > Chemicals (0.46)
Government (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.88)

Add feedback

BookSum: A Collection of Datasets for Long-form Narrative Summarization

Kryściński, Wojciech, Rajani, Nazneen, Agarwal, Divyansh, Xiong, Caiming, Radev, Dragomir

arXiv.org Artificial IntelligenceDec-6-2022

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

dashwood, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2105.08209

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (0.45)

Industry:

Law (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

Add feedback

CREATIVESUMM: Shared Task on Automatic Summarization for Creative Writing

Agarwal, Divyansh, Fabbri, Alexander R., Han, Simeng, Kryściński, Wojciech, Ladhak, Faisal, Li, Bryan, McKeown, Kathleen, Radev, Dragomir, Zhang, Tianyi, Wiseman, Sam

arXiv.org Artificial IntelligenceDec-6-2022

This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique challenges and is yet underexplored for text summarization systems. In this shared task, we introduce four sub-tasks and their corresponding datasets, focusing on summarizing books, movie scripts, primetime television scripts, and daytime soap opera scripts. We detail the process of curating these datasets for the task, as well as the metrics used for the evaluation of the submissions. As part of the CREATIVESUMM workshop at COLING 2022, the shared task attracted 18 submissions in total. We discuss the submissions and the baselines for each sub-task in this paper, along with directions for facilitating future work in the field.

artificial intelligence, natural language, summarization, (13 more...)

arXiv.org Artificial Intelligence

2211.05886

Country:

North America > United States (1.00)
Europe (0.68)

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment (0.69)
Media > Television (0.36)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Semblance: A Rank-Based Kernel on Probability Spaces for Niche Detection

Agarwal, Divyansh, Zhang, Nancy

arXiv.org Machine LearningAug-6-2018

Kernel methods provide a principled approach for detecting nonlinear relations using well understood linear algorithms. In exploratory data analyses when the underlying structure of the data's probability space is unclear, the choice of kernel is often arbitrary. Here, we present a novel kernel, Semblance, on a probability feature space. The advantage of Semblance lies in its distribution free formulation and its ability to detect niche features by placing greater emphasis on similarity between observation pairs that fall at the tail ends of a distribution, as opposed to those that fall towards the mean. We prove that Semblance is a valid Mercer kernel and illustrate its applicability through simulations and real world examples.

artificial intelligence, health & medicine, semblance, (18 more...)

arXiv.org Machine Learning

1808.02061

Country:

North America > United States > Pennsylvania (0.29)
Europe > United Kingdom > England (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback