AITopics | content

Industry:

Law (1.00)
Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.94)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsNov-18-2025, 08:42:34 GMT

Part I Appendix Table of Contents

Table 9) and identifying reflections (Error #20 in Table 13) are also noted.

artificial intelligence, large language model, natural language, (18 more...)

Country:

North America > Canada > Ontario (0.05)
South America > Chile (0.04)
North America > United States > Texas (0.04)

Genre: Collection (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Neural Information Processing SystemsOct-9-2025, 07:58:54 GMT

cf04d01a0e76f8b13095349d9caca033-Supplemental-Conference.pdf

artificial intelligence, deep learning, machine learning, (15 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Neural Information Processing SystemsOct-9-2025, 01:51:39 GMT

Appendix T able of Contents

We provide the guidelines presented to the users for the creation of the dataset. To see some examples of how the guidelines can be applied, visit the examples document. You can use it to rate each guideline and leave feedback for each task. The user should be allowed to refuse to give up any information. Ask the user to elaborate or rephrase instead.

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States (0.14)
Europe > Germany (0.14)

Genre: Questionnaire & Opinion Survey (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Neural Information Processing SystemsAug-17-2025, 09:01:37 GMT

Appendix T able of Contents

C.2 Proof of item 2 for constrained Markov game Here we construct a counter example to prove item 2. Consider a constrained Markov game with

artificial intelligence, nulla, stochastic modification, (17 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

arXiv.org Artificial IntelligenceMar-17-2025

Reliable and Efficient Amortized Model-based Evaluation

Truong, Sang, Tu, Yuheng, Liang, Percy, Li, Bo, Koyejo, Sanmi

Comprehensive evaluations of language models (LM) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnostic) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by careful controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve the evaluation efficiency through training a question generator given a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.

large language model, machine learning, natural language, (19 more...)

2503.13335

Country:

North America > United States > Missouri (0.05)
Oceania > Australia (0.04)
Europe > France (0.04)
(13 more...)

Genre: Research Report (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceMar-9-2025

BingoGuard: LLM Content Moderation Tools with Risk Levels

Yin, Fan, Laban, Philippe, Peng, Xiangyu, Zhou, Yilun, Mao, Yixin, Vats, Vaibhav, Ross, Linnea, Agarwal, Divyansh, Xiong, Caiming, Wu, Chien-Sheng

Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity, styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis on model behaviors on different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves the state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming best public models, WildGuard, by 4.3\%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.

classification, content, severity level, (15 more...)

2503.0655

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre:

Instructional Material (1.00)
Research Report (0.82)

Industry:

Media > News (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceFeb-28-2025

An Empirical Analysis of LLMs for Countering Misinformation

Proma, Adiba Mahbub, Pate, Neeley, Druckman, James, Ghoshal, Gourab, He, Hangfeng, Hoque, Ehsan

While Large Language Models (LLMs) can amplify online misinformation, they also show promise in tackling misinformation. In this paper, we empirically study the capabilities of three LLMs -- ChatGPT, Gemini, and Claude -- in countering political misinformation. We implement a two-step, chain-of-thought prompting approach, where models first identify credible sources for a given claim and then generate persuasive responses. Our findings suggest that models struggle to ground their responses in real news sources, and tend to prefer citing left-leaning sources. We also observe varying degrees of response diversity among models. Our findings highlight concerns about using LLMs for fact-checking through only prompt-engineering, emphasizing the need for more robust guardrails. Our results have implications for both researchers and non-technical users.

llm, news source, phase system content, (13 more...)

2503.01902

Country:

North America > United States (1.00)
North America > Canada (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Media > News (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

arXiv.org Artificial IntelligenceFeb-13-2025

Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models

Zou, Qingsong, Xiao, Jingyu, Li, Qing, Yan, Zhi, Wang, Yuhang, Xu, Li, Wang, Wenxuan, Gao, Kuofeng, Li, Ruoyu, Jiang, Yong

Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior researches reveal the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to systematically examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into code-style structured query to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, ant the results show that QueryAttack achieves high attack success rates (ASRs) across LLMs with different developers and capabilities. We also evaluate QueryAttack's performance against common defenses, confirming that it is difficult to mitigate with general defensive techniques. To defend against QueryAttack, we tailor a defense method which can reduce ASR by up to 64\% on GPT-4-1106. The code of QueryAttack can be found on https://anonymous.4open.science/r/QueryAttack-334B.

large language model, machine learning, natural language, (19 more...)

2502.09723

Country:

Europe > Austria > Vienna (0.14)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)
(11 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology > Security & Privacy (1.00)
Consumer Products & Services > Travel (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Atil, Berk, Gupta, Vipul, Das, Sarkar Snigdha Sarathi, Passonneau, Rebecca J.

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

arXiv.org Artificial IntelligenceFeb-7-2025

Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as edge devices, but with different propensity to generate harmful output. Mitigation of LLM harm typically depends on annotating the harmfulness of LLM output, which is expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans. These findings underline the need for further work on harm mitigation in LLMs.

large language model, machine learning, natural language, (20 more...)