AITopics | beavertail

Collaborating Authors

beavertail

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Neural Information Processing SystemsDec-25-2025, 03:31:54 GMT

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs).

beavertail, improved safety alignment, name change, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

Add feedback

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Ghosh, Shaona, Bhattacharjee, Amrita, Ziser, Yftah, Parisien, Christopher

arXiv.org Artificial IntelligenceJun-6-2025

Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality, topic relevance, and without explicit refusal, and (iii) accomplishing this without a hard requirement of contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.0425

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.66)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment

Li, Moxin, Zhang, Yuantao, Wang, Wenjie, Shi, Wentao, Liu, Zhuo, Feng, Fuli, Chua, Tat-Seng

arXiv.org Artificial IntelligenceFeb-20-2025

Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at \url{https://github.com/zyttt-coder/SIPO}.

objective, pareto-optimal response, preference conflict, (15 more...)

arXiv.org Artificial Intelligence

2502.14354

Country:

Europe > Austria > Vienna (0.15)
North America > United States > Florida > Miami-Dade County > Miami (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry:

Education (0.68)
Health & Medicine > Therapeutic Area (0.67)
Health & Medicine > Health Care Providers & Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Out-of-Distribution Detection using Synthetic Data Generation

Abbas, Momin, Azmat, Muneeza, Horesh, Raya, Yurochkin, Mikhail

arXiv.org Artificial IntelligenceFeb-5-2025

Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.03323

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

Balepur, Nishant, Padmakumar, Vishakh, Yang, Fumeng, Feng, Shi, Rudinger, Rachel, Boyd-Graber, Jordan Lee

arXiv.org Artificial IntelligenceJan-20-2025

LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI)-abductively inferring personas of users who prefer chosen or rejected outputs-and Persona Tailoring (PT)-training models to tailor responses to personas from PI. We find: 1) LLMs infer personas accurately explaining why different users may prefer both chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.

large language model, machine learning, persona, (17 more...)

arXiv.org Artificial Intelligence

2501.11549

Country:

North America > United States > Florida > Miami-Dade County > Miami (0.04)
Asia > Singapore (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(12 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area > Environmental Medicine > Snake Bites (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Neural Information Processing SystemsJan-17-2025, 20:06:31 GMT

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails.

beavertail, human-preference dataset, improved safety alignment, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Correcting Large Language Model Behavior via Influence Function

Zhang, Han, Zhang, Zhuo, Zhang, Yi, Zhai, Yuanzhao, Peng, Hanyang, Lei, Yu, Yu, Yue, Wang, Hui, Liang, Bin, Gui, Lin, Xu, Ruifeng

arXiv.org Artificial IntelligenceDec-20-2024

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an Influence function-driven Bregman Optimization (IBO) technique to adjust the model's behavior based on these influence distributions. Our experiments demonstrate that LANCET effectively and efficiently correct inappropriate behaviors of LLMs. Furthermore, LANCET can outperform methods that rely on collecting human preferences, and it enhances the interpretability of learning human preferences within LLMs.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.16451

Country:

Asia > China (0.46)
North America (0.46)

Genre:

Research Report > Promising Solution (0.66)
Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Towards Inference-time Category-wise Safety Steering for Large Language Models

Bhattacharjee, Amrita, Ghosh, Shaona, Rebedea, Traian, Parisien, Christopher

arXiv.org Artificial IntelligenceOct-1-2024

While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such for safety is relatively under-explored. Unlike recent inferencetime safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices. Content Warning: This paper contains examples of harmful language.

activation, arxiv preprint arxiv, category, (15 more...)

arXiv.org Artificial Intelligence

2410.01174

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Arizona (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.64)

Industry:

Law Enforcement & Public Safety (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Finding Safety Neurons in Large Language Models

Chen, Jianhui, Wang, Xiaozhi, Yao, Zijun, Bai, Yushi, Hou, Lei, Li, Juanzi

arXiv.org Artificial IntelligenceJun-20-2024

Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% safety performance with intervention only on about $5$% of all the neurons. (2) Safety neurons encode transferrable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.

activation, neuron, safety neuron, (14 more...)

arXiv.org Artificial Intelligence

2406.14144

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Ji, Jiaming, Hong, Donghai, Zhang, Borong, Chen, Boyuan, Dai, Josef, Zheng, Boren, Qiu, Tianyi, Li, Boxun, Yang, Yaodong

arXiv.org Artificial IntelligenceJun-20-2024

In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, including dual-preference (helpfulness and harmlessness decoupled) and single-preference data (trade-off the helpfulness and harmlessness from scratch), respectively. Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.

arxiv preprint arxiv, category, information, (12 more...)

arXiv.org Artificial Intelligence

2406.15513

Country:

North America > United States > Texas (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Indonesia > Bali (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Law > Criminal Law (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback