AITopics

Country: Asia > China (0.14)

Genre: Research Report (0.69)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

Neural Information Processing SystemsFeb-17-2026, 20:22:30 GMT

Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

Fine-tuning services for Large Language Models (LLMs) have emerged as a new paradigm.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Banking & Finance (0.93)
Information Technology > Security & Privacy (0.67)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Neural Information Processing SystemsFeb-17-2026, 14:30:46 GMT

e197fe307eb3467035f892dc100d570a-Supplemental-Conference.pdf

The process for calculating these metrics is described in Appendix C. Moreover, to ensure the comparability between prediction performance metrics and driving performance metrics in the radar plot, we normalize all metrics to the scale of [0, 1]. In the subsequent section, we provide an overview of the DESPOT planner. These two values can only be inferred from history. The safety is represented by the normalized collision rate.

artificial intelligence, despot planner, machine learning, (17 more...)

Country:

Asia > India (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

arXiv.org Artificial IntelligenceNov-3-2025

Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

Hu, Zixuan, Shen, Li, Wang, Zhenyi, Wei, Yongxian, Tao, Dacheng

Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2510.27172

Country: Asia (0.45)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.67)
Government > Military (0.67)

arXiv.org Artificial IntelligenceOct-14-2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Liu, Guozhi, Mu, Qi, Huang, Tiansheng, Wang, Xinhua, Shen, Li, Lin, Weiwei, Li, Zhang

Harmful fine-tuning issues present significant safety challenges for fine-tuning-as-a-service in large language models. Existing alignment-stage defenses, e.g., Vaccine, Repnoise, Booster, and T-Vaccine, mitigate harmful fine-tuning issues by enhancing the model's robustness during the alignment phase. While these methods have been proposed to mitigate the issue, they often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector to rank alignment data. Specifically, up-ranking high-quality and safety-critical alignment data, down-ranking low-quality and non-safety-critical data. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset leads to improvements in defense performance by 2.60\% and 3.30\%, respectively, and enhances inference performance by 3.50\% and 1.10\%. Notably, it reduces training time by 56.83\% and 57.63\%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.

large language model, machine learning, natural language, (17 more...)

2510.10085

Country: Asia > China > Guangdong Province (0.14)

Genre: Research Report > New Finding (0.94)

Industry:

Health & Medicine > Therapeutic Area > Vaccines (0.74)
Health & Medicine > Therapeutic Area > Immunology (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsOct-11-2025, 00:38:16 GMT

Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

Fine-tuning services for Large Language Models (LLMs) have emerged as a new paradigm.

arxiv preprint arxiv, dataset, fine-tuning, (14 more...)

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Banking & Finance (0.93)
Information Technology > Security & Privacy (0.67)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Machine LearningMar-24-2025

Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

Chen, Pin-Yu, Shen, Han, Das, Payel, Chen, Tianyi

Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.

large language model, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

2503.20807

Country: North America > United States > New York (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-19-2024

ProSec: Fortifying Code LLMs with Proactive Security Alignment

Xu, Xiangzhe, Su, Zian, Guo, Jinyao, Zhang, Kaiyuan, Wang, Zhenting, Zhang, Xiangyu

Recent advances in code-specific large language models (LLMs) have greatly enhanced code generation and refinement capabilities. However, the safety of code LLMs remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Previous work proposes to collect security-focused instruction-tuning dataset from real-world vulnerabilities. It is constrained by the data sparsity of vulnerable code, and has limited applicability in the iterative post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing error-inducing coding scenarios from Common Weakness Enumerations (CWEs), and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through advanced preference learning objectives. The scenarios synthesized by ProSec triggers 25 times more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7 times larger than the previous work. Experiments show that models trained with ProSec is 29.2% to 35.5% more secure compared to previous work, with a marginal negative effect of less than 2 percentage points on model's utility.

large language model, machine learning, natural language, (16 more...)

2411.12882

Country:

North America > United States (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Artificial IntelligenceOct-18-2024

Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment

Shao, Zedian, Liu, Hongbin, Mu, Jaden, Gong, Neil Zhenqiang

In a prompt injection attack, an attacker injects a prompt into the original one, aiming to make the LLM follow the injected prompt and perform a task chosen by the attacker. Existing prompt injection attacks primarily focus on how to blend the injected prompt into the original prompt without altering the LLM itself. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we show that an attacker can boost the success of prompt injection attacks by poisoning the LLM's alignment process. Specifically, we propose PoisonedAlign, a method to strategically create poisoned alignment samples. When even a small fraction of the alignment data is poisoned using our method, the aligned LLM becomes more vulnerable to prompt injection while maintaining its foundational capabilities. The code is available at https://github.com/Sadcardation/PoisonedAlign

large language model, machine learning, natural language, (20 more...)

2410.14827

Country: Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Huang, Chao-Wei, Chen, Yun-Nung

FactAlign: Long-form Factuality Alignment of Large Language Models

arXiv.org Artificial IntelligenceOct-2-2024

Large language models have demonstrated significant potential as the next-generation information access engines. However, their reliability is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework designed to enhance the factuality of LLMs' long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more information without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly available at https://github.com/MiuLab/FactAlign

factuality, hague convention, language model, (10 more...)

2410.01691

Country:

Europe > Netherlands > South Holland > The Hague (0.06)
Asia > Singapore (0.04)
Europe > Bosnia and Herzegovina (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry: Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)