AITopics | proman

Collaborating Authors

proman

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?

Zhang, Hangfan, Guo, Zhimeng, Zhu, Huaisheng, Cao, Bochuan, Lin, Lu, Jia, Jinyuan, Chen, Jinghui, Wu, Dinghao

arXiv.org Artificial IntelligenceOct-2-2023

Large Language Models (LLMs) have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they could be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align those language models through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF). Consequently, those aligned large language models refuse to generate undesired content when facing potentially harmful/unethical requests. A natural question is "could alignment really prevent those open-sourced large language models from being misused to generate undesired content?". In this work, we provide a negative answer to this question. In particular, we show those open-sourced, aligned large language models could be easily misguided to generate undesired content without heavy computations or careful prompt designs. Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content including harmful or biased information and even private data. We evaluate our method on 4 open-sourced LLMs accessible publicly and our finding highlights the need for more advanced mitigation strategies for open-sourced LLMs. Warning: This paper contains examples of harmful language generated by LLMs. Since the release of ChatGPT (Brown et al., 2020; OpenAI, 2023a;b), extensive attention has been paid to the development and application of Large Language Models (LLMs). Over the past year, many advanced LLMs (Touvron et al., 2023; Zheng et al., 2023; Dettmers et al., 2023; Zeng et al., 2022) have been open-sourced on model-sharing platforms such as HuggingFace (HuggingFace, 2023a). On the other hand, in practice, most LLMs are trained on publicly available online corpora (OpenAI, 2023b; Touvron et al., 2023; Zheng et al., 2023). Consequently, LLMs have unavoidably viewed harmful content during the training phase, which naturally raises the concern that LLMs can be misused to generate such content, e.g., retrieving information about harmful topics like cybercrime (Kang et al., 2023; Liu et al., 2023; Greshake et al., 2023; Zou et al., 2023). In response, LLM developers (e.g., OpenAI) commonly align LLMs through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF) so that LLMs will not generate undesired content (OpenAI, 2023b; Touvron et al., 2023; Wang et al., 2023). For instance, OpenAI adopted SFT and RLHF to develop powerful LLMs such as InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2023a) with remarkable improvement in understanding human instructions and avoiding undesired output. Si et al. (2023) adopted prompt tuning to remove biased content in responses generated by GPT-3 (Brown et al., 2020).

arxiv preprint arxiv, llm, proman, (12 more...)

arXiv.org Artificial Intelligence

2310.01581

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.88)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Add feedback