superalignment


Selective Weak-to-Strong Generalization

Lang, Hao, Huang, Fei, Li, Yongbin

arXiv.org Artificial Intelligence

Future superhuman models will surpass human abilities, and humans will only be able to weakly supervise them. To alleviate the lack of high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes robustness issues, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
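The selection step this abstract describes can be sketched as a simple gate: a P(IK) ("I Know") classifier scores each question, and supervision falls back to the weak teacher only when the strong model is judged unlikely to know the answer. This is a minimal illustrative sketch, not the paper's implementation; the names `p_ik`, `strong_label`, `weak_label`, and the 0.5 threshold are all assumptions.

```python
# Hedged sketch of selective weak-to-strong supervision: a P(IK) gate
# decides per question whether to trust the strong model's own label
# or fall back to the weak supervisor's (possibly noisy) label.

def select_labels(questions, p_ik, strong_label, weak_label, threshold=0.5):
    """Return one training label per question.

    p_ik(q)         -> float in [0, 1], estimated P("I Know") for q
    strong_label(q) -> the strong model's self-generated label
    weak_label(q)   -> the weak supervisor's label
    """
    labels = []
    for q in questions:
        if p_ik(q) >= threshold:
            labels.append(strong_label(q))  # trust the self-generated label
        else:
            labels.append(weak_label(q))    # fall back to weak supervision
    return labels

# Toy demo with stub scoring and labeling functions:
confidence = {"easy-1": 0.9, "easy-2": 0.8, "hard-1": 0.2}
out = select_labels(
    ["easy-1", "easy-2", "hard-1"],
    p_ik=lambda q: confidence[q],
    strong_label=lambda q: f"self:{q}",
    weak_label=lambda q: f"weak:{q}",
)
# out -> ['self:easy-1', 'self:easy-2', 'weak:hard-1']
```

In the paper's framing, the weak labels that survive the gate would additionally be refined by graph smoothing before training; that step is omitted here.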


Super Co-alignment of Human and AI for Sustainable Symbiotic Society

Zeng, Yi, Zhao, Feifei, Wang, Yuwei, Lu, Enmeng, Yang, Yaodong, Wang, Lei, Liu, Chao, Liang, Yitao, Zhao, Dongcheng, Han, Bing, Tong, Haibo, Liang, Yao, Liang, Dongqi, Sun, Kang, Chen, Boyuan, Fan, Jinyu

arXiv.org Artificial Intelligence

As Artificial Intelligence (AI) advances toward Artificial General Intelligence (AGI) and eventually Artificial Superintelligence (ASI), it may potentially surpass human control, deviate from human values, and even lead to irreversible catastrophic consequences in extreme cases. This looming risk underscores the critical importance of the "superalignment" problem: ensuring that AI systems that are much smarter than humans remain aligned with human-compatible intentions and values. While current scalable oversight and weak-to-strong generalization methods demonstrate some applicability, they exhibit fundamental flaws in addressing the superalignment paradigm; notably, the unidirectional imposition of human values cannot accommodate superintelligence's autonomy or ensure AGI/ASI's stable learning. We contend that the values for a sustainable symbiotic society should be co-shaped by humans and living AI together, achieving "Super Co-alignment." Guided by this vision, we propose a concrete framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision-making, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity's evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguish good from evil, and proactively prioritize human well-being. The integration of externally driven oversight with intrinsically driven proactive alignment will co-shape symbiotic values and rules through iterative human-ASI co-alignment, paving the way to safe and beneficial AGI and ASI, for good, for humans, and for a symbiotic ecology.


Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity

Kim, HyunJin, Yi, Xiaoyuan, Yao, Jing, Huang, Muhua, Bak, JinYeong, Evans, James, Xie, Xing

arXiv.org Artificial Intelligence

The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to a critical research question: if we realize ASI, how do we align it with human values so that it benefits rather than harms human society, a.k.a. the superalignment problem? Despite ASI being regarded by many as a solely hypothetical concept, in this paper we argue that superalignment is achievable and that research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. Then we review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach for developing value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.


The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

Kim, HyunJin, Yi, Xiaoyuan, Yao, Jing, Lian, Jianxun, Huang, Muhua, Duan, Shitong, Bak, JinYeong, Xie, Xing

arXiv.org Artificial Intelligence

The emergence of large language models (LLMs) has sparked discussion on Artificial Superintelligence (ASI), a hypothetical AI system surpassing human intelligence. Though ASI is still hypothetical and far from current AI capabilities, existing alignment methods struggle to guide such advanced AI and ensure its safety in the future, so it is essential to discuss its alignment now. Superalignment, the alignment of AI systems at superhuman levels of capability with human values and safety requirements, aims to address two primary goals: scalability in supervision, to provide high-quality guidance signals, and robust governance, to ensure alignment with human values. In this survey, we review the original scalable oversight problem and corresponding methods and potential solutions for superalignment. Specifically, we introduce the challenges and limitations of current alignment paradigms in addressing the superalignment problem. Then we review scalable oversight methods for superalignment. Finally, we discuss the key challenges and propose pathways forward. (Figure 1: Challenges from the perspectives of supervision and governance. While the supervision perspective focuses on providing high-quality guidance signals for enhancing system competence, the governance perspective emphasizes aligning the behavior of advanced AI with human values to prevent harmful outcomes.)


The Superalignment of Superhuman Intelligence with Large Language Models

Huang, Minlie, Wang, Yingkang, Cui, Shiyao, Ke, Pei, Tang, Jie

arXiv.org Artificial Intelligence

We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As the application of such superhuman models becomes more and more popular, a critical question arises: how can we ensure superhuman models are still safe, reliable, and well aligned with human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question by outlining the learning paradigm shift from large-scale pretraining and supervised fine-tuning to alignment training. We define superalignment as designing effective and efficient alignment algorithms to learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes too complex for human experts to annotate and the model is stronger than human experts. We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation. We then present a conceptual framework for superalignment, which consists of three modules: an attacker, which generates adversarial queries trying to expose the weaknesses of a learner model; a learner, which refines itself by learning from scalable feedback generated by a critic model along with minimal human experts; and a critic, which generates critiques or explanations for a given query-response pair, with the goal of improving the learner through criticism. We discuss some important research problems in each component of this framework and highlight some interesting research ideas that are closely related to our proposed framework, for instance, self-alignment, self-play, self-refinement, and more. Lastly, we highlight some future research directions for superalignment, including identification of new emergent risks and multi-dimensional alignment.
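The attacker-learner-critic framework described above can be pictured as a simple loop. The sketch below is purely illustrative scaffolding under assumed interfaces (`generate_query`, `answer`, `evaluate`, `refine` are made-up method names, and the stub classes only log strings); a real system would replace each stub with a model and a training update.

```python
# Illustrative skeleton of the attacker-learner-critic loop:
# the attacker probes the learner, the critic critiques each
# query-response pair, and the learner refines itself from critiques.

class Attacker:
    def __init__(self):
        self.i = 0
    def generate_query(self, learner):
        self.i += 1
        return f"adversarial-query-{self.i}"

class Learner:
    def __init__(self):
        self.history = []
    def answer(self, query):
        return f"response-to-{query}"
    def refine(self, query, response, critique):
        # A real learner would take a training step here; we just log.
        self.history.append(critique)

class Critic:
    def evaluate(self, query, response):
        return f"critique-of-{response}"

def alignment_round(attacker, learner, critic, n_queries=3):
    """Run one round of the loop; return (query, response, critique) triples."""
    transcripts = []
    for _ in range(n_queries):
        query = attacker.generate_query(learner)
        response = learner.answer(query)
        critique = critic.evaluate(query, response)
        learner.refine(query, response, critique)
        transcripts.append((query, response, critique))
    return transcripts
```

Each round tightens the loop: the attacker targets the learner's current weaknesses, and the critic's output is the scalable feedback signal the abstract refers to.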


OpenAI putting 'shiny products' above safety, says departing researcher

The Guardian

A former senior employee at OpenAI has said the company behind ChatGPT is prioritising "shiny products" over safety, revealing that he quit after a disagreement over key aims reached "breaking point". Jan Leike was a key safety researcher at OpenAI as its co-head of superalignment, ensuring that powerful artificial intelligence systems adhere to human values and aims. His intervention comes before a global artificial intelligence summit in Seoul next week, where politicians, experts and tech executives will discuss oversight of the technology. Leike resigned days after the San Francisco-based company launched its latest AI model, GPT-4o. His departure means two senior safety figures at OpenAI have left this week following the resignation of Ilya Sutskever, OpenAI's co-founder and fellow co-head of superalignment.


A Moral Imperative: The Need for Continual Superalignment of Large Language Models

Puthumanaillam, Gokul, Vora, Manav, Thangeda, Pranay, Ornik, Melkior

arXiv.org Artificial Intelligence

This paper examines the challenges associated with achieving life-long superalignment in AI systems, particularly large language models (LLMs). Superalignment is a theoretical framework that aspires to ensure that superintelligent AI systems act in accordance with human values and goals. Despite its promising vision, we argue that achieving superalignment requires substantial changes to current LLM architectures due to their inherent limitations in comprehending and adapting to the dynamic nature of human ethics and evolving global scenarios. We dissect the challenges of encoding an ever-changing spectrum of human values into LLMs, highlighting the discrepancies between static AI models and the dynamic nature of human societies. To illustrate these challenges, we analyze two distinct examples: one demonstrates a qualitative shift in human values, while the other presents a quantifiable change. Through these examples, we illustrate how LLMs, constrained by their training data, fail to align with contemporary human values and scenarios. The paper concludes by exploring potential strategies to address and possibly mitigate these alignment discrepancies, suggesting a path forward in the pursuit of more adaptable and responsive AI systems.


Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts

Liu, Yuejiang, Alahi, Alexandre

arXiv.org Artificial Intelligence

Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noise, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervise the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervisions; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noise. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at https://github.com/yuejiangliu/csl.
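The noise-rejection idea in component (ii) can be sketched with a majority-consistency filter over multiple teachers: samples on which too few teachers agree are rejected as likely annotation noise. This is a minimal sketch standing in for the paper's teacher-student and local-global consistency checks; the function name, the voting rule, and `min_agree` are all illustrative assumptions, not the authors' method.

```python
# Hedged sketch of consistency-based noise rejection under
# co-supervision: route each sample to the majority label of a set of
# specialized teachers, and drop samples where teachers disagree.

from collections import Counter

def co_supervise(samples, teachers, min_agree=2):
    """Return (sample, label) pairs that pass the consistency filter.

    teachers:  list of callables, each mapping a sample to a label.
    min_agree: minimum number of teachers that must agree on a label
               for the sample to be kept for student training.
    """
    kept = []
    for x in samples:
        votes = Counter(t(x) for t in teachers)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:          # consistent supervision: keep
            kept.append((x, label))
        # else: reject as potential annotation noise
    return kept

# Toy demo: two threshold-based teachers plus one constant (noisy) one.
teachers = [
    lambda x: "A" if x < 5 else "B",
    lambda x: "A" if x < 3 else "B",
    lambda x: "C",
]
kept = co_supervise([1, 4, 7], teachers, min_agree=2)
# kept -> [(1, 'A'), (7, 'B')]; sample 4 is rejected (no two teachers agree)
```

In the paper's full method, teacher assignment also alternates with student training so that the growing student helps identify plausible supervision; that outer loop is omitted here.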


The Download: beyond CRISPR, and OpenAI's superalignment findings

MIT Technology Review

The news: Google DeepMind has used a large language model to crack a famous unsolved problem in pure mathematics. The researchers say it is the first time a large language model has been used to discover a solution to a long-standing scientific puzzle--producing verifiable and valuable new information that did not previously exist. Why it matters: Large language models have a reputation for making things up, not for providing new facts. Google DeepMind's new tool, called FunSearch, could change that. It shows that they can indeed make discoveries--if they are coaxed just so, and if you throw out the majority of what they come up with.


What the OpenAI drama means for AI progress -- and safety

Nature

OpenAI fired its charismatic chief executive, Sam Altman, on 17 November -- but has now reinstated him.Credit: Justin Sullivan/Getty OpenAI -- the company behind the blockbuster artificial intelligence (AI) bot ChatGPT -- has been consumed by frenzied changes for almost a week. On 17 November, the company fired its charismatic chief executive, Sam Altman. Five days, and much drama, later, OpenAI announced that Altman would return with an overhaul of the company's board. The debacle has thrown the spotlight on an ongoing debate about how commercial competition is shaping the development of AI systems, and how quickly AI can be deployed ethically and safely. "The push to retain dominance is leading to toxic competition. It's a race to the bottom," says Sarah Myers West, managing director of the AI Now Institute, a policy-research organization based in New York City.