lesswrong
Large Language Model Alignment: A Survey
Shen, Tianhao, Jin, Renren, Huang, Yufei, Liu, Chuang, Dong, Weilong, Guo, Zishan, Wu, Xinwei, Liu, Yan, Xiong, Deyi
Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues, including the models' interpretability and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interest in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs, in pursuit of LLMs that are both capable and safe.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > Middle East > Jordan (0.04)
- Overview (1.00)
- Research Report > Promising Solution (0.92)
- Law Enforcement & Public Safety (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
Why does advanced AI want not to be shut down? - LessWrong
I've always been pretty confused about this. The standard AI risk scenarios usually (though I think not always) suppose that advanced AI wants not to be shut down. As commonly framed, the AI will fool humanity into believing it is aligned so as not to be turned off, until - all at once - it destroys humanity and gains control over all earth's resources. But why does the AI want not to be shut down? The motivation behind a human wanting not to die comes from evolution.
Eliezer is still ridiculously optimistic about AI risk - LessWrong
They actually take his arguments seriously. If I wanted to blow my life savings on some wretched crypto scam I'd certainly listen to these guys about what the best scam to fall for was. This is what it looks like when the great hero of humanity, who has always been remarkably genre-savvy, realises that the movie he's in is 'Lovecraft-style Existential Cosmic Horror', rather than 'Rationalist Harry Potter Fanfic'. All power to Eliezer for having had a go. What sort of fool gives up before he's actually lost?
- Media > Film (0.56)
- Leisure & Entertainment (0.56)
Low impact agency: review and discussion
The problem of artificial intelligence safety can be seen as ensuring that an agent with the power to cause harm chooses not to do so. In the limit, the agent can be powerful enough that causing existential catastrophe is within its capabilities, and it has incentives to do so [6], so our task is to guarantee that it chooses not to. A possible approach is to penalize changes in the world caused by the agent, leading to the agent not causing catastrophe because that would lead to large changes in the world [24]. The hope is that this is a relatively easy objective to align the agent with, as opposed to aligning it with the full range of human values. So, our desideratum is that the AI achieves something while doing as little in the world as possible.
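A minimal sketch of the penalty idea above, assuming a toy setting where world states are feature tuples; the distance function, the weight, and all names are illustrative placeholders rather than the specific formulation in [24]:

```python
# Impact-penalized objective (toy sketch): reward the agent for its task,
# but subtract a penalty proportional to how much it changes the world
# relative to a "do nothing" baseline.

def state_distance(state_a, state_b):
    """Count how many features of the world differ between two states."""
    return sum(1 for a, b in zip(state_a, state_b) if a != b)

def penalized_reward(task_reward, actual_state, baseline_state, impact_weight=1.0):
    """Task reward minus a penalty for deviating from the no-op baseline."""
    penalty = state_distance(actual_state, baseline_state)
    return task_reward - impact_weight * penalty

# Example: the agent completes its task (reward 1.0) but changes three world
# features that the inaction baseline would have left untouched.
print(penalized_reward(1.0, actual_state=(1, 1, 0, 1),
                       baseline_state=(0, 0, 0, 0), impact_weight=0.2))  # -> 0.4
```

The design question, of course, is how to choose the baseline and the distance measure so that the penalty blocks catastrophic side effects without also blocking the useful work we want the agent to do.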
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > New York > New York County > New York City (0.04)
Roko's basilisk - Wikipedia
Roko's basilisk is a thought experiment which states that an otherwise benevolent artificial superintelligence (AI) in the future would be incentivized to create a virtual reality simulation to torture anyone who knew of its potential existence but did not directly contribute to its advancement or development.[1][2] It originated in a 2010 post at discussion board LessWrong, a technical forum focused on analytical rational enquiry.[1][3][4] The thought experiment's name derives from the poster of the article (Roko) and the basilisk, a mythical creature capable of destroying enemies with its stare. While the theory was initially dismissed as nothing but conjecture or speculation by many LessWrong users, LessWrong co-founder Eliezer Yudkowsky reported users who described symptoms such as nightmares and mental breakdowns upon reading the theory, due to its stipulation that knowing about the theory and its basilisk made you vulnerable to the basilisk itself.[1][5] This led to discussion of the basilisk being banned on the site for five years.[1][6]
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > California (0.04)
- Information Technology > Communications > Collaboration (0.70)
- Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.55)
- Information Technology > Communications > Social Media (0.50)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.47)
The Most Terrifying Thought Experiment of All Time
WARNING: Reading this article may commit you to an eternity of suffering and torment. These are some of the urban legends spawned by the Internet. Yet none is as all-powerful and threatening as Roko's Basilisk. For Roko's Basilisk is an evil, godlike form of artificial intelligence, so dangerous that if you see it, or even think about it too hard, you will spend the rest of eternity screaming in its torture chamber. Even death is no escape, for if you die, Roko's Basilisk will resurrect you and begin the torture again.
What's the actual evidence that AI marketing tools are changing preferences in a way that makes them easier to predict? - LessWrong
The issue is that different AI models still produce similar results in typecasting the audience. For example, Facebook, YouTube, and Spotify will offer equivalent recommendations based on your history (most likely with the most recent history weighted most heavily). As audiences are more and more typecast, they are offered products and services based on the same principles. Until AI models include some kind of "wild card" factor, the results will be homogeneous. The fact that you micro-segment and therefore see better advertising results does not indicate changing behaviors; the opposite is true: only when unrelated segments can be converted can AI be said to change preferences.
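A minimal sketch of the dynamic described above, assuming a toy tag-overlap recommender; the function names, the recency decay, and the optional "wild card" rate are all illustrative, not any platform's actual algorithm:

```python
import random

def recommend(history, catalog, top_k=5, recency_decay=0.8, wild_card_rate=0.0):
    """Score catalog items by overlap with a recency-weighted profile of the user's history."""
    profile = {}
    weight = 1.0
    for item in reversed(history):          # most recent items weighted most heavily
        for tag in item["tags"]:
            profile[tag] = profile.get(tag, 0.0) + weight
        weight *= recency_decay

    def score(item):
        return sum(profile.get(tag, 0.0) for tag in item["tags"])

    ranked = sorted(catalog, key=score, reverse=True)[:top_k]

    # "Wild card": occasionally swap in a random item to break homogeneity.
    if ranked and wild_card_rate > 0 and random.random() < wild_card_rate:
        ranked[-1] = random.choice(catalog)
    return ranked

# Two users with near-identical histories get near-identical recommendations
# unless the wild-card rate is raised above zero.
history = [{"tags": ["indie", "folk"]}, {"tags": ["folk"]}]
catalog = [{"tags": ["folk"]}, {"tags": ["metal"]}, {"tags": ["indie", "folk"]}]
print(recommend(history, catalog, top_k=2))
```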
Future ML Systems Will Be Qualitatively Different - LessWrong
In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More is Different in other domains as well, including biology, economics, and computer science. While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I'll use emergence to refer to qualitative changes that arise from quantitative increases in scale, and phase transitions for cases where the change is sharp.
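A toy numeric version of the uranium example, to make the sharp-versus-continuous distinction concrete; the model and numbers below are purely illustrative:

```python
# In a chain reaction, each neutron produces k neutrons on average.
# A small quantitative change in k around 1.0 flips the system between
# qualitatively different regimes: fizzling out vs. runaway growth.

def neutrons_after(k, generations=50, start=1.0):
    n = start
    for _ in range(generations):
        n *= k
    return n

for k in (0.95, 0.99, 1.01, 1.05):
    print(f"k = {k}: {neutrons_after(k):.3g} neutrons after 50 generations")
# k = 0.95 -> ~0.08 (dies out); k = 1.05 -> ~11.5 (and still growing)
```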
The Diversity of Argument-Making in the Wild: from Assumptions and Definitions to Causation and Anecdote in Reddit's "Change My View"
What kinds of arguments do people make, and what effect do they have on others? Normative constraints on argument-making are as old as philosophy itself, but little is known about the diversity of arguments made in practice. We use NLP tools to extract patterns of argument-making from the Reddit site "Change My View" (r/CMV). This reveals six distinct argument patterns: not just the familiar deductive and inductive forms, but also arguments about definitions, relevance, possibility and cause, and personal experience. Data from r/CMV also reveal differences in efficacy: personal experience and, to a lesser extent, arguments about causation and examples, are most likely to shift a person's view, while arguments about relevance are the least. Finally, our methods reveal a gradient of argument-making preferences among users: a two-axis model, of "personal--impersonal" and "concrete--abstract", can account for nearly 80% of the strategy variance between individuals.
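A minimal sketch of the kind of analysis the last sentence describes, assuming per-user frequency vectors over the six argument types; the data here is randomly generated for illustration, so the printed variance figure will not match the paper's ~80%:

```python
import numpy as np
from sklearn.decomposition import PCA

ARGUMENT_TYPES = ["deductive", "inductive", "definition",
                  "relevance", "possibility/cause", "personal experience"]

# Fake corpus: how often each of 200 users employs each argument type.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=4, size=(200, len(ARGUMENT_TYPES)))
freqs = counts / counts.sum(axis=1, keepdims=True)   # per-user strategy mix

# Ask how much of the between-user variance two latent axes can capture.
pca = PCA(n_components=2)
pca.fit(freqs)
print("variance explained by two axes:", pca.explained_variance_ratio_.sum())
print("loadings of the first axis on each argument type:")
for name, loading in zip(ARGUMENT_TYPES, pca.components_[0]):
    print(f"  {name:22s} {loading:+.2f}")
```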
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Government (1.00)
- Law (0.69)
- Media > News (0.61)
Looking for someone to run an online seminar on human learning - LessWrong
I'm looking for someone with a background in education and/or cognitive science to run an online seminar for non-rationalists on how humans learn things and how to efficiently teach a subject to others. A few examples of the sort of content I'm thinking of are: Ebbinghaus's research on memory, spaced repetition, the difference between shallow and deep learning of a subject. The exact content would be up to you. It would be a 1 hour seminar on May 29th, run via Zoom or a similar platform. If you're interested, please email me to discuss the details.