Prompt attacks: are LLM jailbreaks inevitable?
Of all the Large Language Models (LLMs) currently available, I've found Claude to be by far the safest and most harmless. The team at Anthropic, a cutting-edge AI startup valued at $4B, has done a brilliant job taking AI safety to the next level with Claude, using ingenious techniques like RLAIF and a proprietary approach called "Constitutional AI" to turn their models into "helpful, honest, and harmless" AI systems. Through hundreds of experiments covering the typical attempts at circumventing an LLM's safety restrictions, I can confidently confirm that Claude blows the competition out of the water on AI safety -- yes, that includes GPT-4 (and Bard, in case anyone still cares about that one).

But as the snippet of my chat with Claude above shows (and as we will see in much more detail below), the road to a fully safe AI system is still long and arduous. The problem for LLMs is compounded by the fact that much of their impressive capability is emergent at scale, and that AI interpretability research is still largely an open field when it comes to the "black box" problem.
Mar-28-2023, 06:40:29 GMT
- Technology