Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
