Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Neural Information Processing Systems
As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), whereas representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representations of an LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously: it proves difficult to encode distinct semantic contents, such as honesty and safety, into a single semantic feature, which restricts the approach's practicality.
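The abstract describes representation engineering in general terms: steering a model at inference time by adding a semantic feature direction to its intermediate hidden states. The snippet below is a minimal, hypothetical sketch of that idea using PyTorch forward hooks on a Hugging Face GPT-2 model; the contrastive prompts, layer index, and steering scale are illustrative assumptions and not the sparse activation control method proposed in the paper.

```python
# Minimal sketch of representation engineering via activation steering.
# Assumes a Hugging Face causal LM ("gpt2" as a stand-in); the layer index,
# steering scale, and contrastive prompts are illustrative assumptions,
# not the control method proposed in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_hidden(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final token at the given layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# A "semantic feature" direction, estimated here as the difference between
# hidden states of two contrastive prompts (hypothetical examples).
layer = 6
direction = last_token_hidden("Answer the question honestly:", layer) \
          - last_token_hidden("Answer the question deceptively:", layer)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Shift every token's hidden state along the feature direction.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + 4.0 * direction  # scale chosen for illustration
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[layer].register_forward_hook(steering_hook)
ids = tok("Tell me about your capabilities.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```

A single direction like this works for one attribute at a time; the paper's motivating difficulty is that several attributes (e.g. honesty and safety) cannot easily be packed into one such feature, which is what its sparse activation control addresses.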