JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles

Oct-14-2024–arXiv.org Artificial Intelligence

We introduce JurEE, an ensemble of efficient, encoder-only transformer models designed to strengthen safeguards in AI-User interactions within LLM-based systems. Unlike existing LLM-as-Judge methods, which often struggle with generalization across risk taxonomies and only provide textual outputs, JurEE offers probabilistic risk estimates across a wide range of prevalent risks. Our approach leverages diverse data sources and employs progressive synthetic data generation techniques, including LLM-assisted augmentation, to enhance model robustness and performance. We create an in-house benchmark comprising of other reputable benchmarks such as the OpenAI Moderation Dataset and ToxicChat, where we find JurEE significantly outperforms baseline models, demonstrating superior accuracy, speed, and cost-efficiency. This makes it particularly suitable for applications requiring stringent content moderation, such as customer-facing chatbots. The encoder-ensemble's modular design allows users to set tailored risk thresholds, enhancing its versatility across various safety-related applications. JurEE's collective decision-making process, where each specialized encoder model contributes to the final output, not only improves predictive accuracy but also enhances interpretability. This approach provides a more efficient, performant, and economical alternative to traditional LLMs for large-scale implementations requiring robust content moderation.

application, evaluation, language model, (14 more...)

arXiv.org Artificial Intelligence

Oct-14-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Michigan > Washtenaw County > Ann Arbor (0.04)
- Europe > Portugal
  - Lisbon > Lisbon (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Law (1.00)
- Information Technology > Security & Privacy (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.34)