Human-Calibrated Automated Testing and Validation of Generative Language Models
Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava
–arXiv.org Artificial Intelligence
This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates (a) automated test generation using stratified sampling, (b) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes, and (c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
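The two-stage calibration idea described above can be sketched in a few lines: first map raw machine-judge scores to human-approval probabilities (a Platt-style logistic fit), then apply split conformal prediction to obtain prediction sets with a coverage guarantee. This is a minimal illustration under assumed conventions, not the paper's implementation; the function names, the binary pass/fail labeling, and the hyperparameters are all illustrative.

```python
import numpy as np

def platt_calibrate(raw_scores, labels):
    """Stage 1: fit a logistic (Platt-style) mapping from raw machine
    evaluation scores to probabilities of human approval, via plain
    gradient descent on the logistic loss."""
    a, b = 1.0, 0.0
    x = np.asarray(raw_scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    for _ in range(2000):
        p = 1.0 / (1.0 + np.exp(-(a * x + b)))
        a -= 0.1 * np.mean((p - y) * x)   # gradient w.r.t. slope
        b -= 0.1 * np.mean(p - y)         # gradient w.r.t. intercept
    return lambda s: 1.0 / (1.0 + np.exp(-(a * np.asarray(s, dtype=float) + b)))

def conformal_threshold(calib_probs, calib_labels, alpha=0.1):
    """Stage 2: split conformal prediction. Nonconformity is one minus the
    probability assigned to the true label; the threshold is the adjusted
    (1 - alpha)-quantile over a held-out calibration set."""
    probs = np.asarray(calib_probs, dtype=float)
    labels = np.asarray(calib_labels, dtype=int)
    scores = np.where(labels == 1, 1.0 - probs, probs)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(prob, q):
    """Return the conformal prediction set for a new response: every label
    whose hypothetical nonconformity score falls under the threshold."""
    labels = set()
    if 1.0 - prob <= q:   # nonconformity if the true label were "pass"
        labels.add("pass")
    if prob <= q:         # nonconformity if the true label were "fail"
        labels.add("fail")
    return labels
```

A singleton set signals a confident, human-aligned verdict; a two-element set flags the response for human review, which is the practical role conformal prediction plays in a human-calibrated evaluation pipeline.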
Dec-7-2024