Statistical multi-metric evaluation and visualization of LLM system predictive performance