EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
