A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth