Intrinsic Evaluation of RAG Systems for Deep-Logic Questions