Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models