How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation