Analyzing Probabilistic Methods for Evaluating Agent Capabilities