Language Models can Evaluate Themselves via Probability Discrepancy