Benchmarking Reliability of Deep Learning Models for Pathological Gait Classification