Nonlinear ISA with Auxiliary Variables for Learning Speech Representations