On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models