Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network