Disentangled Acoustic Fields For Multimodal Physical Scene Understanding