Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning