Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection