Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models