Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models