A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals