Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting