Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment