Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet