Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws