On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research