Benchmarking Failures in Tool-Augmented Language Models