Automated test generation to evaluate tool-augmented LLMs as conversational AI agents