The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs