$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment