Dynamic benchmarking framework for LLM-based conversational data capture