A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI