Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement