Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents