Evaluating Consistency and Reasoning Capabilities of Large Language Models