Evaluation of Large Language Models via Coupled Token Generation