Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study