Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing