NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?