Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification