Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models