Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models