BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery