LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation