Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments