A Supplementary Material
Neural Information Processing Systems
Table 6: Mercury-eval encompasses 256 tasks, the difficulty of which has been balanced for model evaluation.
Code execution in the sandbox is subject to two primary constraints: a time limit and a memory limit. The memory limit caps the amount of RAM that a process can consume. The sandbox also employs an isolated file system to provide a safe execution environment for the code. This prevents the code from directly calling functions in the host's system libraries, which could result in unpredictable behavior. Figure 5 shows an overview of the code execution pipeline.
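The two execution constraints can be illustrated with a minimal sketch, assuming a Linux host and Python snippets as the code under test. This is not Mercury's actual sandbox: the limit values, `run_limited`, and `RunResult` are hypothetical names chosen for illustration, and the isolated file system is not reproduced. The sketch only enforces the time limit with the standard-library `subprocess.run(timeout=...)` and the memory limit with `resource.setrlimit(resource.RLIMIT_AS, ...)` applied in the child process.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional

# Hypothetical limits; the source does not state the actual values.
TIME_LIMIT_S = 5.0
MEMORY_LIMIT_BYTES = 512 * 1024 * 1024  # 512 MiB


@dataclass
class RunResult:
    status: str                # "ok", "timeout", or "error"
    returncode: Optional[int]  # None when the process was killed on timeout
    stdout: str


def _apply_memory_limit() -> None:
    # Runs in the child between fork() and exec(): cap its virtual address space,
    # so allocations beyond the limit fail (Python raises MemoryError).
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))


def run_limited(code: str, time_limit_s: float = TIME_LIMIT_S) -> RunResult:
    """Run a Python snippet in a separate process under time and memory limits."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=time_limit_s,             # child is killed if it runs too long
            preexec_fn=_apply_memory_limit,   # POSIX only
        )
    except subprocess.TimeoutExpired:
        return RunResult("timeout", None, "")
    status = "ok" if proc.returncode == 0 else "error"
    return RunResult(status, proc.returncode, proc.stdout)


if __name__ == "__main__":
    print(run_limited("print(sum(range(10)))"))
    print(run_limited("while True: pass", time_limit_s=1.0))
    print(run_limited("x = bytearray(1024 ** 3)"))  # exceeds the hypothetical 512 MiB cap
```

Note that `RLIMIT_AS` bounds virtual address space rather than resident RAM, so it is a conservative stand-in for a RAM cap. A production sandbox would typically add file-system isolation (for example via containers or namespaces), which this sketch deliberately leaves out.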
Sep-25-2025, 07:18:13 GMT