What AI evaluations for preventing catastrophic risks can and cannot do