Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory