Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions