Towards In-context Scene Understanding