Towards a statistical theory of data selection under weak supervision