Semi-supervised clustering for de-duplication