Estimating Jaccard Index with Missing Observations: A Matrix Calibration Approach

Wenye Li

Neural Information Processing Systems 

The Jaccard index is a standard statistic for comparing the pairwise similarity between data samples. This paper investigates the problem of estimating a Jaccard index matrix when there are missing observations in data samples. Starting from a Jaccard index matrix approximated from the incomplete data, our method calibrates the matrix to meet the requirement of positive semi-definiteness and other constraints, through a simple alternating projection algorithm. Compared with conventional approaches that estimate the similarity matrix based on the imputed data, our method has a strong advantage in that the calibrated matrix is guaranteed to be closer to the unknown ground truth in the Frobenius norm than the un-calibrated matrix (except in special cases where the two are identical). We carried out a series of empirical experiments and the results confirmed our theoretical justification. The evaluation also showed significantly improved results in real learning tasks on benchmark datasets.
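The calibration idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not spell out the "other constraints" or the exact projection scheme, so the code below assumes the constraint set is symmetric matrices with entries in [0, 1] and a unit diagonal, and it uses Dykstra's correction terms so that the iterates approach the nearest matrix in the intersection. The function names (`approx_jaccard`, `project_psd`, `project_box`, `calibrate`), the NaN encoding of missing values, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np


def approx_jaccard(X):
    """Pairwise Jaccard index of binary rows, using only co-observed features.

    X is an (n, d) float array with entries 0, 1, or np.nan (missing).
    """
    n = X.shape[0]
    J = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            seen = ~np.isnan(X[i]) & ~np.isnan(X[j])
            a, b = X[i, seen] == 1, X[j, seen] == 1
            union = np.sum(a | b)
            # Assumed convention: two empty sets are treated as identical.
            J[i, j] = J[j, i] = np.sum(a & b) / union if union > 0 else 1.0
    return J


def project_psd(M):
    """Frobenius-norm projection onto the positive semi-definite cone."""
    M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0, None)) @ V.T


def project_box(M):
    """Projection onto matrices with entries in [0, 1] and unit diagonal."""
    P = np.clip(M, 0.0, 1.0)
    np.fill_diagonal(P, 1.0)
    return P


def calibrate(J0, n_iter=200):
    """Alternate the two projections, with Dykstra's correction terms."""
    x = J0.copy()
    p = np.zeros_like(x)  # correction for the PSD projection
    q = np.zeros_like(x)  # correction for the box projection
    for _ in range(n_iter):
        y = project_psd(x + p)
        p = x + p - y
        x = project_box(y + q)
        q = y + q - x
    return x


# Three binary samples over four features, with two missing entries.
X = np.array([[1, 0, 1, np.nan],
              [1, 1, np.nan, 0],
              [0, 1, 1, 1]], dtype=float)
J0 = approx_jaccard(X)
C = calibrate(J0)
```

Both constraint sets are convex, and the true Jaccard matrix lies in their intersection under the assumptions above. Projecting onto a convex set that contains the ground truth cannot increase the Frobenius distance to it, which is the intuition behind the guarantee stated in the abstract. After a finite number of iterations the returned matrix satisfies the box constraints exactly and positive semi-definiteness only approximately.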