With the huge influx of various data nowadays, extracting knowledge from them has become an interesting but tedious task among data scientists, particularly when the data come in heterogeneous form and have missing information. Many data completion techniques had been introduced, especially in the advent of kernel methods. However, among the many data completion techniques available in the literature, studies about mutually completing several incomplete kernel matrices have not been given much attention yet. In this paper, we present a new method, called Mutual Kernel Matrix Completion (MKMC) algorithm, that tackles this problem of mutually inferring the missing entries of multiple kernel matrices by combining the notions of data fusion and kernel matrix completion, applied on biological data sets to be used for classification task. We first introduced an objective function that will be minimized by exploiting the EM algorithm, which in turn results to an estimate of the missing entries of the kernel matrices involved. The completed kernel matrices are then combined to produce a model matrix that can be used to further improve the obtained estimates. An interesting result of our study is that the E-step and the M-step are given in closed form, which makes our algorithm efficient in terms of time and memory. After completion, the (completed) kernel matrices are then used to train an SVM classifier to test how well the relationships among the entries are preserved. Our empirical results show that the proposed algorithm bested the traditional completion techniques in preserving the relationships among the data points, and in accurately recovering the missing kernel matrix entries. By far, MKMC offers a promising solution to the problem of mutual estimation of a number of relevant incomplete kernel matrices.
A matrix completion problem, which aims to recover a complete matrix from its partial observations, is one of the important problems in the machine learning field and has been studied actively. However, there is a discrepancy between the mainstream problem setting, which assumes continuous-valued observations, and some practical applications such as recommendation systems and SNS link predictions where observations take discrete or even binary values. To cope with this problem, Davenport et al. (2014) proposed a binary matrix completion (BMC) problem, where observations are quantized into binary values. Hsieh et al. (2015) proposed a PU (Positive and Unlabeled) matrix completion problem, which is an extension of the BMC problem. This problem targets the setting where we cannot observe negative values, such as SNS link predictions. In the construction of their method for this setting, they introduced a methodology of the classification problem, regarding each matrix entry as a sample. Their risk, which defines losses over unobserved entries as well, indicates the possibility of the use of unobserved entries. In this paper, motivated by a semi-supervised classification method recently proposed by Sakai et al. (2017), we develop a method for the BMC problem which can use all of positive, negative, and unobserved entries, by combining the risks of Davenport et al. (2014) and Hsieh et al. (2015). To the best of our knowledge, this is the first BMC method which exploits all kinds of matrix entries. We experimentally show that an appropriate mixture of risks improves the performance.
Zhou, Joey Tianyi (Institute of High Performance Computing) | Pan, Sinno Jialin (Nanyang Technological University) | Tsang, Ivor W. (University of Technology) | Ho, Shen-Shyang (Nanyang Technological University)
Most existing heterogeneous transfer learning (HTL) methods for cross-language text classification rely on sufficient cross-domain instance correspondences to learn a mapping across heterogeneous feature spaces, and assume that such correspondences are given in advance. However, in practice, correspondences between domains are usually unknown. In this case, extensively manual efforts are required to establish accurate correspondences across multilingual documents based on their content and meta-information. In this paper, we present a general framework to integrate active learning to construct correspondences between heterogeneous domains for HTL, namely HTL through active correspondences construction (HTLA). Based on this framework, we develop a new HTL method. On top of the new HTL method, we further propose a strategy to actively construct correspondences between domains. Extensive experiments are conducted on various multilingual text classification tasks to verify the effectiveness of HTLA.
Matrix completion problems are the problems of recovering missing entries in a partially observed high dimensional matrix with or without noise. Such a problem is encountered in a wide range of applications such as collaborative filtering, global positioning and remote sensing. Most of the existing matrix completion algorithms assume a low rank structure of the underlying complete matrix and perform reconstruction through the recovery of the low-rank structure using singular value decomposition. In this paper, we propose an alternative and more flexible structure for the underlying true complete matrix for the purpose of matrix completion and denoising. Specifically, instead of assuming a low matrix rank, we assume the underlying complete matrix has a low Kronecker product rank structure. Such a structure is often seen in the matrix observations in signal processing and image processing applications. The Kronecker product structure also includes low rank singular value decomposition structure commonly used as one of its special cases. The extra flexibility assumed for the underlying structure allows for using much less number of parameters but also raises the challenge of determining the proper Kronecker product configuration to be used. In this article, we propose to use a class of information criteria for the determination of the proper configuration and study its empirical performance in matrix completion problems. Simulation studies show promising results that the true underlying configuration can be accurately selected by the information criteria and the accompanying matrix completion algorithm can produce more accurate matrix recovery with less number of parameters than the standard matrix completion algorithms.
Recent studies utilize multiple kernel learning to deal with incomplete-data problem. In this study, we introduce new methods that do not only complete multiple incomplete kernel matrices simultaneously, but also allow control of the flexibility of the model by parameterizing the model matrix. By imposing restrictions on the model covariance, overfitting of the data is avoided. A limitation of kernel matrix estimations done via optimization of an objective function is that the positive definiteness of the result is not guaranteed. In view of this limitation, our proposed methods employ the LogDet divergence, which ensures the positive definiteness of the resulting inferred kernel matrix. We empirically show that our proposed restricted covariance models, employed with LogDet divergence, yield significant improvements in the generalization performance of previous completion methods.