Goto

Collaborating Authors

 Clustering


Open-world Semi-supervised Novel Class Discovery

arXiv.org Artificial Intelligence

Traditional semi-supervised learning tasks assume that both labeled and unlabeled data follow the same class distribution, but the realistic open-world scenarios are of more complexity with unknown novel classes mixed in the unlabeled set. Therefore, it is of great challenge to not only recognize samples from known classes but also discover the unknown number of novel classes within the unlabeled data. In this paper, we introduce a new open-world semi-supervised novel class discovery approach named OpenNCD, a progressive bi-level contrastive learning method over multiple prototypes. The proposed method is composed of two reciprocally enhanced parts. First, a bi-level contrastive learning method is introduced, which maintains the pair-wise similarity of the prototypes and the prototype group levels for better representation learning. Then, a reliable prototype similarity metric is proposed based on the common representing instances. Prototypes with high similarities will be grouped progressively for known class recognition and novel class discovery. Extensive experiments on three image datasets are conducted and the results show the effectiveness of the proposed method in open-world scenarios, especially with scarce known classes and labels.


funLOCI: a local clustering algorithm for functional data

arXiv.org Machine Learning

Nowadays, more and more problems are dealing with data with one infinite continuous dimension: functional data. In this paper, we introduce the funLOCI algorithm which allows to identify functional local clusters or functional loci, i.e., subsets/groups of functions exhibiting similar behaviour across the same continuous subset of the domain. The definition of functional local clusters leverages ideas from multivariate and functional clustering and biclustering and it is based on an additive model which takes into account the shape of the curves. funLOCI is a three-step algorithm based on divisive hierarchical clustering. The use of dendrograms allows to visualize and to guide the searching procedure and the cutting thresholds selection. To deal with the large quantity of local clusters, an extra step is implemented to reduce the number of results to the minimum.


Exploring and Exploiting Data Heterogeneity in Recommendation

arXiv.org Artificial Intelligence

Massive amounts of data are the foundation of data-driven recommendation models. As an inherent nature of big data, data heterogeneity widely exists in real-world recommendation systems. It reflects the differences in the properties among sub-populations. Ignoring the heterogeneity in recommendation data could limit the performance of recommendation models, hurt the sub-populational robustness, and make the models misled by biases. However, data heterogeneity has not attracted substantial attention in the recommendation community. Therefore, it inspires us to adequately explore and exploit heterogeneity for solving the above problems and assisting data analysis. In this work, we focus on exploring two representative categories of heterogeneity in recommendation data that is the heterogeneity of prediction mechanism and covariate distribution and propose an algorithm that explores the heterogeneity through a bilevel clustering method. Furthermore, the uncovered heterogeneity is exploited for two purposes in recommendation scenarios which are prediction with multiple sub-models and supporting debias. Extensive experiments on real-world data validate the existence of heterogeneity in recommendation data and the effectiveness of exploring and exploiting data heterogeneity in recommendation.


Detection of Interacting Variables for Generalized Linear Models via Neural Networks

arXiv.org Machine Learning

The quality of generalized linear models (GLMs), frequently used by insurance companies, depends on the choice of interacting variables. The search for interactions is time-consuming, especially for data sets with a large number of variables, depends much on expert judgement of actuaries, and often relies on visual performance indicators. Therefore, we present an approach to automating the process of finding interactions that should be added to GLMs to improve their predictive power. Our approach relies on neural networks and a model-specific interaction detection method, which is computationally faster than the traditionally used methods like Friedman H-Statistic or SHAP values. In numerical studies, we provide the results of our approach on artificially generated data as well as open-source data.


GFDC: A Granule Fusion Density-Based Clustering with Evidential Reasoning

arXiv.org Artificial Intelligence

Currently, density-based clustering algorithms are widely applied because they can detect clusters with arbitrary shapes. However, they perform poorly in measuring global density, determining reasonable cluster centers or structures, assigning samples accurately and handling data with large density differences among clusters. To overcome their drawbacks, this paper proposes a granule fusion density-based clustering with evidential reasoning (GFDC). Both local and global densities of samples are measured by a sparse degree metric first. Then information granules are generated in high-density and low-density regions, assisting in processing clusters with significant density differences. Further, three novel granule fusion strategies are utilized to combine granules into stable cluster structures, helping to detect clusters with arbitrary shapes. Finally, by an assignment method developed from Dempster-Shafer theory, unstable samples are assigned. After using GFDC, a reasonable clustering result and some identified outliers can be obtained. The experimental results on extensive datasets demonstrate the effectiveness of GFDC.


Transfer operators on graphs: Spectral clustering and beyond

arXiv.org Artificial Intelligence

Graphs and networks play an important role in modeling and analyzing complex interconnected systems such as transportation networks, integrated circuits, power grids, citation graphs, and biological and artificial neural networks. Graph clustering algorithms can be used to detect groups of strongly connected vertices and to derive coarse-grained models. We define transfer operators such as the Koopman operator and the Perron-Frobenius operator on graphs, study their spectral properties, introduce Galerkin projections of these operators, and illustrate how reduced representations can be estimated from data. In particular, we show that spectral clustering of undirected graphs can be interpreted in terms of eigenfunctions of the Koopman operator and propose novel clustering algorithms for directed graphs based on generalized transfer operators. We demonstrate the efficacy of the resulting algorithms on several benchmark problems and provide different interpretations of clusters.


PORTRAIT: a hybrid aPproach tO cReate extractive ground-TRuth summAry for dIsaster evenT

arXiv.org Artificial Intelligence

Disaster summarization approaches provide an overview of the important information posted during disaster events on social media platforms, such as, Twitter. However, the type of information posted significantly varies across disasters depending on several factors like the location, type, severity, etc. Verification of the effectiveness of disaster summarization approaches still suffer due to the lack of availability of good spectrum of datasets along with the ground-truth summary. Existing approaches for ground-truth summary generation (ground-truth for extractive summarization) relies on the wisdom and intuition of the annotators. Annotators are provided with a complete set of input tweets from which a subset of tweets is selected by the annotators for the summary. This process requires immense human effort and significant time. Additionally, this intuition-based selection of the tweets might lead to a high variance in summaries generated across annotators. Therefore, to handle these challenges, we propose a hybrid (semi-automated) approach (PORTRAIT) where we partly automate the ground-truth summary generation procedure. This approach reduces the effort and time of the annotators while ensuring the quality of the created ground-truth summary. We validate the effectiveness of PORTRAIT on 5 disaster events through quantitative and qualitative comparisons of ground-truth summaries generated by existing intuitive approaches, a semi-automated approach, and PORTRAIT. We prepare and release the ground-truth summaries for 5 disaster events which consist of both natural and man-made disaster events belonging to 4 different countries. Finally, we provide a study about the performance of various state-of-the-art summarization approaches on the ground-truth summaries generated by PORTRAIT using ROUGE-N F1-scores.


Neural Capacitated Clustering

arXiv.org Artificial Intelligence

Recent work on deep clustering has found new promising methods also for constrained clustering problems. Their typically pairwise constraints often can be used to guide the partitioning of the data. Many problems however, feature cluster-level constraints, e.g. the Capacitated Clustering Problem (CCP), where each point has a weight and the total weight sum of all points in each cluster is bounded by a prescribed capacity. In this paper we propose a new method for the CCP, Neural Capacited Clustering, that learns a neural network to predict the assignment probabilities of points to cluster centers from a data set of optimal or near optimal past solutions of other problem instances. During inference, the resulting scores are then used in an iterative k-means like procedure to refine the assignment under capacity constraints. In our experiments on artificial data and two real world datasets our approach outperforms several state-of-the-art mathematical and heuristic solvers from the literature. Moreover, we apply our method in the context of a cluster-first-route-second approach to the Capacitated Vehicle Routing Problem (CVRP) and show competitive results on the well-known Uchoa benchmark.


Incomplete Multi-view Clustering via Diffusion Completion

arXiv.org Artificial Intelligence

Incomplete multi-view clustering is a challenging and non-trivial task to provide effective data analysis for large amounts of unlabeled data in the real world. All incomplete multi-view clustering methods need to address the problem of how to reduce the impact of missing views. To address this issue, we propose diffusion completion to recover the missing views integrated into an incomplete multi-view clustering framework. Based on the observable views information, the diffusion model is used to recover the missing views, and then the consistency information of the multi-view data is learned by contrastive learning to improve the performance of multi-view clustering. To the best of our knowledge, this may be the first work to incorporate diffusion models into an incomplete multi-view clustering framework. Experimental results show that the proposed method performs well in recovering the missing views while achieving superior clustering performance compared to state-of-the-art methods.


Free Lunch for Privacy Preserving Distributed Graph Learning

arXiv.org Artificial Intelligence

Learning on graphs is becoming prevalent in a wide range of applications including social networks, robotics, communication, medicine, etc. These datasets belonging to entities often contain critical private information. The utilization of data for graph learning applications is hampered by the growing privacy concerns from users on data sharing. Existing privacy-preserving methods pre-process the data to extract user-side features, and only these features are used for subsequent learning. Unfortunately, these methods are vulnerable to adversarial attacks to infer private attributes. We present a novel privacy-respecting framework for distributed graph learning and graph-based machine learning. In order to perform graph learning and other downstream tasks on the server side, this framework aims to learn features as well as distances without requiring actual features while preserving the original structural properties of the raw data. The proposed framework is quite generic and highly adaptable. We demonstrate the utility of the Euclidean space, but it can be applied with any existing method of distance approximation and graph learning for the relevant spaces. Through extensive experimentation on both synthetic and real datasets, we demonstrate the efficacy of the framework in terms of comparing the results obtained without data sharing to those obtained with data sharing as a benchmark. This is, to our knowledge, the first privacy-preserving distributed graph learning framework.