Goto

Collaborating Authors

 Clustering


Review for NeurIPS paper: Simple and Scalable Sparse k-means Clustering via Feature Ranking

Neural Information Processing Systems

Summary and Contributions: This paper focuses on the problem of clustering in high dimension. K-means clustering is an extremely popular tool (especially in biomedical applications). However, as underlined by the authors, its performance is severely hindered in high-dimensional space --- leaving the data analyst no chance but to (a) apply some dimensionality reduction technique before performing the clustering or (b) selecting the features that are the most informative for the clustering and apply k-means on a subset of the features. This paper proposes a version of the later approach, choosing a sparse and interpretable subset of features. The setting is the following.



Review for NeurIPS paper: Sliding Window Algorithms for k-Clustering Problems

Neural Information Processing Systems

Summary and Contributions: The paper presents an algorithm for k-clustering in a sliding window streaming model, where k-clustering means the generalization of k-median and k-means to any fixed l_p-norm. The main theoretical result is an algorithm that achieves O(1)-approximation for points in arbitrary metric space and thus includes the prevalent of Euclidean metric, which is also used in the experimental evaluation. This algorithm is for sliding window streaming, where the algorithm repeatedly solves the clustering problem on the w most recent points in the stream (for parameter w). While the minimal requirement is to estimate the cost of a k-clustering, this algorithm also reports k center points. The usual motivation for this model is to allow old data to expire, and analyze only recent data.


Review for NeurIPS paper: Sliding Window Algorithms for k-Clustering Problems

Neural Information Processing Systems

The paper considers k clustering problem in the sliding window streaming model. However, the authors are urged to consider the reviews when preparing the final version including explicitly stating a bound on the constant factor approximation obtained.


Causally-Aware Unsupervised Feature Selection Learning

arXiv.org Artificial Intelligence

Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.


Development and Application of Self-Supervised Machine Learning for Smoke Plume and Active Fire Identification from the FIREX-AQ Datasets

arXiv.org Artificial Intelligence

Fire Influence on Regional to Global Environments and Air Quality (FIREX-AQ) was a field campaign aimed at better understanding the impact of wildfires and agricultural fires on air quality and climate. The FIREX-AQ campaign took place in August 2019 and involved two aircraft and multiple coordinated satellite observations. This study applied and evaluated a self-supervised machine learning (ML) method for the active fire and smoke plume identification and tracking in the satellite and sub-orbital remote sensing datasets collected during the campaign. Our unique methodology combines remote sensing observations with different spatial and spectral resolutions. The demonstrated approach successfully differentiates fire pixels and smoke plumes from background imagery, enabling the generation of a per-instrument smoke and fire mask product, as well as smoke and fire masks created from the fusion of selected data from independent instruments. This ML approach has a potential to enhance operational wildfire monitoring systems and improve decision-making in air quality management through fast smoke plume identification12 and tracking and could improve climate impact studies through fusion data from independent instruments.


Reviews: k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

The authors consider the problem of clustering a set of lines in R d. The goal is to minimize the k-means objective: given n lines L in R d find the best set of k points c1,...,ck in R d so as to minimize sum_{l in L} min_{ci} dist(ci, l) 2. This a clean, nicely motivated problem. The authors provide a coreset construction (namely a small size summary of the input so that any alpha-approximation for the summary yields an alpha(1 epsilon)-approximation for the entire input). This implies the first (1 epsilon)-approximation for the problem with running time nd exp(poly(k)) together with a streaming algorithm with similar running time and memory size 2 {poly(k)} log n. En route to the result the authors provide a bicriteria approximation algorithms: namely a solution that contains O(k (log n dk log k)) centers and whose cost is at most 4 times the cost of the optimal solution with k centers.


Reviews: k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

This paper proposes an PTAS for k-means clustering of lines. The key contribution is the construction of a small coreset, on which brute force algorithms are run. The authors also extend this to the streaming setting. An important computer vision application is used as motivation. The authors should revise the final version to address the issues raised by the reviewers, and make it more readable to researchers in related but not in the exact area.


A Comprehensive Survey on Spectral Clustering with Graph Structure Learning

arXiv.org Artificial Intelligence

Spectral clustering is a powerful technique for clustering high-dimensional data, utilizing graph-based representations to detect complex, non-linear structures and non-convex clusters. The construction of a similarity graph is essential for ensuring accurate and effective clustering, making graph structure learning (GSL) central for enhancing spectral clustering performance in response to the growing demand for scalable solutions. Despite advancements in GSL, there is a lack of comprehensive surveys specifically addressing its role within spectral clustering. To bridge this gap, this survey presents a comprehensive review of spectral clustering methods, emphasizing on the critical role of GSL. We explore various graph construction techniques, including pairwise, anchor, and hypergraph-based methods, in both fixed and adaptive settings. Additionally, we categorize spectral clustering approaches into single-view and multi-view frameworks, examining their applications within one-step and two-step clustering processes. We also discuss multi-view information fusion techniques and their impact on clustering data. By addressing current challenges and proposing future research directions, this survey provides valuable insights for advancing spectral clustering methodologies and highlights the pivotal role of GSL in tackling large-scale and high-dimensional data clustering tasks.


NLP-based assessment of prescription appropriateness from Italian referrals

arXiv.org Artificial Intelligence

Objective: This study proposes a Natural Language Processing pipeline to evaluate prescription appropriateness in Italian referrals, where reasons for prescriptions are recorded only as free text, complicating automated comparisons with guidelines. The pipeline aims to derive, for the first time, a comprehensive summary of the reasons behind these referrals and a quantification of their appropriateness. While demonstrated in a specific case study, the approach is designed to generalize to other types of examinations. Methods: Leveraging embeddings from a transformer-based model, the proposed approach clusters referral texts, maps clusters to labels, and aligns these labels with existing guidelines. We present a case study on a dataset of 496,971 referrals, consisting of all referrals for venous echocolordopplers of the lower limbs between 2019 and 2021 in the Lombardy Region. A sample of 1,000 referrals was manually annotated to validate the results. Results: The pipeline exhibited high performance for referrals' reasons (Prec=92.43%, Rec=83.28%) and excellent results for referrals' appropriateness (Prec=93.58%, Rec=91.52%) on the annotated subset. Analysis of the entire dataset identified clusters matching guideline-defined reasons - both appropriate and inappropriate - as well as clusters not addressed in the guidelines. Overall, 34.32% of referrals were marked as appropriate, 34.07% inappropriate, 14.37% likely inappropriate, and 17.24% could not be mapped to guidelines. Conclusions: The proposed pipeline effectively assessed prescription appropriateness across a large dataset, serving as a valuable tool for health authorities. Findings have informed the Lombardy Region's efforts to strengthen recommendations and reduce the burden of inappropriate referrals.