Clustering
Can Explainable AI Assess Personalized Health Risks from Indoor Air Pollution?
Sarkar, Pritisha, Jala, Kushalava reddy, Saha, Mousumi
Although the effects of outdoor air pollution are widely acknowledged, the literature inadequately addresses the impacts of indoor air pollution. Despite the daily health risks, existing research has focused primarily on monitoring and lacks accuracy in pinpointing indoor pollution sources. In this work, we thoroughly investigate the influence of indoor activities on pollution levels. A survey of 143 participants revealed limited awareness of indoor air pollution. Leveraging 65 days of diverse data encompassing activities such as incense stick usage, indoor smoking, inadequately ventilated cooking, excessive AC usage, and accidental paper burning, we developed a comprehensive monitoring system. We identify pollutant sources and effects with high precision through clustering analysis and interpretability models (LIME and SHAP). Our method integrates Decision Tree, Random Forest, Naive Bayes, and SVM models, with Decision Trees performing best at 99.8% accuracy. Continuous 24-hour data allows personalized assessments for targeted pollution reduction strategies, achieving 91% accuracy in predicting activities and pollution exposure.
RAHN: A Reputation Based Hourglass Network for Web Service QoS Prediction
Chen, Xia, Du, Yugen, Tang, Guoxing, Luo, Yingwei, Ma, Benchi
As Web services become increasingly homogeneous, service recommendation becomes more difficult. Predicting Quality of Service (QoS) more efficiently and accurately is therefore an important challenge for service recommendation. Considering the excellent role of reputation and deep learning (DL) techniques in the field of QoS prediction, we propose a reputation and DL based QoS prediction network, RAHN, which contains the Reputation Calculation Module (RCM), the Latent Feature Extraction Module (LFEM), and the QoS Prediction Hourglass Network (QPHN). RCM obtains the user reputation and the service reputation by using a clustering algorithm and a Logit model. LFEM extracts latent features from known information to form an initial latent feature vector. QPHN aggregates latent feature vectors at different scales using an attention mechanism, and can be stacked multiple times to obtain the final latent feature vector for prediction. We evaluate RAHN on a real QoS dataset. The experimental results show that the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) of RAHN are smaller than those of the six baseline methods.
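For reference, the two error metrics named in the RAHN abstract can be computed as in the minimal sketch below. The sample values are hypothetical and are not taken from the paper's dataset; the function names are chosen for illustration only.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical QoS values (e.g. response times in seconds), for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 5.0]

print(mae(observed, predicted))   # 0.625
print(rmse(observed, predicted))  # 0.75
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two are commonly reported together.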
Coarsened confounding for causal effects: a large-sample framework
Causal inference methods are widely used for the rigorous analysis of observational studies and for policy evaluation. In this article, we consider coarsened exact matching, developed in Iacus et al. (2011). While they established some statistical properties, we study the approach using asymptotics based on a superpopulation inferential framework. This methodology is generalized to what we term coarsened confounding, for which we propose two new algorithms. We develop asymptotic results for the average causal effect estimator and provide conditions for consistency. In addition, we provide an asymptotic justification for the variance formulae in Iacus et al. (2011). A bias correction technique is proposed, and we apply the proposed methodology to data from two well-known observational studies.
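As background for the abstract above, the basic idea of coarsened exact matching can be sketched as follows: covariates are binned, units are matched exactly on their bins, strata lacking either treated or control units are pruned, and within-stratum differences in mean outcome are averaged. This is a simplified illustration under stated assumptions (fixed-width bins, weighting by the number of treated units); it is not the coarsened confounding algorithms or the bias correction proposed in the article, and the data are made up.

```python
from collections import defaultdict


def coarsen(value, width):
    """Map a continuous covariate to the index of its fixed-width bin."""
    return int(value // width)


def cem_att(units, widths):
    """Estimate the average treatment effect on the treated.

    units:  list of (treated: bool, covariates: tuple, outcome: float)
    widths: bin width per covariate (an illustrative coarsening choice)
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for treated, covariates, outcome in units:
        key = tuple(coarsen(x, w) for x, w in zip(covariates, widths))
        strata[key][treated].append(outcome)

    weighted_sum, n_treated = 0.0, 0
    for groups in strata.values():
        t, c = groups[True], groups[False]
        if not t or not c:
            continue  # prune strata without both treated and control units
        diff = sum(t) / len(t) - sum(c) / len(c)
        weighted_sum += diff * len(t)  # weight by number of treated units
        n_treated += len(t)
    if n_treated == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted_sum / n_treated


# Hypothetical data: one covariate (age), binned into 10-year intervals.
units = [
    (True, (23.0,), 10.0), (False, (27.0,), 7.0),
    (True, (41.0,), 12.0), (True, (44.0,), 14.0), (False, (48.0,), 9.0),
    (False, (65.0,), 3.0),  # no treated unit in this bin, so it is pruned
]
print(cem_att(units, widths=(10.0,)))
```

The choice of bin width controls the trade-off: coarser bins retain more units but allow larger covariate imbalance within each stratum.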
Chameleon2++: An Efficient Chameleon2 Clustering with Approximate Nearest Neighbors
Singh, Priyanshu, Ahuja, Kapil
Clustering algorithms are fundamental tools in data analysis, with hierarchical methods being particularly valuable for their flexibility. Chameleon is a widely used hierarchical clustering algorithm that excels at identifying high-quality clusters of arbitrary shapes, sizes, and densities. Chameleon2 is the most recent variant and has demonstrated significant improvements, but it suffers from critical failings and leaves room for further improvement. The first failing we address is that the complexity of Chameleon2 is claimed to be $O(n^2)$, while we demonstrate that it is actually $O(n^2\log{n})$, with $n$ being the number of data points. Furthermore, we suggest improvements to Chameleon2 that ensure that the complexity remains $O(n^2)$ with minimal to no loss of performance. The second failing of Chameleon2 is that it lacks transparency: it does not provide the fine-tuned algorithm parameters used to obtain the claimed results. We meticulously provide all such parameter values to enhance replicability. The improvement we make to Chameleon2 is to replace the exact $k$-NN search with an approximate $k$-NN search. This further reduces the algorithmic complexity to $O(n\log{n})$ without any performance loss. Here, we primarily configure three approximate nearest neighbor search algorithms (Annoy, FLANN and NMSLIB) to align with the overarching Chameleon2 clustering framework. Experimental evaluations on standard benchmark datasets demonstrate that the proposed Chameleon2++ algorithm is more efficient, robust, and computationally optimal.
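To illustrate why the exact $k$-NN step matters for complexity, the sketch below builds a $k$-NN graph by brute force: every point is compared with every other point, so the number of distance evaluations alone grows as $n(n-1)$. This is a generic baseline written for illustration, not code from Chameleon2 or Chameleon2++, and it does not use Annoy, FLANN, or NMSLIB; the function name and sample points are assumptions.

```python
import heapq
import math


def exact_knn_graph(points, k):
    """Return {i: [indices of the k nearest other points]} by brute force.

    Each of the n points is compared against the other n - 1 points, so the
    distance computations alone cost O(n^2); selecting the k smallest with a
    heap adds an O(log k) factor per comparison.
    """
    graph = {}
    for i, p in enumerate(points):
        candidates = (
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in heapq.nsmallest(k, candidates)]
    return graph


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
graph = exact_knn_graph(points, k=2)
print(graph)
```

An approximate nearest neighbor index avoids the all-pairs comparison by answering each query from a prebuilt search structure, which is the substitution the abstract describes.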
Signal Recovery Using a Spiked Mixture Model
Delacour, Paul-Louis, Wahls, Sander, Spraggins, Jeffrey M., Migas, Lukasz, Van de Plas, Raf
We introduce the spiked mixture model (SMM) to address the problem of estimating a set of signals from many randomly scaled and noisy observations. Subsequently, we design a novel expectation-maximization (EM) algorithm to recover all parameters of the SMM. Numerical experiments show that in low signal-to-noise ratio regimes, and for data types where the SMM is relevant, SMM surpasses the more traditional Gaussian mixture model (GMM) in terms of signal recovery performance. The broad relevance of the SMM and its corresponding EM recovery algorithm is demonstrated by applying the technique to different data types. The first case study is a biomedical research application, utilizing an imaging mass spectrometry dataset to explore the molecular content of a rat brain tissue section at micrometer scale. The second case study demonstrates SMM performance in a computer vision application, segmenting a hyperspectral imaging dataset into underlying patterns. While the measurement modalities differ substantially, in both case studies SMM is shown to recover signals that were missed by traditional methods such as k-means clustering and GMM.
MRG: A Multi-Robot Manufacturing Digital Scene Generation Method Using Multi-Instance Point Cloud Registration
Han, Songjie, Liu, Yinhua, Li, Yanzheng, Chen, Hua, Yang, Dongmei
A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud "segmentation-registration" generation method, this paper introduces a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) compared to state-of-the-art methods, MR and MP on the Scan2CAD dataset improve by 12.15% and 17.79%, respectively; and (3) on the Welding-Station dataset, MR and MP improve by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.
Deep Clustering via Community Detection
Deep clustering is an essential task in modern artificial intelligence, aiming to partition a set of data samples into a given number of homogeneous groups (i.e., clusters). Even though many Deep Neural Network (DNN) backbones and clustering strategies have been proposed for the task, achieving increasingly improved performance, deep clustering remains very challenging due to the lack of accurately labeled samples. In this paper, we propose a novel approach of deep clustering via community detection. It initializes clustering by detecting many communities, and then gradually expands clusters by community merging. Compared with the existing clustering strategies, community detection factors in the new perspective of cluster network analysis. As a result, it has the inherent benefit of high pseudo-label purity, which is critical to the performance of self-supervision. We have validated the efficacy of the proposed approach on benchmark image datasets. Our extensive experiments have shown that it can effectively improve the SOTA performance. Our ablation study also demonstrates that the new network perspective can effectively improve community pseudo-label purity, resulting in improved clustering performance.
LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data
Zhang, Yuxin, Chen, Haoyu, Lin, Zheng, Chen, Zhe, Zhao, Jin
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL) by organizing edge devices with similar data distributions into clusters, enabling collaborative model training tailored to each group. However, existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training, which leads to suboptimal performance. Moreover, traditional clustering methods incur significant computational overhead, especially as the number of edge devices increases. In this paper, we propose LCFed, an efficient CFL framework to combat these challenges. By leveraging model partitioning and adopting distinct aggregation strategies for each sub-model, LCFed effectively incorporates global knowledge into intra-cluster co-training, achieving optimal training performance. Additionally, LCFed customizes a computationally efficient model similarity measurement method based on low-rank models, enabling real-time cluster updates with minimal computational overhead. Extensive experiments show that LCFed outperforms state-of-the-art benchmarks in both test accuracy and clustering computational efficiency.
Recommender systems and reinforcement learning for human-building interaction and context-aware support: A text mining-driven review of scientific literature
Zhang, Wenhao, Quintana, Matias, Miller, Clayton
The indoor environment significantly impacts human health and well-being; enhancing health and reducing energy consumption in these settings is a central research focus. With the advancement of Information and Communication Technology (ICT), recommendation systems and reinforcement learning (RL) have emerged as promising approaches to induce behavioral changes to improve the indoor environment and energy efficiency of buildings. This study aims to employ text mining and Natural Language Processing (NLP) techniques to thoroughly examine the connections among these approaches in the context of human-building interaction and occupant context-aware support. The study analyzed 27,595 articles from the ScienceDirect database, revealing extensive use of recommendation systems and RL for space optimization, location recommendations, and personalized control suggestions. Furthermore, this review underscores the vast potential for expanding recommender systems and RL applications in buildings and indoor environments. Fields ripe for innovation include predictive maintenance, building-related product recommendation, and optimization of environments tailored for specific needs, such as sleep and productivity enhancements based on user feedback. The study also notes the limitations of the method in capturing subtle academic nuances. Future improvements could involve integrating and fine-tuning pre-trained language models to better interpret complex texts.
Text Clustering as Classification with LLMs
Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally shown to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. Our code is publicly available at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
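The two-stage flow described in this abstract (generate candidate labels, merge similar ones, then assign one label per text) can be sketched roughly as below. Everything here is an assumption made for illustration: the `llm` callable, the prompt wording, the merge step (reduced to case and whitespace normalization), and the `fake_llm` stub are hypothetical and are not taken from the authors' repository.

```python
from typing import Callable, Dict, List


def cluster_by_classification(
    texts: List[str], llm: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group texts by asking an LLM for labels, then for one label per text."""
    # Stage 1: ask for candidate labels for the whole dataset.
    label_prompt = "Propose topic labels, one per line, for these texts:\n"
    raw_labels = llm(label_prompt + "\n".join(texts)).splitlines()

    # Merge similar labels. Assumption: plain normalization stands in for
    # whatever merging procedure a real system would use.
    labels: List[str] = []
    for raw in raw_labels:
        label = raw.strip().lower()
        if label and label not in labels:
            labels.append(label)

    # Stage 2: ask for exactly one label per text.
    clusters: Dict[str, List[str]] = {label: [] for label in labels}
    for text in texts:
        prompt = (
            f"Labels: {', '.join(labels)}\n"
            f"Text: {text}\n"
            "Answer with exactly one label."
        )
        answer = llm(prompt).strip().lower()
        if answer not in clusters:
            answer = "unassigned"  # the reply did not match any known label
        clusters.setdefault(answer, []).append(text)
    return clusters


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if prompt.startswith("Propose"):
        return "Sports\nsports \nFinance"
    text = prompt.split("Text: ")[1].split("\n")[0]
    return "Sports" if "match" in text else "Finance"


texts = ["The match ended 2-1.", "Stocks fell sharply."]
result = cluster_by_classification(texts, fake_llm)
print(result)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; a real deployment would also need to handle long datasets that do not fit in a single prompt.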