 Clustering


Towards efficient compression and communication for prototype-based decentralized learning

arXiv.org Artificial Intelligence

In prototype-based federated learning, the exchange of model parameters between clients and the master server is replaced by the transmission of prototypes, or quantized versions of the data samples, to the aggregation server. A fully decentralized deployment of prototype-based learning, without a central aggregator of prototypes, is more robust to network failures and reacts faster to changes in the statistical distribution of the data, suggesting potential advantages and quick adaptation in dynamic learning tasks, e.g., when the data sources are IoT devices or when data is non-iid. In this paper, we consider the problem of designing a communication-efficient decentralized learning system based on prototypes. We address the challenge of prototype redundancy by leveraging a twofold data compression technique, i.e., sending update messages only if the prototypes are information-theoretically useful (via the Jensen-Shannon distance), and using clustering on the prototypes to compress the update messages used in the gossip protocol. We also use parallel instead of sequential gossiping, and present an analysis of its age-of-information (AoI). Our experimental results show that, with these improvements, the communications load can be substantially reduced without decreasing the convergence rate of the learning algorithm. Federated Learning (FL) [1], [2], [3] and Decentralized Federated Learning (DFL) [4], [5] provide good approaches for distributed machine learning systems where the main focus is the minimization of a global loss function using different versions of a model created by multiple clients. These approaches have been extensively studied in the literature and applied, traditionally, to process private data in areas such as health and banking. In this paper, unlike these well-known approaches, we focus on the analysis and implementation of a decentralized machine learning system based on prototypes.
On the one hand, our choice of prototype-based algorithms is motivated by the advantages of these prototypes as compact representations of the data, capturing the essential features and patterns within the dataset.
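The abstract above gates update messages on the Jensen-Shannon distance. A minimal sketch of that idea follows. It assumes each node summarizes its prototypes as a normalized histogram. The function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence.

    With base-2 logarithms the value lies in [0, 1].
    """
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off

def should_send_update(local_hist, last_sent_hist, threshold=0.1):
    """Hypothetical gate: send only if the summary moved far enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return js_distance(local_hist, last_sent_hist) > threshold
```

Under these assumptions, a node whose summary has not changed since its last message stays silent, which is the source of the communication savings the abstract describes.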


Comparative Evaluation of Clustered Federated Learning Methods

arXiv.org Machine Learning

Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning which preserves data privacy. As the method evolved and was confronted with various real-world scenarios, new challenges have emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups where the distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested using a few cases of data heterogeneities, without systematically justifying the choices. Further, the taxonomy used for differentiating the different heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneities in federated learning (FL). We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performance and data heterogeneity scenarios.
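Extrinsic clustering metrics compare a computed partition against known reference labels, here the heterogeneity class of each client. The abstract does not say which metrics are used. The sketch below shows the adjusted Rand index as one common example, and its function name and toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference and a predicted partition.

    1.0 means identical partitions (up to label renaming); values near 0
    mean agreement no better than chance.
    """
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count co-occurrences and marginal cluster sizes.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_pairs - expected) / (max_index - expected)

# Toy example: client clusters that match the reference classes up to renaming.
reference = [0, 0, 1, 1]
clusters = [1, 1, 0, 0]
score = adjusted_rand_index(reference, clusters)
```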


The Femininomenon of Inequality: A Data-Driven Analysis and Cluster Profiling in Indonesia

arXiv.org Artificial Intelligence

This study addresses the persistent challenges of Workplace Gender Equality (WGE) in Indonesia, examining regional disparities in gender empowerment and inequality through the Gender Empowerment Index (IDG) and Gender Inequality Index (IKG). Despite Indonesia's economic growth and incremental progress in gender equality, as indicated by improvements in the IDG and IKG scores from 2018 to 2023, substantial regional differences remain. Utilizing k-means clustering, the study identifies two distinct clusters of regions with contrasting gender profiles. Cluster 0 includes regions like DKI Jakarta and Central Java, characterized by higher gender empowerment and lower inequality, while Cluster 1 comprises areas such as Papua and North Maluku, where gender disparities are more pronounced. The analysis reveals that local socio-economic conditions and governance frameworks play a critical role in shaping regional gender dynamics. Correlation analyses further demonstrate that higher empowerment is generally associated with lower inequality and greater female representation in professional roles. These findings underscore the importance of targeted, region-specific interventions to promote WGE, addressing both structural and cultural barriers. The insights provided by this study aim to guide policymakers in developing tailored strategies to foster gender equality and enhance women's participation in the workforce across Indonesia's diverse regions.


Clustered Patch Embeddings for Permutation-Invariant Classification of Whole Slide Images

arXiv.org Artificial Intelligence

In the evolving field of digital pathology, Whole Slide Imaging (WSI) has emerged as a transformative technology, enabling the digitization of histopathological slides at gigapixel resolution. This advancement has not only facilitated remote diagnostics and educational opportunities but also opened new avenues for quantitative image analysis [1, 2]. Despite its potential, the sheer size and complexity of WSIs pose significant computational challenges, limiting the practicality of large-scale analysis and the application of advanced machine learning techniques [3, 4]. Whole slide imaging (WSI) represents a significant breakthrough in digital pathology, enabling the digitization of histological slides at high resolutions. This advancement allows for improved visualization, analysis, and management of tissue samples, essential for accurate disease diagnosis and research. However, the sheer size and complexity of WSIs pose unique challenges in image processing and analysis, necessitating innovative approaches for efficient and effective feature extraction and classification. Traditional methods for analyzing WSIs often rely on supervised learning techniques, which require extensive annotated datasets prepared by expert pathologists. This process is not only time-consuming but also prone to variability due to inter-observer differences.


Unsupervised Parameter-free Outlier Detection using HDBSCAN* Outlier Profiles

arXiv.org Artificial Intelligence

In machine learning and data mining, outliers are data points that significantly differ from the dataset and often introduce irrelevant information that can induce bias in its statistics and models. Therefore, unsupervised methods are crucial to detect outliers if there is limited or no information about them. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN*, a state-of-the-art hierarchical clustering method. GLOSH estimates an outlier score for each data point by comparing its density to the highest density of the region in which it resides in the HDBSCAN* hierarchy. GLOSH may be sensitive to HDBSCAN*'s minpts parameter, which influences density estimation. With limited knowledge about the data, choosing an appropriate minpts value beforehand is challenging, as one or some minpts values may better represent the underlying cluster structure than others. Additionally, in the process of searching for "potential outliers", one has to define the number of outliers n that a dataset has, which may be impractical and is often unknown. In this paper, we propose an unsupervised strategy to find the "best" minpts value, leveraging the range of GLOSH scores across minpts values to identify the value for which GLOSH scores can best separate outliers from the rest of the dataset. Moreover, we propose an unsupervised strategy to estimate a threshold for classifying points into inliers and (potential) outliers without the need to pre-define any value. Our experiments show that our strategies can automatically find the minpts value and threshold that yield the best or near-best outlier detection results using GLOSH.


Tackling Polysemanticity with Neuron Embeddings

arXiv.org Artificial Intelligence

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to ... One common method for interpreting the behaviour of a neuron in a language model is to collect and study the dataset examples which cause the highest neuron activation. Patterns in a neuron's dataset examples provide an indication of what the neuron responds to. However, polysemanticity makes these dataset examples much harder to interpret, as there are often many separate behaviours to understand, some of which may be related and others entirely distinct. This becomes increasingly challenging as you collect examples further down the activation spectrum, which is important for gaining a complete understanding of a neuron, but often reveals a wider range of behaviours (Bolukbasi et al., 2021).


Fast Disentangled Slim Tensor Learning for Multi-view Clustering

arXiv.org Artificial Intelligence

Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still encounter some limitations. (1) Most of them explore the correlations among different affinity matrices, making them unscalable to large-scale data. (2) Although some methods address it by introducing bipartite graphs, they may result in sub-optimal solutions caused by an unstable anchor selection process. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches. The code of DSTL is available at https://github.com/dengxu-nju/DSTL.


Fair Summarization: Bridging Quality and Diversity in Extractive Summaries

arXiv.org Artificial Intelligence

Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods using the Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. The results, obtained using a comprehensive set of summarization quality metrics such as SUPERT, BLANC, SummaQA, BARTScore, and UniEval, as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research in fairness-aware NLP models.


A multi-dimensional unsupervised machine learning framework for clustering residential heat load profiles

arXiv.org Artificial Intelligence

Central to achieving the energy transition, heating systems provide essential space heating and hot water in residential and industrial environments. A major challenge lies in effectively profiling large clusters of buildings to improve demand estimation and enable efficient Demand Response (DR) schemes. This paper addresses this challenge by introducing an unsupervised machine learning framework for clustering residential heating load profiles, focusing on natural gas space heating and hot water preparation boilers. The profiles are analyzed across five dimensions: boiler usage, heating demand, weather conditions, building characteristics, and user behavior. We apply three distance metrics: Euclidean Distance (ED), Dynamic Time Warping (DTW), and Derivative Dynamic Time Warping (DDTW), and evaluate their performance using established clustering indices. The proposed method is assessed considering 29 residential buildings in Greece equipped with smart meters throughout a calendar heating season (i.e., 210 days). Results indicate that DTW is the most suitable metric, uncovering strong correlations between boiler usage, heat demand, and temperature, while ED highlights broader interrelations across dimensions and DDTW proves less effective, resulting in weaker clusters. These findings offer key insights into heating load behavior, establishing a solid foundation for developing more targeted and effective DR programs.
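Dynamic Time Warping, the distance the abstract reports as most suitable, can be sketched with the classic dynamic-programming recurrence. This is a minimal, unconstrained version using an absolute-difference local cost. The paper's exact configuration (windowing, normalization, multi-dimensional handling) is not stated here, so those choices are assumptions, as are the toy series.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance between two numeric sequences.

    cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

# Toy load profiles (illustrative values only): the second is the first
# with one sample repeated, so warping aligns them at zero cost.
profile_a = [1.0, 2.0, 3.0]
profile_b = [1.0, 2.0, 2.0, 3.0]
distance = dtw_distance(profile_a, profile_b)
```

Unlike the Euclidean distance, this comparison tolerates profiles that are shifted or stretched in time, and it does not require equal lengths.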


'Explaining RL Decisions with Trajectories': A Reproducibility Study

arXiv.org Artificial Intelligence

This work investigates the reproducibility of the paper "Explaining RL decisions with trajectories" by Deshmukh et al. (2023). The original paper introduces a novel approach in explainable reinforcement learning based on attributing the decisions of an agent to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the trajectories attributed to the decision of the agent. We recover the environments used by the authors based on the partial original code they provided for one of the environments (Grid-World), and implement the remaining ones from scratch (Seaquest, HalfCheetah, Breakout, and Q*Bert). While we confirm that (i), (ii), and (iii) partially hold, we extend the largely qualitative experiments from the authors by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. We conclude that, while some of the claims can be supported, further investigations and experiments could be of interest. We recognize the novelty of the work from the authors and hope that our work paves the way for clearer and more transparent approaches.