Clustering
ECORS: An Ensembled Clustering Approach to Eradicate The Local And Global Outlier In Collaborative Filtering Recommender System
Recommender systems are designed to suggest items based on user preferences, helping users navigate the vast amount of information available on the internet. Given the overwhelming content, outlier detection has emerged as a key research area in recommender systems. It involves identifying unusual or suspicious patterns in user behavior. However, existing studies in this field face several challenges, including the limited universality of algorithms, difficulties in selecting users, and a lack of optimization. In this paper, we propose an approach that addresses these challenges by employing various clustering algorithms. Specifically, we utilize a user-user matrix-based clustering technique to detect outliers. By constructing a user-user matrix, we can identify suspicious users in the system. Both local and global outliers are detected to ensure comprehensive analysis. Our experimental results demonstrate that this approach significantly improves the accuracy of outlier detection in recommender systems.
Linear Projections of Teacher Embeddings for Few-Class Distillation
Loo, Noel, Iliopoulos, Fotis, Hu, Wei, Vee, Erik
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher's output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher's internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model's generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher's embedding space, and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-classes. Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrate the LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.
Discriminative community detection for multiplex networks
Ortiz-Bouza, Meiby, Aviyente, Selin
Multiplex networks have emerged as a promising approach for modeling complex systems, where each layer represents a different mode of interaction among entities of the same type. A core task in analyzing these networks is to identify the community structure for a better understanding of the overall functioning of the network. While different methods have been proposed to detect the community structure of multiplex networks, the majority deal with extracting the consensus community structure across layers. In this paper, we address the community detection problem across two closely related multiplex networks. For example in neuroimaging studies, it is common to have multiple multiplex brain networks where each layer corresponds to an individual and each group to different experimental conditions. In this setting, one may be interested in both learning the community structure representing each experimental condition and the discriminative community structure between two groups. In this paper, we introduce two discriminative community detection algorithms based on spectral clustering. The first approach aims to identify the discriminative subgraph structure between the groups, while the second one learns the discriminative and the consensus community structures, simultaneously. The proposed approaches are evaluated on both simulated and real world multiplex networks.
ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation
Silva, Fillipe dos Santos, Kakimoto, Gabriel Kenzo, Reis, Julio Cesar dos, Reis, Marcelo S.
Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing. These contexts often involve multimodal data, including both tabular and textual datasets, making it challenging to represent hidden patterns for obtaining meaningful clusters. This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded tabular data and generate embeddings from the fine-tuned model. ERASMO employs a textual converter to transform tabular data into a textual format, enabling the language model to process and understand the data more effectively. Additionally, ERASMO produces contextually rich and structurally representative embeddings through techniques such as random feature sequence shuffling and number verbalization. Extensive experimental evaluations were conducted using multiple datasets and baseline approaches. Our results demonstrate that ERASMO fully leverages the specific context of each tabular dataset, leading to more precise and nuanced embeddings for accurate clustering. This approach enhances clustering performance by capturing complex relationship patterns within diverse tabular data.
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Grangier, David, Fan, Simin, Seto, Skyler, Ablin, Pierre
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amount for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust the training distribution of the generalist data with guidance from the limited domain-specific data. We explore several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for pretraining and continued pretraining, it works well in multi-task settings. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes. Generalist language models (LMs) can address a wide variety of tasks, but this generality comes at a cost (Brown et al., 2020). It necessitates a large training set representative of all prospective tasks, as well as a large model to fit such a comprehensive dataset.
ACEV: Unsupervised Intersecting Manifold Segmentation using Adaptation to Angular Change of Eigenvectors in Intrinsic Dimension
Boral, Subhadip, Pal, Rikathi, Ghosh, Ashish
Intersecting manifold segmentation has been a focus of research, where individual manifolds, that intersect with other manifolds, are separated to discover their distinct properties. The proposed method is based on the intuition that when a manifold in $D$ dimensional space with an intrinsic dimension of $d$ intersects with another manifold, the data variance grows in more than $d$ directions. The proposed method measures local data variances and determines their vector directions. It counts the number of vectors with non-zero variance, which determines the manifold's intrinsic dimension. For detection of the intersection region, the method adapts to the changes in the angular gaps between the corresponding direction vectors of the child and parent using exponential moving averages using a tree structure construction. Accordingly, it includes those data points in the same manifold whose neighborhood is within the adaptive angular difference and eventually identifies the data points in the intersection area of manifolds. Data points whose inclusion in the neighborhood-identified data points increases their intrinsic dimensionality are removed based on data variance and distance. The proposed method performs better than 18 SOTA manifold segmentation methods in ARI and NMI scores over 14 real-world datasets with lesser time complexity and better stability.
StreamEnsemble: Predictive Queries over Spatiotemporal Streaming Data
Chaves, Anderson, Ogasawara, Eduardo, Valduriez, Patrick, Porto, Fabio
Predictive queries over spatiotemporal (ST) stream data pose significant data processing and analysis challenges. ST data streams involve a set of time series whose data distributions may vary in space and time, exhibiting multiple distinct patterns. In this context, assuming a single machine learning model would adequately handle such variations is likely to lead to failure. To address this challenge, we propose StreamEnsemble, a novel approach to predictive queries over ST data that dynamically selects and allocates Machine Learning models according to the underlying time series distributions and model characteristics. Our experimental evaluation reveals that this method markedly outperforms traditional ensemble methods and single model approaches in terms of accuracy and time, demonstrating a significant reduction in prediction error of more than 10 times compared to traditional approaches.
RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models
Chen, Shuhao, Jiang, Weisen, Lin, Baijiong, Kwok, James T., Zhang, Yu
Recent works show that assembling multiple off-the-shelf large language models (LLMs) can harness their complementary abilities. To achieve this, routing is a promising method, which learns a router to select the most suitable LLM for each query. However, existing routing models are ineffective when multiple LLMs perform well for a query. To address this problem, in this paper, we propose a method called query-based Router by Dual Contrastive learning (RouterDC). The RouterDC model consists of an encoder and LLM embeddings, and we propose two contrastive learning losses to train the RouterDC model. Experimental results show that RouterDC is effective in assembling LLMs and largely outperforms individual top-performing LLMs as well as existing routing methods on both in-distribution (+2.76\%) and out-of-distribution (+1.90\%) tasks. Source code is available at https://github.com/shuhao02/RouterDC.
Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods
Abdullah, Abdulhady Abas, Ahmed, Aram Mahmood, Rashid, Tarik, Veisi, Hadi, Rassul, Yassin Hussein, Hassan, Bryar, Fattah, Polla, Ali, Sabat Abdulhameed, Shamsaldin, Ahmed S.
Speech signal processing is a cornerstone of modern communication technologies, tasked with improving the clarity and comprehensibility of audio data in noisy environments. The primary challenge in this field is the effective separation and recognition of speech from background noise, crucial for applications ranging from voice-activated assistants to automated transcription services. The quality of speech recognition directly impacts user experience and accessibility in technology-driven communication. This review paper explores advanced clustering techniques, particularly focusing on the Kernel Fuzzy C-Means (KFCM) method, to address these challenges. Our findings indicate that KFCM, compared to traditional methods like K-Means (KM) and Fuzzy C-Means (FCM), provides superior performance in handling non-linear and non-stationary noise conditions in speech signals. The most notable outcome of this review is the adaptability of KFCM to various noisy environments, making it a robust choice for speech enhancement applications. Additionally, the paper identifies gaps in current methodologies, such as the need for more dynamic clustering algorithms that can adapt in real time to changing noise conditions without compromising speech recognition quality. Key contributions include a detailed comparative analysis of current clustering algorithms and suggestions for further integrating hybrid models that combine KFCM with neural networks to enhance speech recognition accuracy. Through this review, we advocate for a shift towards more sophisticated, adaptive clustering techniques that can significantly improve speech enhancement and pave the way for more resilient speech processing systems.
Continual learning with task specialist
Solomon, Indu, Aung, Aye Phyu Phyu, Kumar, Uttam, Jayavelu, Senthilnath
Continual learning (CL) adapt the deep learning scenarios with timely updated datasets. However, existing CL models suffer from the catastrophic forgetting issue, where new knowledge replaces past learning. In this paper, we propose Continual Learning with Task Specialists (CLTS) to address the issues of catastrophic forgetting and limited labelled data in real-world datasets by performing class incremental learning of the incoming stream of data. The model consists of Task Specialists (T S) and Task Predictor (T P ) with pre-trained Stable Diffusion (SD) module. Here, we introduce a new specialist to handle a new task sequence and each T S has three blocks; i) a variational autoencoder (V AE) to learn the task distribution in a low dimensional latent space, ii) a K-Means block to perform data clustering and iii) Bootstrapping Language-Image Pre-training (BLIP ) model to generate a small batch of captions from the input data. These captions are fed as input to the pre-trained stable diffusion model (SD) for the generation of task samples. The proposed model does not store any task samples for replay, instead uses generated samples from SD to train the T P module. A comparison study with four SOTA models conducted on three real-world datasets shows that the proposed model outperforms all the selected baselines