Clustering
Dirichlet process mixture models for non-stationary data streams
In recent years, we have seen a handful of work on inference algorithms over non-stationary data streams. Given their flexibility, Bayesian non-parametric models are a good candidate for these scenarios. However, reliable streaming inference under the concept drift phenomenon is still an open problem for these models. In this work, we propose a variational inference algorithm for Dirichlet process mixture models. Our proposal deals with the concept drift by including an exponential forgetting over the prior global parameters. Our algorithm allows to adapt the learned model to the concept drifts automatically. We perform experiments in both synthetic and real data, showing that the proposed model is competitive with the state-of-the-art algorithms in the density estimation problem, and it outperforms them in the clustering problem.
Towards Mining Creative Thinking Patterns from Educational Data
Creativity, i.e., the process of generating and developing fresh and original ideas or products that are useful or effective, is a valuable skill in a variety of domains. Creativity is called an essential 21st-century skill that should be taught in schools. The use of educational technology to promote creativity is an active study field, as evidenced by several studies linking creativity in the classroom to beneficial learning outcomes. Despite the burgeoning body of research on adaptive technology for education, mining creative thinking patterns from educational data remains a challenging task. In this paper, to address this challenge, we put the first step towards formalizing educational knowledge by constructing a domain-specific Knowledge Base to identify essential concepts, facts, and assumptions in identifying creative patterns. We then introduce a pipeline to contextualize the raw educational data, such as assessments and class activities. Finally, we present a rule-based approach to learning from the Knowledge Base, and facilitate mining creative thinking patterns from contextualized data and knowledge. We evaluate our approach with real-world datasets and highlight how the proposed pipeline can help instructors understand creative thinking patterns from students' activities and assessment tasks.
Wasserstein $K$-means for clustering probability distributions
Zhuang, Yubo, Chen, Xiaohui, Yang, Yun
Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP relaxed Wasserstein $K$-means can achieve exact recovery given the clusters are well-separated under the $2$-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based $K$-means can achieve better classification performance over the standard centroid-based $K$-means for clustering probability distributions and images.
Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment
González-Gutiérrez, César, Primadhanty, Audi, Cazzaro, Francesco, Quattoni, Ariadna
Annotating large collections of textual data can be time consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. With this goal in mind, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets testing a variety of models and representations. Using our proposed metric we show that an efficient representation for a task (i.e. one that enables learning from few samples) is a representation that induces a good alignment between latent input structure and class structure.
Time-aware topic identification in social media with pre-trained language models: A case study of electric vehicles
Jeong, Byeongki, Yoon, Janghyeok, Choi, Jaewoong
Recent extensively competitive business environment makes companies to keep their eyes on social media, as there is a growing recognition over customer languages (e.g., needs, interests, and complaints) as source of future opportunities. This research avenue analysing social media data has received much attention in academia, but their utilities are limited as most of methods provide retrospective results. Moreover, the increasing number of customer-generated contents and rapidly varying topics have made the necessity of time-aware topic evolution analyses. Recently, several researchers have showed the applicability of pre-trained semantic language models to social media as an input feature, but leaving limitations in understanding evolving topics. In this study, we propose a time-aware topic identification approach with pre-trained language models. The proposed approach consists of two stages: the dynamics-focused function for tracking time-varying topics with language models and the emergence-scoring function to examine future promising topics. Here we apply the proposed approach to reddit data on electric vehicles, and our findings highlight the feasibility of capturing emerging customer topics from voluminous social media in a time-aware manner.
Word Sense Induction with Hierarchical Clustering and Mutual Information Maximization
Abdine, Hadi, Eddine, Moussa Kamal, Vazirgiannis, Michalis, Buscaldi, Davide
Word sense induction (WSI) is a difficult problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e. meanings). Recent work achieves significant results on the WSI task by pre-training a language model that can exclusively disambiguate word senses, whereas others employ previously pre-trained language models in conjunction with additional strategies to induce senses. In this paper, we propose a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). The IIC is used to train a small model to optimize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher quality vector representation to be used in the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically demonstrate that, in certain cases, our approach outperforms prior WSI state-of-the-art methods, while in others, it achieves a competitive performance.
Client Error Clustering Approaches in Content Delivery Networks (CDN)
Birihanu, Ermiyas, Mahmud, Jiyan, Kiss, Péter, Kamuzora, Adolf, Skaf, Wadie, Horváth, Tomáš, Jursonovics, Tamás, Pogrzeba, Peter, Lendák, Imre
Content delivery networks (CDNs) are the backbone of the Internet and are key in delivering high quality video on demand (VoD), web content and file services to billions of users. CDNs usually consist of hierarchically organized content servers positioned as close to the customers as possible. CDN operators face a significant challenge when analyzing billions of web server and proxy logs generated by their systems. The main objective of this study was to analyze the applicability of various clustering methods in CDN error log analysis. We worked with real-life CDN proxy logs, identified key features included in the logs (e.g., content type, HTTP status code, time-of-day, host) and clustered the log lines corresponding to different host types offering live TV, video on demand, file caching and web content. Our experiments were run on a dataset consisting of proxy logs collected over a 7-day period from a single, physical CDN server running multiple types of services (VoD, live TV, file). The dataset consisted of 2.2 billion log lines. Our analysis showed that CDN error clustering is a viable approach towards identifying recurring errors and improving overall quality of service.
HAL-X: Scalable hierarchical clustering for rapid and tunable single-cell analysis
Author summary Modern experimental techniques such as mass cytometry (CyTOF) make it possible to quickly make high-dimensional measurements on upwards of tens of millions of cells with single-cell resolution. An important problem in biology is to use these measurements to group together similar cells to identify biologically meaningful cell types that can be used to study disease progression, drug responses, and other clinical outcomes. However, the size and complexity of experimental data sets makes this problem computationally and theoretically extremely difficult. Here, we present a new algorithm HAL-X that accurately and quickly identifies cell clusters from biological data. Importantly, our algorithm does not require large amounts of memory. This eliminates the need for specialized high-end computing resources, allowing biologists to quickly analyze their data using a standard laptop or desktop computer.
Self-Supervised Visual Representation Learning with Semantic Grouping
Wen, Xin, Zhao, Bingchen, Zheng, Anlin, Zhang, Xiangyu, Qi, Xiaojuan
In this paper, we tackle the problem of learning visual representations from unlabeled scene-centric data. Existing works have demonstrated the potential of utilizing the underlying complex structure within scene-centric data; still, they commonly rely on hand-crafted objectness priors or specialized pretext tasks to build a learning framework, which may harm generalizability. Instead, we propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning. The semantic grouping is performed by assigning pixels to a set of learnable prototypes, which can adapt to each sample by attentive pooling over the feature and form new slots. Based on the learned data-dependent slots, a contrastive objective is employed for representation learning, which enhances the discriminability of features, and conversely facilitates grouping semantically coherent pixels together. Compared with previous efforts, by simultaneously optimizing the two coupled objectives of semantic grouping and contrastive learning, our approach bypasses the disadvantages of hand-crafted priors and is able to learn object/group-level representations from scene-centric images. Experiments show our approach effectively decomposes complex scenes into semantic groups for feature learning and significantly benefits downstream tasks, including object detection, instance segmentation, and semantic segmentation. Code is available at: https://github.com/CVMI-Lab/SlotCon.
A Survey of Methods for Automated Algorithm Configuration
Schede, Elias (a:1:{s:5:"en_US";s:20:"Bielefeld University";}) | Brandt, Jasmin (Department of Computer Science, Paderborn University) | Tornede, Alexander ( Department of Computer Science, Paderborn University,) | Wever, Marcel (Institute of Informatics, LMU Munich) | Bengs, Viktor (Institute of Informatics, LMU Munich) | Hüllermeier, Eyke (Institute of Informatics, LMU Munich) | Tierney, Kevin (Decision and Operation Technologies Group, Bielefeld University)
Algorithm configuration (AC) is concerned with the automated search of the most suitable parameter configuration of a parametrized algorithm. There is currently a wide variety of AC problem variants and methods proposed in the literature. Existing reviews do not take into account all derivatives of the AC problem, nor do they offer a complete classification scheme. To this end, we introduce taxonomies to describe the AC problem and features of configuration methods, respectively. We review existing AC literature within the lens of our taxonomies, outline relevant design choices of configuration approaches, contrast methods and problem variants against each other, and describe the state of AC in industry. Finally, our review provides researchers and practitioners with a look at future research directions in the field of AC.