Clustering
Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning
Dong, Xingping, Shen, Jianbing, Shao, Ling
The pioneering method for unsupervised meta-learning, CACTUs, is a clustering-based approach with pseudo-labeling. This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data. However, it often suffers from label inconsistency or limited diversity, which leads to poor performance. In this work, we prove that the core reason for this is lack of a clustering-friendly property in the embedding space. We address this by minimizing the inter- to intra-class similarity ratio to provide clustering-friendly embedding features, and validate our approach through comprehensive experiments. Note that, despite only utilizing a simple clustering algorithm (k-means) in our embedding space to obtain the pseudo-labels, we achieve significant improvement. Moreover, we adopt a progressive evaluation mechanism to obtain more diverse samples in order to further alleviate the limited diversity problem. Finally, our approach is also model-agnostic and can easily be integrated into existing supervised methods. To demonstrate its generalization ability, we integrate it into two representative algorithms: MAML and EP. The results on three main few-shot benchmarks clearly show that the proposed method achieves significant improvement compared to state-of-the-art models. Notably, our approach also outperforms the corresponding supervised method in two tasks.
Regime-based Implied Stochastic Volatility Model for Crypto Option Pricing
Saef, Danial, Wang, Yuanrong, Aste, Tomaso
The increasing adoption of Digital Assets (DAs), such as Bitcoin (BTC), rises the need for accurate option pricing models. Yet, existing methodologies fail to cope with the volatile nature of the emerging DAs. Many models have been proposed to address the unorthodox market dynamics and frequent disruptions in the microstructure caused by the non-stationarity, and peculiar statistics, in DA markets. However, they are either prone to the curse of dimensionality, as additional complexity is required to employ traditional theories, or they overfit historical patterns that may never repeat. Instead, we leverage recent advances in market regime (MR) clustering with the Implied Stochastic Volatility Model (ISVM). Time-regime clustering is a temporal clustering method, that clusters the historic evolution of a market into different volatility periods accounting for non-stationarity. ISVM can incorporate investor expectations in each of the sentiment-driven periods by using implied volatility (IV) data. In this paper, we applied this integrated time-regime clustering and ISVM method (termed MR-ISVM) to high-frequency data on BTC options at the popular trading platform Deribit. We demonstrate that MR-ISVM contributes to overcome the burden of complex adaption to jumps in higher order characteristics of option pricing models. This allows us to price the market based on the expectations of its participants in an adaptive fashion.
Self-supervised similarity models based on well-logging data
Egorov, Sergey, Gevorgyan, Narek, Zaytsev, Alexey
Adopting data-based approaches leads to model improvement in numerous Oil&Gas logging data processing problems. These improvements become even more sound due to new capabilities provided by deep learning. However, usage of deep learning is limited to areas where researchers possess large amounts of high-quality data. We present an approach that provides universal data representations suitable for solutions to different problems for different oil fields with little additional data. Our approach relies on the self-supervised methodology for sequential logging data for intervals from well, so it also doesn't require labelled data from the start. For validation purposes of the received representations, we consider classification and clusterization problems. We as well consider the transfer learning scenario. We found out that using the variational autoencoder leads to the most reliable and accurate models. approach We also found that a researcher only needs a tiny separate data set for the target oil field to solve a specific problem on top of universal representations.
Graph clustering with Boltzmann machines
Miasnikof, Pierre, Bagherbeik, Mohammad, Sheikholeslami, Ali
Graph clustering is the process of grouping vertices into densely connected sets called clusters. We tailor two mathematical programming formulations from the literature, to this problem. In doing so, we obtain a heuristic approximation to the intra-cluster density maximization problem. We use two variations of a Boltzmann machine heuristic to obtain numerical solutions. For benchmarking purposes, we compare solution quality and computational performances to those obtained using a commercial solver, Gurobi. We also compare clustering quality to the clusters obtained using the popular Louvain modularity maximization method. Our initial results clearly demonstrate the superiority of our problem formulations. They also establish the superiority of the Boltzmann machine over the traditional exact solver. In the case of smaller less complex graphs, Boltzmann machines provide the same solutions as Gurobi, but with solution times that are orders of magnitude lower. In the case of larger and more complex graphs, Gurobi fails to return meaningful results within a reasonable time frame. Finally, we also note that both our clustering formulations, the distance minimization and $K$-medoids, yield clusters of superior quality to those obtained with the Louvain algorithm.
Improving Image Clustering through Sample Ranking and Its Application to remote--sensing images
Image clustering is a very useful technique that is widely applied to various areas, including remote sensing. Recently, visual representations by self-supervised learning have greatly improved the performance of image clustering. To further improve the well-trained clustering models, this paper proposes a novel method by first ranking samples within each cluster based on the confidence in their belonging to the current cluster and then using the ranking to formulate a weighted cross-entropy loss to train the model. For ranking the samples, we developed a method for computing the likelihood of samples belonging to the current clusters based on whether they are situated in densely populated neighborhoods, while for training the model, we give a strategy for weighting the ranked samples. We present extensive experimental results that demonstrate that the new technique can be used to improve the State-of-the-Art image clustering models, achieving accuracy performance gains ranging from $2.1\%$ to $15.9\%$. Performing our method on a variety of datasets from remote sensing, we show that our method can be effectively applied to remote--sensing images.
Clustering by Direct Optimization of the Medoid Silhouette
Lenssen, Lars, Schubert, Erich
The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, and provide two fast versions for the direct optimization. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvements FasterPAM. One of the versions guarantees equal results to the original variant and provides a run speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k$=100, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm.
5-Star Hotel Customer Satisfaction Analysis Using Hybrid Methodology
Yoo, Yongmin, Park, Yeongjoon, Lim, Dongjin, Seo, Deaho
Due to the rapid development of non-face-to-face services due to the corona virus, commerce through the Internet, such as sales and reservations, is increasing very rapidly. Consumers also post reviews, suggestions, or judgments about goods or services on the website. The review data directly used by consumers provides positive feedback and nice impact to consumers, such as creating business value. Therefore, analysing review data is very important from a marketing point of view. Our research suggests a new way to find factors for customer satisfaction through review data. We applied a method to find factors for customer satisfaction by mixing and using the data mining technique, which is a big data analysis method, and the natural language processing technique, which is a language processing method, in our research. Unlike many studies on customer satisfaction that have been conducted in the past, our research has a novelty of the thesis by using various techniques. And as a result of the analysis, the results of our experiments were very accurate.
VAESim: A probabilistic approach for self-supervised prototype discovery
Ferrante, Matteo, Boccato, Tommaso, Spasov, Simeon, Duggento, Andrea, Toschi, Nicola
In medicine, curated image datasets often employ discrete labels to describe what is known to be a continuous spectrum of healthy to pathological conditions, such as e.g. the Alzheimer's Disease Continuum or other areas where the image plays a pivotal point in diagnosis. We propose an architecture for image stratification based on a conditional variational autoencoder. Our framework, VAESim, leverages a continuous latent space to represent the continuum of disorders and finds clusters during training, which can then be used for image/patient stratification. The core of the method learns a set of prototypical vectors, each associated with a cluster. First, we perform a soft assignment of each data sample to the clusters. Then, we reconstruct the sample based on a similarity measure between the sample embedding and the prototypical vectors of the clusters. To update the prototypical embeddings, we use an exponential moving average of the most similar representations between actual prototypes and samples in the batch size. We test our approach on the MNIST-handwritten digit dataset and on a medical benchmark dataset called PneumoniaMNIST. We demonstrate that our method outperforms baselines in terms of kNN accuracy measured on a classification task against a standard VAE (up to 15% improvement in performance) in both datasets, and also performs at par with classification models trained in a fully supervised way. We also demonstrate how our model outperforms current, end-to-end models for unsupervised stratification.
Graph Representation Learning for Energy Demand Data: Application to Joint Energy System Planning under Emissions Constraints
Brenner, Aron, Khorramfar, Rahman, Mallapragada, Dharik, Amin, Saurabh
A rapid transformation of current electric power and natural gas (NG) infrastructure is imperative to meet the mid-century goal of CO2 emissions reduction requires. This necessitates a long-term planning of the joint power-NG system under representative demand and supply patterns, operational constraints, and policy considerations. Our work is motivated by the computational and practical challenges associated with solving the generation and transmission expansion problem (GTEP) for joint planning of power-NG systems. Specifically, we focus on efficiently extracting a set of representative days from power and NG data in respective networks and using this set to reduce the computational burden required to solve the GTEP. We propose a Graph Autoencoder for Multiple time resolution Energy Systems (GAMES) to capture the spatio-temporal demand patterns in interdependent networks and account for differences in the temporal resolution of available data. The resulting embeddings are used in a clustering algorithm to select representative days. We evaluate the effectiveness of our approach in solving a GTEP formulation calibrated for the joint power-NG system in New England. This formulation accounts for the physical interdependencies between power and NG systems, including the joint emissions constraint. Our results show that the set of representative days obtained from GAMES not only allows us to tractably solve the GTEP formulation, but also achieves a lower cost of implementing the joint planning decisions.
Codeless Machine Learning for Auditors
In the new era of digitalisation, there are emerging challenges relating to the new technology risks while fraud risk increases, adopting new and innovative methods in line with technological progress. In this environment, audit should develop data analytics skills that are beyond traditional risk monitoring and fraud detection tools to meet stakeholders' evolving expectations. Clustering is an unsupervised Machine Learning (ML) algorithm (i.e. an algorithm that learns and improves from experience, without input from users) that looks for patterns in data by dividing it into clusters. These clusters are created such that the points are homogenous within the cluster and heterogenous across clusters. Clustering is commonly used in market segmentation and several areas of marketing analytics as well as in fraud detection.