Goto

Collaborating Authors

 Clustering


Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning

arXiv.org Artificial Intelligence

The distribution of illustrations on social media, such as Twitter and Pixiv has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of illustrations plays an important role in user preferences. Classifying illustrations by atmosphere can be helpful for recommendations and searches. However, assigning clear labels to the elusive "atmosphere" and conventional supervised classification is not always practical. Furthermore, even images with similar colors, edges, and low-level features may not have similar atmospheres, making classification based on low-level features challenging. In this paper, this problem is solved using both supervised and unsupervised learning with pseudo-labels. The feature vectors are obtained using the supervised method with pseudo-labels that contribute to an ambiguous atmosphere. Further, clustering is performed based on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.


Likely, Light, and Accurate Context-Free Clusters-based Trajectory Prediction

arXiv.org Artificial Intelligence

Autonomous systems in the road transportation network require intelligent mechanisms that cope with uncertainty to foresee the future. In this paper, we propose a multi-stage probabilistic approach for trajectory forecasting: trajectory transformation to displacement space, clustering of displacement time series, trajectory proposals, and ranking proposals. We introduce a new deep feature clustering method, underlying self-conditioned GAN, which copes better with distribution shifts than traditional methods. Additionally, we propose novel distance-based ranking proposals to assign probabilities to the generated trajectories that are more efficient yet accurate than an auxiliary neural network. The overall system surpasses context-free deep generative models in human and road agents trajectory data while performing similarly to point estimators when comparing the most probable trajectory.


A Bi-level Nonlinear Eigenvector Algorithm for Wasserstein Discriminant Analysis

arXiv.org Artificial Intelligence

As widely used feature extraction approaches in machine learning, dimensionality reduction (DR) methods [53, 7, 20, 12] learn projections such that the projected lower dimensional subspaces maintain the coherent structure of datasets and reduce computational costs of classification or clustering. The linear projection obtained from linear DR methods takes the form of a matrix such that the embedding to the lower dimensional subspace only involves matrix multiplications. Due to such ease in interpretation and implementation, linear DR methods are often the favored choice among numerous DR methods. For example, principal component analysis (PCA) [24] seeks to find a linear projection that preserves the dataset's variation and is one of the most common and well-known DR methods. Other well-known DR methods include Fisher linear discriminant analysis (LDA) [24] to take into account the information of classes and compute a linear projection that best separates different classes, and Mahalanobis metric learning [35] to seek a distance metric that better models the relationship among dataset from a linear projection. Wasserstein discriminant analysis (WDA) [19] is a supervised linear DR that is based on the use of regularized Wasserstein distances [13] as a distance metric. Similar to Fisher linear discriminant analysis (LDA), WDA seeks a projection matrix to maximize the dispersion of projected points between different classes and minimize the dispersion of projected points within same classes.


Evaluating the Moral Beliefs Encoded in LLMs

arXiv.org Artificial Intelligence

This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.


DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm

arXiv.org Artificial Intelligence

With the development of Big data technology, data analysis has become increasingly important. Traditional clustering algorithms such as K-means are highly sensitive to the initial centroid selection and perform poorly on non-convex datasets. In this paper, we address these problems by proposing a data-driven Bregman divergence parameter optimization clustering algorithm (DBGSA), which combines the Universal Gravitational Algorithm to bring similar points closer in the dataset. We construct a gravitational coefficient equation with a special property that gradually reduces the influence factor as the iteration progresses. Furthermore, we introduce the Bregman divergence generalized power mean information loss minimization to identify cluster centers and build a hyperparameter identification optimization model, which effectively solves the problems of manual adjustment and uncertainty in the improved dataset. Extensive experiments are conducted on four simulated datasets and six real datasets. The results demonstrate that DBGSA significantly improves the accuracy of various clustering algorithms by an average of 63.8\% compared to other similar approaches like enhanced clustering algorithms and improved datasets. Additionally, a three-dimensional grid search was established to compare the effects of different parameter values within threshold conditions, and it was discovered the parameter set provided by our model is optimal. This finding provides strong evidence of the high accuracy and robustness of the algorithm.


Towards Resolving Word Ambiguity with Word Embeddings

arXiv.org Artificial Intelligence

Ambiguity is ubiquitous in natural language. Resolving ambiguous meanings is especially important in information retrieval tasks. While word embeddings carry semantic information, they fail to handle ambiguity well. Transformer models have been shown to handle word ambiguity for complex queries, but they cannot be used to identify ambiguous words, e.g. for a 1-word query. Furthermore, training these models is costly in terms of time, hardware resources, and training data, prohibiting their use in specialized environments with sensitive data. Word embeddings can be trained using moderate hardware resources. This paper shows that applying DBSCAN clustering to the latent space can identify ambiguous words and evaluate their level of ambiguity. An automatic DBSCAN parameter selection leads to high-quality clusters, which are semantically coherent and correspond well to the perceived meanings of a given word.


Federated K-Means Clustering via Dual Decomposition-based Distributed Optimization

arXiv.org Artificial Intelligence

The use of distributed optimization in machine learning can be motivated either by the resulting preservation of privacy or the increase in computational efficiency. On the one hand, training data might be stored across multiple devices. Training a global model within a network where each node only has access to its confidential data requires the use of distributed algorithms. Even if the data is not confidential, sharing it might be prohibitive due to bandwidth limitations. On the other hand, the ever-increasing amount of available data leads to large-scale machine learning problems. By splitting the training process across multiple nodes its efficiency can be significantly increased. This paper aims to demonstrate how dual decomposition can be applied for distributed training of $ K $-means clustering problems. After an overview of distributed and federated machine learning, the mixed-integer quadratically constrained programming-based formulation of the $ K $-means clustering training problem is presented. The training can be performed in a distributed manner by splitting the data across different nodes and linking these nodes through consensus constraints. Finally, the performance of the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm are evaluated on a set of benchmark problems. While the mixed-integer programming-based formulation of the clustering problems suffers from weak integer relaxations, the presented approach can potentially be used to enable an efficient solution in the future, both in a central and distributed setting.


SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural Topic Models on Social Media Opinion Mining

arXiv.org Artificial Intelligence

Recent research in opinion mining proposed word embedding-based topic modeling methods that provide superior coherence compared to traditional topic modeling. In this paper, we demonstrate how these methods can be used to display correlated topic models on social media texts using SocialVisTUM, our proposed interactive visualization toolkit. It displays a graph with topics as nodes and their correlations as edges. Further details are displayed interactively to support the exploration of large text collections, e.g., representative words and sentences of topics, topic and sentiment distributions, hierarchical topic clustering, and customizable, predefined topic labels. The toolkit optimizes automatically on custom data for optimal coherence. We show a working instance of the toolkit on data crawled from English social media discussions about organic food consumption. The visualization confirms findings of a qualitative consumer research study. SocialVisTUM and its training procedures are accessible online.


Clustering-based Criticality Analysis for Testing of Automated Driving Systems

arXiv.org Artificial Intelligence

With the implementation of the new EU regulation 2022/1426 regarding the type-approval of the automated driving system (ADS) of fully automated vehicles, scenario-based testing has gained significant importance in evaluating the performance and safety of advanced driver assistance systems and automated driving systems. However, the exploration and generation of concrete scenarios from a single logical scenario can often lead to a number of similar or redundant scenarios, which may not contribute to the testing goals. This paper focuses on the the goal to reduce the scenario set by clustering concrete scenarios from a single logical scenario. By employing clustering techniques, redundant and uninteresting scenarios can be identified and eliminated, resulting in a representative scenario set. This reduction allows for a more focused and efficient testing process, enabling the allocation of resources to the most relevant and critical scenarios. Furthermore, the identified clusters can provide valuable insights into the scenario space, revealing patterns and potential problems with the system's behavior.


Classification of Consumer Belief Statements From Social Media

arXiv.org Artificial Intelligence

Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.