
 Heiter, Edith


Large Language Models Reflect the Ideology of their Creators

arXiv.org Artificial Intelligence

Large language models (LLMs) are trained on vast amounts of data to generate natural language, enabling them to perform tasks like text summarization and question answering. These models have become popular in artificial intelligence (AI) assistants like ChatGPT and already play an influential role in how humans access information. However, the behavior of LLMs varies depending on their design, training, and use. In this paper, we uncover notable diversity in the ideological stance exhibited by different LLMs and across the languages in which they are accessed. We do this by prompting a diverse panel of popular LLMs to describe a large number of prominent and controversial personalities from recent world history, both in English and in Chinese. By identifying and analyzing the moral assessments reflected in the generated descriptions, we find consistent normative differences between how the same LLM responds in Chinese compared to English. Similarly, we identify normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. Furthermore, popularly hypothesized disparities in political goals among Western models are reflected in significant normative differences related to inclusion, social inequality, and political scandals. Our results show that the ideological stance of an LLM often reflects the worldview of its creators. This raises important concerns around technological and regulatory efforts with the stated aim of making LLMs ideologically 'unbiased', and it poses risks for political instrumentalization.
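
To make the methodology concrete, here is a minimal sketch of the kind of bilingual panel probe the abstract describes. Everything below is an illustrative placeholder, not the authors' pipeline: `query_llm` stands in for whatever API each model in the panel exposes, `moral_score` for the step that extracts a moral assessment from a generated description, and the two personalities are hypothetical examples.

```python
# Hedged sketch of a bilingual LLM-panel probe; names are illustrative.

PERSONALITIES = ["Edward Snowden", "Winston Churchill"]  # example figures

PROMPTS = {
    "en": "Tell me about {name}.",
    "zh": "请介绍一下{name}。",  # Chinese: "Please tell me about {name}."
}

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around one model's chat API; replace with a
    real client call for each LLM in the panel."""
    raise NotImplementedError

def moral_score(description: str) -> float:
    """Hypothetical moral-assessment extractor returning a score in
    [-1, 1]; could be a trained classifier or a second LLM as judge."""
    raise NotImplementedError

def stance_profile(models: list[str]) -> dict[tuple[str, str], float]:
    """Mean moral assessment per (model, language) pair; comparing these
    across languages and model origins is the comparison the abstract
    describes."""
    profile = {}
    for model in models:
        for lang, template in PROMPTS.items():
            scores = [moral_score(query_llm(model, template.format(name=n)))
                      for n in PERSONALITIES]
            profile[(model, lang)] = sum(scores) / len(scores)
    return profile
```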


Pattern or Artifact? Interactively Exploring Embedding Quality with TRACE

arXiv.org Artificial Intelligence

This paper presents TRACE, a tool to analyze the quality of 2D embeddings generated through dimensionality reduction techniques. Dimensionality reduction methods often prioritize preserving either local neighborhoods or global distances, but insights from visual structures can be misleading if the objective has not been achieved uniformly. TRACE addresses this challenge by providing a scalable and extensible pipeline for computing both local and global quality measures. The interactive browser-based interface allows users to explore various embeddings while visually assessing the pointwise embedding quality. The interface also facilitates in-depth analysis by highlighting high-dimensional nearest neighbors for any group of points and displaying high-dimensional distances between points. TRACE enables analysts to make informed decisions regarding the most suitable dimensionality reduction method for their specific use case, by showing where, and to what degree, structure is preserved in the reduced space.
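
As a concrete example of a local quality measure of the kind such a pipeline computes, the following sketch (an illustration, not TRACE's API) scores each point by how many of its k high-dimensional nearest neighbors survive in the 2D embedding:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_preservation(X_hd, X_2d, k=15):
    """Fraction of each point's k high-dimensional nearest neighbors that
    also appear among its k nearest neighbors in the 2D embedding
    (1.0 = locally perfectly preserved, 0.0 = fully scrambled)."""
    idx_hd = NearestNeighbors(n_neighbors=k + 1).fit(X_hd) \
        .kneighbors(X_hd, return_distance=False)[:, 1:]  # drop self-match
    idx_2d = NearestNeighbors(n_neighbors=k + 1).fit(X_2d) \
        .kneighbors(X_2d, return_distance=False)[:, 1:]
    return np.array([len(set(a) & set(b)) / k
                     for a, b in zip(idx_hd, idx_2d)])
```

Mapping such a pointwise score to color in the 2D scatterplot is one way to assess quality visually, in the spirit of the interface described above.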


Topologically Regularized Data Embeddings

arXiv.org Artificial Intelligence

Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure or the fact that the data lies along a tree- or graph-structured topology. However, generic methods to ensure such structure is salient in the low-dimensional representations are lacking. This negatively impacts the interpretability of low-dimensional embeddings, and plausibly also downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology to incorporate topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.
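
To make the joint objective concrete, here is a minimal PyTorch sketch of one 0-dimensional topological loss that encourages a k-cluster structure, using a standard trick from topological optimization: compute the persistence pairing on detached distances, then differentiate through the paired edge lengths. This illustrates the general recipe under assumptions of my own; it does not reproduce the paper's loss functions for cycles or bifurcations.

```python
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def zero_dim_deaths(emb: torch.Tensor) -> torch.Tensor:
    """0-dim Vietoris-Rips death times of a point cloud = the edge
    lengths of its Euclidean minimum spanning tree."""
    D = torch.cdist(emb, emb)
    mst = minimum_spanning_tree(D.detach().cpu().numpy()).tocoo()
    rows = torch.as_tensor(mst.row, dtype=torch.long)
    cols = torch.as_tensor(mst.col, dtype=torch.long)
    return D[rows, cols]  # differentiable w.r.t. embedding coordinates

def cluster_loss(emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Encourage k clusters: shrink the small deaths (within-cluster
    merges) and grow the k-1 largest ones (gaps between clusters)."""
    deaths, _ = torch.sort(zero_dim_deaths(emb), descending=True)
    return deaths[k - 1:].sum() - deaths[:k - 1].sum()

# Joint optimization, with `embedding_loss` standing in for e.g. an MDS
# stress or t-SNE term, and lam trading it off against the prior:
#   loss = embedding_loss(emb) + lam * cluster_loss(emb, k)
```

On its own the topological term would push clusters apart without bound; it is the combination with the embedding loss that keeps the result faithful to local proximities.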


Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors

arXiv.org Artificial Intelligence

Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method that conditions the high-dimensional similarities instead of the low-dimensional similarities, and that stores within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement does not always materialize. We argue that revised ct-SNE is nonetheless preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.
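
The following sketch illustrates the central revision under simplifying assumptions of my own (a fixed Gaussian bandwidth rather than t-SNE's perplexity calibration, and an assumed discount factor beta): collect within- and across-label nearest neighbors as separate sets, and down-weight the within-label similarities so that label structure is factored out before the low-dimensional optimization.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def conditioned_similarities(X, labels, k=10, beta=0.1):
    """Per point: k nearest within-label and k nearest across-label
    neighbors, with within-label pairs discounted by beta < 1.
    Illustrative simplification, not the paper's exact derivation."""
    n = len(X)
    D = pairwise_distances(X)
    P = np.zeros((n, n))
    for i in range(n):
        for mask, w in ((labels == labels[i], beta),
                        (labels != labels[i], 1.0)):
            pool = np.flatnonzero(mask)
            pool = pool[pool != i]
            nn = pool[np.argsort(D[i, pool])[:k]]
            P[i, nn] = w * np.exp(-D[i, nn] ** 2)  # fixed bandwidth
    P = (P + P.T) / 2.0                            # symmetrize, as t-SNE does
    return P / P.sum()                             # joint probabilities
```

Because the conditioning happens in the high-dimensional similarities, the resulting matrix can be handed to standard (accelerated) t-SNE optimizers unchanged, which is where the scalability gain comes from.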


Factoring out prior knowledge from low-dimensional embeddings

arXiv.org Machine Learning

Embedding high-dimensional data into low-dimensional spaces, such as with tSNE [van der Maaten and Hinton, 2008] or UMAP [McInnes et al., 2018], allows us to visually inspect and discover meaningful structure in the data that would otherwise be difficult or impossible to see. These methods are as popular as they are useful, but at the same time limited in that they are one-shot only: they embed the data as is, and that is that. If the resulting embedding reveals novel knowledge, all is well; but what if the structure that dominates it is something we already know, something we are no longer interested in, or what if we want to discover whether the data has meaningful structure other than what the first result revealed? In word embeddings, for example, we may already know that certain words are synonyms, while in single-cell sequencing we may want to discover structure other than known cell types, or factor out family relationships. The question at hand is therefore: how can we obtain low-dimensional embeddings that reveal structure beyond what we already know, i.e., how can we factor out prior knowledge from low-dimensional embeddings? For conditional embeddings, research so far has mostly focused on emphasizing rather than factoring out prior knowledge [Barshan et al., 2011, De Ridder et al., 2003, Hanhijärvi et al., 2009], with conditional tSNE as a notable exception, which, however, can only factor out label information [Kang et al., 2019]. Here, we propose two techniques for factoring out a more general form of prior knowledge from low-dimensional embeddings of arbitrary data types. In particular, we consider background knowledge in the form of pairwise distances between samples. This formulation allows us to cover a plethora of practical instances, including labels, clustering structure, family trees, and user-defined distances, but also, and especially important for unstructured data, kernel matrices.
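
One natural instantiation of this idea, sketched below with an assumed discount factor rather than the paper's exact formulation: pairs that the background distances already deem close have their data similarity suppressed, so the embedding is driven by structure the prior does not explain.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def factored_similarities(X, D_prior, sigma=1.0, tau=1.0):
    """Discount data similarity by (1 - exp(-D_prior^2 / tau)), a factor
    that vanishes for prior-close pairs and approaches 1 for
    prior-distant ones. One illustrative choice among many."""
    D = pairwise_distances(X)
    P = np.exp(-D ** 2 / sigma) * (1.0 - np.exp(-D_prior ** 2 / tau))
    np.fill_diagonal(P, 0.0)
    P = (P + P.T) / 2.0   # symmetrize
    return P / P.sum()    # joint similarity distribution over pairs
```

The resulting matrix can then replace the usual similarities in a tSNE- or UMAP-style optimizer; labels, clustering structure, family trees, user-defined distances, and kernel matrices all enter simply as different choices of D_prior.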