Goto

Collaborating Authors

 Representation Of Examples


Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces

arXiv.org Artificial Intelligence

The polysemantic nature of synthetic neurons in artificial intelligence language models is currently understood as the result of a necessary superposition of distributed features within the latent space. We propose an alternative approach, geometrically defining a neuron in layer n as a categorical vector space with a non-orthogonal basis, composed of categorical sub-dimensions extracted from preceding neurons in layer n-1. This categorical vector space is structured by the activation space of each neuron and enables, via an intra-neuronal attention process, the identification and utilization of a critical categorical zone for the efficiency of the language model - more homogeneous and located at the intersection of these different categorical sub-dimensions.


From Two Sample Testing to Singular Gaussian Discrimination

arXiv.org Machine Learning

We establish that testing for the equality of two probability measures on a general separable and compact metric space is equivalent to testing for the singularity between two corresponding Gaussian measures on a suitable Reproducing Kernel Hilbert Space. The corresponding Gaussians are defined via the notion of kernel mean and covariance embedding of a probability measure. Discerning two singular Gaussians is fundamentally simpler from an information-theoretic perspective than non-parametric two-sample testing, particularly in high-dimensional settings. Our proof leverages the Feldman-Hajek criterion for singularity/equivalence of Gaussians on Hilbert spaces, and shows that discrepancies between distributions are heavily magnified through their corresponding Gaussian embeddings: at a population level, distinct probability measures lead to essentially separated Gaussian embeddings. This appears to be a new instance of the blessing of dimensionality that can be harnessed for the design of efficient inference tools in great generality.


Self-Balancing, Memory Efficient, Dynamic Metric Space Data Maintenance, for Rapid Multi-Kernel Estimation

arXiv.org Artificial Intelligence

We present a dynamic self-balancing octree data structure that enables efficient neighborhood maintenance in evolving metric spaces, a key challenge in modern machine learning systems. Many learning and generative models operate as dynamical systems whose representations evolve during training, requiring fast, adaptive spatial organization. Our two-parameter octree supports logarithmic-time updates and queries, eliminating the need for costly full rebuilds as data distributions shift. We demonstrate its effectiveness in four areas: (1) accelerating Stein variational gradient descent by supporting more particles with lower overhead; (2) enabling real-time, incremental KNN classification with logarithmic complexity; (3) facilitating efficient, dynamic indexing and retrieval for retrieval-augmented generation; and (4) improving sample efficiency by jointly optimizing input and latent spaces. Across all applications, our approach yields exponential speedups while preserving accuracy, particularly in high-dimensional spaces where maintaining adaptive spatial structure is critical.


Robust Estimation in metric spaces: Achieving Exponential Concentration with a Fr\'echet Median

arXiv.org Machine Learning

There is growing interest in developing statistical estimators that achieve exponential concentration around a population target even when the data distribution has heavier than exponential tails. More recent activity has focused on extending such ideas beyond Euclidean spaces to Hilbert spaces and Riemannian manifolds. In this work, we show that such exponential concentration in presence of heavy tails can be achieved over a broader class of parameter spaces called CAT($\kappa$) spaces, a very general metric space equipped with the minimal essential geometric structure for our purpose, while being sufficiently broad to encompass most typical examples encountered in statistics and machine learning. The key technique is to develop and exploit a general concentration bound for the Fr\'echet median in CAT($\kappa$) spaces. We illustrate our theory through a number of examples, and provide empirical support through simulation studies.


Bregman-Hausdorff divergence: strengthening the connections between computational geometry and machine learning

arXiv.org Artificial Intelligence

The purpose of this paper is twofold. On a technical side, we propose an extension of the Hausdorff distance from metric spaces to spaces equipped with asymmetric distance measures. Specifically, we focus on the family of Bregman divergences, which includes the popular Kullback--Leibler divergence (also known as relative entropy). As a proof of concept, we use the resulting Bregman--Hausdorff divergence to compare two collections of probabilistic predictions produced by different machine learning models trained using the relative entropy loss. The algorithms we propose are surprisingly efficient even for large inputs with hundreds of dimensions. In addition to the introduction of this technical concept, we provide a survey. It outlines the basics of Bregman geometry, as well as computational geometry algorithms. We focus on algorithms that are compatible with this geometry and are relevant for machine learning.


Random Normed k-Means: A Paradigm-Shift in Clustering within Probabilistic Metric Spaces

arXiv.org Machine Learning

Existing approaches remain largely constrained by traditional distance metrics, limiting their effectiveness in handling random data. In this work, we introduce the first k-means variant in the literature that operates within a probabilistic metric space, replacing conventional distance measures with a well-defined distance distribution function. This pioneering approach enables more flexible and robust clustering in both deterministic and random datasets, establishing a new foundation for clustering in stochastic environments. By adopting a probabilistic perspective, our method not only introduces a fresh paradigm but also establishes a rigorous theoretical framework that is expected to serve as a key reference for future clustering research involving random data. Extensive experiments on diverse real and synthetic datasets assess our model's effectiveness using widely recognized evaluation metrics, including Silhouette, Davies-Bouldin, Calinski Harabasz, the adjusted Rand index, and distortion. Comparative analyses against established methods such as k-means++, fuzzy c-means, and kernel probabilistic k-means demonstrate the superior performance of our proposed random normed k-means (RNKM) algorithm. Notably, RNKM exhibits a remarkable ability to identify nonlinearly separable structures, making it highly effective in complex clustering scenarios. These findings position RNKM as a groundbreaking advancement in clustering research, offering a powerful alternative to traditional techniques while addressing a long-standing gap in the literature. By bridging probabilistic metrics with clustering, this study provides a foundational reference for future developments and opens new avenues for advanced data analysis in dynamic, data-driven applications.


New universal operator approximation theorem for encoder-decoder architectures (Preprint)

arXiv.org Artificial Intelligence

Motivated by the rapidly growing field of mathematics for operator approximation with neural networks, we present a novel universal operator approximation theorem for a broad class of encoder-decoder architectures. In this study, we focus on approximating continuous operators in $\mathcal{C}(\mathcal{X}, \mathcal{Y})$, where $\mathcal{X}$ and $\mathcal{Y}$ are infinite-dimensional normed or metric spaces, and we consider uniform convergence on compact subsets of $\mathcal{X}$. Unlike standard results in the operator learning literature, we investigate the case where the approximating operator sequence can be chosen independently of the compact sets. Taking a topological perspective, we analyze different types of operator approximation and show that compact-set-independent approximation is a strictly stronger property in most relevant operator learning frameworks. To establish our results, we introduce a new approximation property tailored to encoder-decoder architectures, which enables us to prove a universal operator approximation theorem ensuring uniform convergence on every compact subset. This result unifies and extends existing universal operator approximation theorems for various encoder-decoder architectures, including classical DeepONets, BasisONets, special cases of MIONets, architectures based on frames and other related approaches.


Node Embeddings via Neighbor Embeddings

arXiv.org Artificial Intelligence

Graph layouts and node embeddings are two distinct paradigms for non-parametric graph representation learning. In the former, nodes are embedded into 2D space for visualization purposes. In the latter, nodes are embedded into a high-dimensional vector space for downstream processing. State-of-the-art algorithms for these two paradigms, force-directed layouts and random-walk-based contrastive learning (such as DeepWalk and node2vec), have little in common. In this work, we show that both paradigms can be approached with a single coherent framework based on established neighbor embedding methods. Specifically, we introduce graph t-SNE, a neighbor embedding method for two-dimensional graph layouts, and graph CNE, a contrastive neighbor embedding method that produces high-dimensional node representations by optimizing the InfoNCE objective. We show that both graph t-SNE and graph CNE strongly outperform state-of-the-art algorithms in terms of local structure preservation, while being conceptually simpler.


Learning Library Cell Representations in Vector Space

arXiv.org Artificial Intelligence

--We propose Lib2V ec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised learning scheme that systematically extracts training data from Liberty files, removing the need for costly labeling; and (3) an attention-based model architecture that accommodates various pin counts and enables the creation of property-specific cell and arc embeddings. Experimental results demonstrate that Lib2V ec effectively captures functional and electrical similarities. Moreover, linear algebraic operations on cell vectors reveal meaningful relationships, such as vector(BUF) - vector(INV) + vector(NAND) approximating the vector of AND, showcasing the framework's nuanced representation capabilities. Lib2V ec also enhances downstream circuit learning applications, especially when labeled data is scarce. Library cell representations are vital for effective machine learning (ML)-based circuit analysis and optimization, as library cells are the fundamental building blocks of circuit netlists. Traditional methods often rely on manually defined features [1]-[4], requiring extensive expertise and feature engineering. Alternatively, one-hot encoding [5] demands large amounts of domain-specific training data, which may not always be available.


Network Embedding Exploration Tool (NEExT)

arXiv.org Artificial Intelligence

Many real-world and artificial systems and processes can be represented as graphs. Some examples of such systems include social networks, financial transactions, supply chains, and molecular structures. In many of these cases, one needs to consider a collection of graphs, rather than a single network. This could be a collection of distinct but related graphs, such as different protein structures or graphs resulting from dynamic processes on the same network. Examples of the latter include the evolution of social networks, community-induced graphs, or ego-nets around various nodes. A significant challenge commonly encountered is the absence of ground-truth labels for graphs or nodes, necessitating the use of unsupervised techniques to analyze such systems. Moreover, even when ground-truth labels are available, many existing graph machine learning methods depend on complex deep learning models, complicating model explainability and interpretability. To address some of these challenges, we have introduced NEExT (Network Embedding Exploration Tool) for embedding collections of graphs via user-defined node features. The advantages of the framework are twofold: (i) the ability to easily define your own interpretable node-based features in view of the task at hand, and (ii) fast embedding of graphs provided by the Vectorizers library. In this paper, we demonstrate the usefulness of NEExT on collections of synthetic and real-world graphs. For supervised tasks, we demonstrate that performance in graph classification tasks could be achieved similarly to other state-of-the-art techniques while maintaining model interpretability. Furthermore, our framework can also be used to generate high-quality embeddings in an unsupervised way, where target variables are not available.