Representation Of Examples
Fast, smooth and adaptive regression in metric spaces
It was recently shown that certain nonparametric regressors can escape the curse of dimensionality in the sense that their convergence rates adapt to the intrinsic dimension of data (\cite{BL:65, SK:77}). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, operates on a general metric space, yields a smooth function, and evaluates in time O(\log n) . We derive a tight convergence rate of the form n {-2/(2 d)} where d is the Assouad dimension of the input space.
On dimensionality of feature vectors in MPNNs
Bravo, César, Kozachinskiy, Alexander, Rojas, Cristóbal
We revisit the classical result of Morris et al.~(AAAI'19) that message-passing graphs neural networks (MPNNs) are equal in their distinguishing power to the Weisfeiler--Leman (WL) isomorphism test. Morris et al.~show their simulation result with ReLU activation function and $O(n)$-dimensional feature vectors, where $n$ is the number of nodes of the graph. Recently, by introducing randomness into the architecture, Aamand et al.~(NeurIPS'22) were able to improve this bound to $O(\log n)$-dimensional feature vectors, although at the expense of guaranteeing perfect simulation only with high probability. In all these constructions, to guarantee equivalence to the WL test, the dimension of feature vectors in the MPNN has to increase with the size of the graphs. However, architectures used in practice have feature vectors of constant dimension. Thus, there is a gap between the guarantees provided by these results and the actual characteristics of architectures used in practice. In this paper we close this gap by showing that, for \emph{any} non-polynomial analytic (like the sigmoid) activation function, to guarantee that MPNNs are equivalent to the WL test, feature vectors of dimension $d=1$ is all we need, independently of the size of the graphs. Our main technical insight is that for simulating multi-sets in the WL-test, it is enough to use linear independence of feature vectors over rationals instead of reals. Countability of the set of rationals together with nice properties of analytic functions allow us to carry out the simulation invariant over the iterations of the WL test without increasing the dimension of the feature vectors.
Understanding Neural Network Systems for Image Analysis using Vector Spaces and Inverse Maps
Pattichis, Rebecca, Pattichis, Marios S.
There is strong interest in developing mathematical methods that can be used to understand complex neural networks used in image analysis. In this paper, we introduce techniques from Linear Algebra to model neural network layers as maps between signal spaces. First, we demonstrate how signal spaces can be used to visualize weight spaces and convolutional layer kernels. We also demonstrate how residual vector spaces can be used to further visualize information lost at each layer. Second, we introduce the concept of invertible networks and an algorithm for computing input images that yield specific outputs. We demonstrate our approach on two invertible networks and ResNet18.
Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space
Recently, optimization on the Riemannian manifold has provided new insights to the optimization community. In this regard, the manifold taken as the probability measure metric space equipped with the second-order Wasserstein distance is of particular interest, since optimization on it can be linked to practical sampling processes. In general, the oracle (continuous) optimization method on Wasserstein space is Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the continuous optimization methods in the Wasserstein space by extending the gradient flow into the stochastic gradient descent (SGD) flow and stochastic variance reduction gradient (SVRG) flow. The two flows on Euclidean space are standard stochastic optimization methods, while their Riemannian counterparts are not explored yet. By leveraging the structures in Wasserstein space, we construct a stochastic differential equation (SDE) to approximate the discrete dynamics of desired stochastic methods in the corresponded random vector space. Then, the flows of probability measures are naturally obtained by applying Fokker-Planck equation to such SDE. Furthermore, the convergence rates of the proposed Riemannian stochastic flows are proven, and they match the results in Euclidean space.
Proportional Representation in Metric Spaces and Low-Distortion Committee Selection
Kalayci, Yusuf Hakan, Kempe, David, Kher, Vikram
We introduce a novel definition for a small set R of k points being "representative" of a larger set in a metric space. Given a set V (e.g., documents or voters) to represent, and a set C of possible representatives, our criterion requires that for any subset S comprising a theta fraction of V, the average distance of S to their best theta*k points in R should not be more than a factor gamma compared to their average distance to the best theta*k points among all of C. This definition is a strengthening of proportional fairness and core fairness, but - different from those notions - requires that large cohesive clusters be represented proportionally to their size. Since there are instances for which - unless gamma is polynomially large - no solutions exist, we study this notion in a resource augmentation framework, implicitly stating the constraints for a set R of size k as though its size were only k/alpha, for alpha > 1. Furthermore, motivated by the application to elections, we mostly focus on the "ordinal" model, where the algorithm does not learn the actual distances; instead, it learns only for each point v in V and each candidate pairs c, c' which of c, c' is closer to v. Our main result is that the Expanding Approvals Rule (EAR) of Aziz and Lee is (alpha, gamma) representative with gamma <= 1 + 6.71 * (alpha)/(alpha-1). Our results lead to three notable byproducts. First, we show that the EAR achieves constant proportional fairness in the ordinal model, giving the first positive result on metric proportional fairness with ordinal information. Second, we show that for the core fairness objective, the EAR achieves the same asymptotic tradeoff between resource augmentation and approximation as the recent results of Li et al., which used full knowledge of the metric. Finally, our results imply a very simple single-winner voting rule with metric distortion at most 44.
Feature Network Methods in Machine Learning and Applications
A machine learning (ML) feature network is a graph that connects ML features in learning tasks based on their similarity. This network representation allows us to view feature vectors as functions on the network. By leveraging function operations from Fourier analysis and from functional analysis, one can easily generate new and novel features, making use of the graph structure imposed on the feature vectors. Such network structures have previously been studied implicitly in image processing and computational biology. We thus describe feature networks as graph structures imposed on feature vectors, and provide applications in machine learning. One application involves graph-based generalizations of convolutional neural networks, involving structured deep learning with hierarchical representations of features that have varying depth or complexity. This extends also to learning algorithms that are able to generate useful new multilevel features. Additionally, we discuss the use of feature networks to engineer new features, which can enhance the expressiveness of the model. We give a specific example of a deep tree-structured feature network, where hierarchical connections are formed through feature clustering and feed-forward learning. This results in low learning complexity and computational efficiency. Unlike "standard" neural features which are limited to modulated (thresholded) linear combinations of adjacent ones, feature networks offer more general feedforward dependencies among features. For example, radial basis functions or graph structure-based dependencies between features can be utilized.
Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. II
Kumari, Sushma, Pestov, Vladimir G.
We continue to investigate the $k$ nearest neighbour learning rule in separable metric spaces. Thanks to the results of C\'erou and Guyader (2006) and Preiss (1983), this rule is known to be universally consistent in every metric space $X$ that is sigma-finite dimensional in the sense of Nagata. Here we show that the rule is strongly universally consistent in such spaces in the absence of ties. Under the tie-breaking strategy applied by Devroye, Gy\"{o}rfi, Krzy\.{z}ak, and Lugosi (1994) in the Euclidean setting, we manage to show the strong universal consistency in non-Archimedian metric spaces (that is, those of Nagata dimension zero). Combining the theorem of C\'erou and Guyader with results of Assouad and Quentin de Gromard (2006), one deduces that the $k$-NN rule is universally consistent in metric spaces having finite dimension in the sense of de Groot. In particular, the $k$-NN rule is universally consistent in the Heisenberg group which is not sigma-finite dimensional in the sense of Nagata as follows from an example independently constructed by Kor\'anyi and Reimann (1995) and Sawyer and Wheeden (1992).
Spectral Persistent Homology: Persistence Signals
Van Huffel, Michael Etienne, Palo, Matteo
In this paper, we present a novel family of descriptors for persistence diagrams, reconceptualizing them as signals in $\mathbb R^2_+$. This marks a significant advancement in Topological Data Analysis. Our methodology transforms persistence diagrams into a finite-dimensional vector space through functionals of the discrete measures induced by these diagrams. While our focus is primarily on frequency-based transformations, we do not restrict our approach exclusively to this types of techniques. We term this family of transformations as $Persistence$ $Signals$ and prove stability for some members of this family against the 1-$Kantorovitch$-$Rubinstein$ metric, ensuring its responsiveness to subtle data variations. Extensive comparative analysis reveals that our descriptor performs competitively with the current state-of-art from the topological data analysis literature, and often surpasses, the existing methods. This research not only introduces a groundbreaking perspective for data scientists but also establishes a foundation for future innovations in applying persistence diagrams in data analysis and machine learning.
Observable Propagation: A Data-Efficient Approach to Uncover Feature Vectors in Transformers
A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called "observable propagation" (in short: ObsProp), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of observables, linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we provide theoretical motivation for why LayerNorm nonlinearities do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the coupling coefficient which estimates the degree to which one feature's output correlates with another's. We use ObsProp to perform extensive qualitative investigations into several tasks, including gendered occupational bias, political party prediction, and programming language detection. Our results suggest that ObsProp surpasses traditional approaches for finding feature vectors in the low-data regime, and that ObsProp can be used to better understand the mechanisms responsible for bias in large language models. Code for experiments can be found at github.com/jacobdunefsky/ObservablePropagation.
RDF-star2Vec: RDF-star Graph Embeddings for Data Mining
Egami, Shusaku, Ugai, Takanori, Oota, Masateru, Matsushita, Kyoumoto, Kawamura, Takahiro, Kozaki, Kouji, Fukuda, Ken
Knowledge Graphs (KGs) such as Resource Description Framework (RDF) data represent relationships between various entities through the structure of triples (