Representation Of Examples
A Vectorization Method Induced By Maximal Margin Classification For Persistent Diagrams
Wu, An, Pan, Yu, Zhou, Fuqi, Yan, Jinghui, Liu, Chuanlu
Persistent homology is an effective method for extracting topological information, represented as persistent diagrams, of spatial structure data. Hence it is well-suited for the study of protein structures. Attempts to incorporate persistent homology in machine learning methods of protein function prediction have resulted in several techniques for vectorizing persistent diagrams. However, current vectorization methods are excessively artificial and cannot ensure the effective utilization of information or the rationality of the methods. To address this problem, we propose a more geometrical vectorization method of persistent diagrams based on maximal margin classification for Banach space, and additionally propose a framework that utilizes topological data analysis to identify proteins with specific functions. We evaluated our vectorization method using a binary classification task on proteins and compared it with the statistical methods that exhibit the best performance among thirteen commonly used vectorization methods. The experimental results indicate that our approach surpasses the statistical methods in both robustness and precision.
Artificial neural networks on graded vector spaces
We develop new artificial neural network models for graded vector spaces, which are suitable when different features in the data have different significance (weights). This is the first time that such models are designed mathematically and they are expected to perform better than neural networks over usual vector spaces, which are the special case when the gradings are all 1s.
Relational Composition in Neural Networks: A Survey and Call to Action
Wattenberg, Martin, Viégas, Fernanda B.
Many neural nets appear to represent data as linear combinations of "feature vectors." Algorithms for discovering these vectors have seen impressive recent success. However, we argue that this success is incomplete without an understanding of relational composition: how (or whether) neural nets combine feature vectors to represent more complicated relationships. To facilitate research in this area, this paper offers a guided tour of various relational mechanisms that have been proposed, along with preliminary analysis of how such mechanisms might affect the search for interpretable features. We end with a series of promising areas for empirical research, which may help determine how neural networks represent structured data.
Metric Dimension and Resolvability of Jaccard Spaces
Lladser, Manuel E., Paradise, Alexander J.
A subset of points in a metric space is said to resolve it if each point in the space is uniquely characterized by its distance to each point in the subset. In particular, resolving sets can be used to represent points in abstract metric spaces as Euclidean vectors. Importantly, due to the triangle inequality, points close by in the space are represented as vectors with similar coordinates, which may find applications in classification problems of symbolic objects under suitably chosen metrics. In this manuscript, we address the resolvability of Jaccard spaces, i.e., metric spaces of the form $(2^X,\text{Jac})$, where $2^X$ is the power set of a finite set $X$, and $\text{Jac}$ is the Jaccard distance between subsets of $X$. Specifically, for different $a,b\in 2^X$, $\text{Jac}(a,b)=|a\Delta b|/|a\cup b|$, where $|\cdot|$ denotes size (i.e., cardinality) and $\Delta$ denotes the symmetric difference of sets. We combine probabilistic and linear algebra arguments to construct highly likely but nearly optimal (i.e., of minimal size) resolving sets of $(2^X,\text{Jac})$. In particular, we show that the metric dimension of $(2^X,\text{Jac})$, i.e., the minimum size of a resolving set of this space, is $\Theta(|X|/\ln|X|)$. In addition, we show that a much smaller subset of $2^X$ suffices to resolve, with high probability, all different pairs of subsets of $X$ of cardinality at most $\sqrt{|X|}/\ln|X|$, up to a factor.
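The definitions in this abstract can be checked by brute force on a very small ground set. The following is a minimal sketch, not the authors' probabilistic construction: the helper names `jaccard`, `power_set`, `distance_vector`, and `resolves` are illustrative assumptions, and the distance between two equal sets is taken to be 0 (which also covers the empty set, where the ratio is undefined).

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, taken as 0 when a == b."""
    if a == b:
        return Fraction(0)
    return Fraction(len(a ^ b), len(a | b))


def power_set(x):
    """All subsets of x, as frozensets."""
    items = sorted(x)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def distance_vector(point, landmarks):
    """Represent a point by its exact distances to each landmark."""
    return tuple(jaccard(point, r) for r in landmarks)


def resolves(landmarks, space):
    """True if every point in space has a distinct distance vector."""
    seen = set()
    for p in space:
        v = distance_vector(p, landmarks)
        if v in seen:
            return False
        seen.add(v)
    return True


X = {1, 2, 3}
space = power_set(X)
singletons = [frozenset({x}) for x in sorted(X)]
print(resolves(singletons, space))
print(resolves([frozenset({1})], space))
```

For this three-element ground set the three singletons resolve the whole power set, while the single landmark `{1}` does not, since the empty set and `{2}` are both at distance 1 from it. Exact fractions are used so that equal distances compare equal without floating-point noise. The exhaustive check is exponential in $|X|$ and is only meant to make the definition concrete.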
DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries
This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to the methodological limitation of using word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary, and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not allowed because its weight is frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings not subject to the limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), involving the following strengths: 1) the task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SIMCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among the approaches without using manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.
Gram2Vec: An Interpretable Document Vectorizer
Zeng, Peter, Sclafani, Eric, Rambow, Owen
We present Gram2Vec, a grammatical style embedding algorithm that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches, Gram2Vec offers inherent interpretability based on how the feature vectors are generated. In our demo, we present a way to visualize a mapping of authors to documents based on their Gram2Vec vectors and highlight the ability to drop or add features to view which authors make certain linguistic choices. Next, we use authorship attribution as an application to show how Gram2Vec can explain why a document is attributed to a certain author, using cosine similarities between the Gram2Vec feature vectors to calculate the distances between candidate documents and a query document.
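The general idea of comparing documents through normalized feature frequencies and cosine similarity can be sketched as follows. This is an illustration under stated assumptions, not Gram2Vec's implementation or API: the feature list `FEATURES`, the crude tokenizer, and the function names are all hypothetical, and the real system extracts grammatical features that this toy version does not.

```python
import math
from collections import Counter

# Hypothetical stand-in features (a few function words and punctuation marks).
FEATURES = ["the", "of", "and", ",", ";", "!"]


def style_vector(text):
    """Relative frequency of each feature among all tokens of the text."""
    for mark in (",", ";", "!"):
        text = text.replace(mark, f" {mark} ")
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[f] / total for f in FEATURES]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute(query, candidates):
    """Return the candidate author whose document is most similar to query."""
    q = style_vector(query)
    return max(candidates,
               key=lambda author: cosine_similarity(q, style_vector(candidates[author])))


candidates = {
    "A": "the cat and the dog, the bird and the fish",
    "B": "Stop! Run! Go; now!",
}
print(attribute("the sun and the moon, the stars", candidates))
```

Because each coordinate corresponds to a named feature, the attribution can be explained by inspecting which coordinates the query and the chosen candidate share, which is the kind of interpretability the abstract refers to.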
Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies
Automatic Speech Assessment (ASA) has seen notable advancements with the utilization of self-supervised features (SSL) in recent research. However, a key challenge in ASA lies in the imbalanced distribution of data, particularly evident in English test datasets. To address this challenge, we approach ASA as an ordinal classification task, introducing Weighted Vectors Ranking Similarity (W-RankSim) as a novel regularization technique. W-RankSim encourages closer proximity of weighted vectors in the output layer for similar classes, implying that feature vectors with similar labels would be gradually nudged closer to each other as they converge towards corresponding weighted vectors. Extensive experimental evaluations confirm the effectiveness of our approach in improving ordinal classification performance for ASA. Furthermore, we propose a hybrid model that combines SSL and handcrafted features, showcasing how the inclusion of handcrafted features enhances performance in an ASA system.
Neural networks in non-metric spaces
Leveraging the infinite dimensional neural network architecture we proposed in arXiv:2109.13512v4 and which can process inputs from Fréchet spaces, and using the universal approximation property shown therein, we now largely extend the scope of this architecture by proving several universal approximation theorems for a vast class of input and output spaces. More precisely, the input space $\mathfrak X$ is allowed to be a general topological space satisfying only a mild condition ("quasi-Polish"), and the output space can be either another quasi-Polish space $\mathfrak Y$ or a topological vector space $E$. Similarly to arXiv:2109.13512v4, we show furthermore that our neural network architectures can be projected down to "finite dimensional" subspaces with any desirable accuracy, thus obtaining approximating networks that are easy to implement and allow for fast computation and fitting. The resulting neural network architecture is therefore applicable for prediction tasks based on functional data. To the best of our knowledge, this is the first result which deals with such a wide class of input/output spaces and simultaneously guarantees the numerical feasibility of the ensuing architectures. Finally, we prove an obstruction result which indicates that the category of quasi-Polish spaces is in a certain sense the correct category to work with if one aims at constructing approximating architectures on infinite-dimensional spaces $\mathfrak X$ which, at the same time, have sufficient expressive power to approximate continuous functions on $\mathfrak X$, are specified by a finite number of parameters only and are "stable" with respect to these parameters.
Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance
Aliakbarisani, Roya, Jankowski, Robert, Serrano, M. Ángeles, Boguñá, Marián
Graph Neural Networks (GNNs) have excelled in predicting graph properties in various applications ranging from identifying trends in social networks to drug discovery and malware detection. With the abundance of new architectures and increased complexity, GNNs are becoming highly specialized when tested on a few well-known datasets. However, how the performance of GNNs depends on the topological and feature properties of graphs is still an open question. In this work, we introduce a comprehensive benchmarking framework for graph machine learning, focusing on the performance of GNNs across varied network structures. Utilizing the geometric soft configuration model in hyperbolic space, we generate synthetic networks with realistic topological properties and node feature vectors. This approach enables us to assess the impact of network properties, such as topology-feature correlation, degree distributions, local density of triangles (or clustering), and homophily, on the effectiveness of different GNN architectures. Our results highlight the dependency of model performance on the interplay between network structure and node features, providing insights for model selection in various scenarios. This study contributes to the field by offering a versatile tool for evaluating GNNs, thereby assisting in developing and selecting suitable models based on specific data characteristics.
Predicting Drug-Gene Relations via Analogy Tasks with Word Embeddings
Yamagiwa, Hiroaki, Hashimoto, Ryoma, Arakane, Kiwamu, Murakami, Ken, Soeda, Shou, Oyama, Momose, Okada, Mariko, Shimodaira, Hidetoshi
Natural language processing (NLP) is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For instance, $\mathrm{\textit{king}} - \mathrm{\textit{man}} + \mathrm{\textit{woman}}$ predicts $\mathrm{\textit{queen}}$. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year.
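The analogy computation mentioned in this abstract can be illustrated with a toy example. The sketch below assumes hand-made two-dimensional embeddings chosen so that the arithmetic works exactly; the vectors, the word list, and the function names `cosine` and `analogy` are hypothetical and are not taken from BioConceptVec or from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def analogy(emb, a, b, c):
    """Solve "a is to b as c is to ?" by finding the word nearest to b - a + c.

    The three query words are excluded from the candidates, as is usual
    for this task.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings (hypothetical values): first axis ~ royalty, second ~ gender.
emb = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
print(analogy(emb, "man", "king", "woman"))
```

With these vectors, `king - man + woman` equals the `queen` vector exactly, so the nearest remaining word is `queen`. Applying the same function to drug and gene embeddings would require a relation vector derived from known drug-gene pairs; how that vector is constructed is specific to the paper and is not reproduced here.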