Supervised Learning
Conan-embedding: General Text Embedding with More and Better Negative Samples
Li, Shiyu, Tang, Yang, Chen, Shizhe, Chen, Xi
With the growing popularity of retrieval-augmented generation (RAG), the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained with contrastive losses, for which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
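To make the role of negative examples concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss for a single query, plus a naive hard negative selector. It is not the conan-embedding implementation: the function names, the temperature value, and the "most similar candidates are the hardest" rule are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive among all candidates.

    The temperature default is an assumed, illustrative value.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over positive and negatives.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]

def hardest_negatives(query, candidates, k):
    """Hypothetical mining rule: keep the k candidates most similar to the query."""
    ranked = sorted(candidates, key=lambda c: cosine(query, c), reverse=True)
    return ranked[:k]
```

A negative that lies close to the query contributes a large term to the denominator and therefore a larger loss, which is why re-mining negatives as the model changes can keep the training signal informative.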
Sigma Flows for Image and Data Labeling and Learning Structured Prediction
Cassel, Jonas, Boll, Bastian, Petra, Stefania, Albers, Peter, Schnörr, Christoph
This paper introduces the sigma flow model for the prediction of structured labelings of data observed on Riemannian manifolds, including Euclidean image domains as a special case. The approach combines the Laplace-Beltrami framework for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi about 25 years ago, and the assignment flow approach introduced and studied by the authors. The sigma flow arises as the Riemannian gradient flow of generalized harmonic energies and thus is governed by a nonlinear geometric PDE which determines a harmonic map from a closed Riemannian domain manifold to a statistical manifold, equipped with the Fisher-Rao metric from information geometry. A specific ingredient of the sigma flow is the mutual dependency of the Riemannian metric of the domain manifold on the evolving state. This makes the approach amenable to machine learning in a specific way, by realizing this dependency through a mapping with compact time-variant parametrization that can be learned from data. Proof-of-concept experiments demonstrate the expressivity of the sigma flow model and its prediction performance. Structural similarities to transformer network architectures and networks generated by the geometric integration of sigma flows are pointed out, which highlights the connection to deep learning and, conversely, may stimulate the use of geometric design principles for structured prediction in other areas of scientific machine learning.
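For orientation, the two standard objects named in the abstract can be written out. This is textbook background rather than the paper's own formulation; the symbols $(M,h)$, $(N,g)$, $f$ and $p$ are notation assumed here.

```latex
% Fisher-Rao metric on the open probability simplex, at a point p with
% tangent vectors u, v (whose components sum to zero):
g_p(u, v) = \sum_{i} \frac{u_i \, v_i}{p_i}.

% Classical harmonic (Dirichlet) energy of a map f : (M, h) \to (N, g);
% its critical points are harmonic maps:
E(f) = \frac{1}{2} \int_M \lVert \mathrm{d}f \rVert^2_{h, g} \, \mathrm{d}\mathrm{vol}_h .
```

In the classical setting the domain metric $h$ is fixed. The abstract describes a generalization in which this metric depends on the evolving state.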
Symplectic Bregman divergences
We present a generalization of Bregman divergences in symplectic vector spaces that we term symplectic Bregman divergences. Symplectic Bregman divergences are derived from a symplectic generalization of the Fenchel-Young inequality which relies on the notion of symplectic subdifferentials. The symplectic Fenchel-Young inequality is obtained using the symplectic Fenchel transform which is defined with respect to the symplectic form. Since symplectic forms can be generically built from pairings of dual systems, we get a generalization of Bregman divergences in dual systems obtained by equivalent symplectic Bregman divergences. In particular, when the symplectic form is derived from an inner product, we show that the corresponding symplectic Bregman divergences amount to ordinary Bregman divergences with respect to composite inner products. Some potential applications of symplectic divergences in geometric mechanics, information geometry, and learning dynamics in machine learning are touched upon.
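As background for the generalization described above, the ordinary (non-symplectic) objects are recalled here for a differentiable strictly convex function $F$ and an inner product $\langle \cdot, \cdot \rangle$. The notation is assumed for illustration and is not taken from the paper.

```latex
% Ordinary Bregman divergence:
B_F(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2)
    - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle .

% Fenchel transform and Fenchel-Young inequality:
F^*(\eta) = \sup_{\theta} \{ \langle \theta, \eta \rangle - F(\theta) \},
\qquad
F(\theta) + F^*(\eta) \ge \langle \theta, \eta \rangle ,

% with equality if and only if \eta = \nabla F(\theta). Writing
% \eta_2 = \nabla F(\theta_2), the divergence is the Fenchel-Young gap:
B_F(\theta_1 : \theta_2) = F(\theta_1) + F^*(\eta_2) - \langle \theta_1, \eta_2 \rangle .
```

According to the abstract, the symplectic construction replaces the inner-product pairing with a symplectic form in the Fenchel transform.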
Machine Learning for Quantifier Selection in cvc5
Jakubův, Jan, Janota, Mikoláš, Piepenbrock, Jelle, Urban, Josef
In this work we considerably improve state-of-the-art SMT solving on first-order quantified problems by efficient machine learning guidance of quantifier selection. Quantifiers represent a significant challenge for SMT and are technically a source of undecidability. In our approach, we train an efficient machine learning model that informs the solver which quantifiers should be instantiated and which should not. Each quantifier may be instantiated multiple times, and the set of active quantifiers changes as the solving progresses. Therefore, we invoke the ML predictor many times during the whole run of the solver. To make this efficient, we use fast ML models based on gradient boosting decision trees. We integrate our approach into the state-of-the-art cvc5 SMT solver and show a considerable increase in the system's holdout-set performance after training it on a large set of first-order problems collected from the Mizar Mathematical Library.
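The repeated-query pattern described above can be sketched as follows. This is not cvc5's API or the authors' model: the `Stump` class, the feature vectors, and the threshold are hypothetical, and only the inference-time shape of a boosted tree ensemble (a sum of small tree outputs passed through a sigmoid) is shown, with training omitted.

```python
import math
from dataclasses import dataclass

@dataclass
class Stump:
    """Depth-1 decision tree; a boosted ensemble sums many of these."""
    feature: int      # index into the quantifier's feature vector
    threshold: float  # split point
    left: float       # score added when the feature value <= threshold
    right: float      # score added otherwise

def predict_proba(stumps, features, bias=0.0):
    """Probability that instantiating this quantifier is useful."""
    raw = bias + sum(
        s.left if features[s.feature] <= s.threshold else s.right
        for s in stumps
    )
    return 1.0 / (1.0 + math.exp(-raw))

def select_quantifiers(stumps, active, threshold=0.5):
    """active maps a quantifier name to its (hypothetical) feature vector.

    Called anew whenever the set of active quantifiers changes.
    """
    return [name for name, feats in active.items()
            if predict_proba(stumps, feats) >= threshold]
```

Because each prediction is only a handful of comparisons and additions, calling the predictor many times per solver run stays cheap, which is the efficiency argument the abstract makes for tree-based models.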
An Information-Theoretic Approach to Generalization Theory
Rodríguez-Gálvez, Borja, Thobaben, Ragnar, Skoglund, Mikael
We investigate the in-distribution generalization of machine learning algorithms. We depart from traditional complexity-based approaches by analyzing information-theoretic bounds that quantify the dependence between a learning algorithm and the training data. We consider two categories of generalization guarantees: 1) Guarantees in expectation: These bounds measure performance in the average case. Here, the dependence between the algorithm and the data is often captured by information measures. While these measures offer an intuitive interpretation, they overlook the geometry of the algorithm's hypothesis class. Here, we introduce bounds using the Wasserstein distance to incorporate geometry, and a structured, systematic method to derive bounds capturing the dependence between the algorithm and an individual datum, and between the algorithm and subsets of the training data. 2) PAC-Bayesian guarantees: These bounds measure the performance level with high probability. Here, the dependence between the algorithm and the data is often measured by the relative entropy. We establish connections between the Seeger--Langford and Catoni's bounds, revealing that the former is optimized by the Gibbs posterior. We introduce novel, tighter bounds for various types of loss functions. To achieve this, we introduce a new technique to optimize parameters in probabilistic statements. To study the limitations of these approaches, we present a counter-example where most of the information-theoretic bounds fail while traditional approaches do not. Finally, we explore the relationship between privacy and generalization. We show that algorithms with a bounded maximal leakage generalize. For discrete data, we derive new bounds for differentially private algorithms that guarantee generalization even with a constant privacy parameter, which is in contrast to previous bounds in the literature.
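For readers unfamiliar with the two families of guarantees, some well-known representatives from the prior literature are recalled here. These are not the new bounds of this work, and the notation ($W$ for the hypothesis, $S$ for a training set of $n$ samples, $P$ and $Q$ for prior and posterior, $\hat{L}_S$ and $L$ for empirical and population risk) is assumed.

```latex
% In expectation (mutual-information bound): if the loss is
% \sigma-sub-Gaussian under the data distribution for every hypothesis, then
\bigl| \mathbb{E}[\,L(W) - \hat{L}_S(W)\,] \bigr|
    \le \sqrt{\frac{2 \sigma^2 \, I(W; S)}{n}} .

% PAC-Bayes, Seeger--Langford type (as commonly stated with Maurer's
% constant, for losses in [0, 1] and n \ge 8): with probability at least
% 1 - \delta over S, simultaneously for all posteriors Q,
\mathrm{kl}\bigl( \hat{L}_S(Q) \,\|\, L(Q) \bigr)
    \le \frac{ \mathrm{KL}(Q \,\|\, P) + \ln \frac{2 \sqrt{n}}{\delta} }{ n } .

% Gibbs posterior: the minimizer over Q of
% \mathbb{E}_Q[\hat{L}_S] + \mathrm{KL}(Q \| P) / \beta is
\frac{\mathrm{d}Q_\beta}{\mathrm{d}P}(w) \propto \exp\bigl( -\beta \, \hat{L}_S(w) \bigr) .
```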
Hodgkinson targets 800m world record set in 1983
Olympic 1500m bronze medal winner Georgia Bell said she is still undecided about whether to become a full-time athlete. The 30-year-old only returned to running three years ago having fallen out of love with the sport. Bell still works for a cyber security software company in London. "I've been on a break over the summer to focus on the Olympics and the plan is to go back in September," she said. "Work have been super-supportive and we'll see what happens. I think it will be really difficult to balance both. So it's something I'm going to think about."
RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
Chen, Meng, Zhang, Kai, He, Zhenying, Jing, Yinan, Wang, X. Sean
Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality (e.g., texts) as the query to retrieve the most similar items from another (e.g., images or videos). However, there is an inherent distribution gap between embeddings from different modalities, and cross-modal queries become Out-of-Distribution (OOD) to the base data. Consequently, state-of-the-art ANNS approaches suffer from poor performance on OOD workloads. In this paper, we quantitatively analyze the properties of the OOD workloads to gain an understanding of their ANNS efficiency. Unlike single-modal workloads, we reveal that OOD queries spatially deviate from base data, and that the k-nearest neighbors of an OOD query are distant from each other in the embedding space. This property breaks the assumptions of existing ANNS approaches and mismatches their design for efficient search. With insights from the OOD workloads, we propose pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of query distribution. Extensive experiments show that RoarGraph significantly outperforms state-of-the-art approaches on modern cross-modal datasets, achieving up to 3.56x faster search speed at a 90% recall rate for OOD queries.
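The "recall rate" used to compare ANNS indexes is measured against exact search. The sketch below shows that measurement only, with brute-force ground truth and recall@k. It is not RoarGraph itself, and the function names are illustrative assumptions.

```python
import math

def exact_knn(query, base, k):
    """Ground truth: indices of the k base vectors nearest to the query.

    Brute force over all base vectors using Euclidean distance.
    """
    order = sorted(range(len(base)), key=lambda i: math.dist(query, base[i]))
    return order[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbors returned by the index."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

An index reported at "90% recall" returns, on average over queries, nine of the ten true nearest neighbors when k is 10; speed comparisons are only meaningful at a matched recall level.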
The Z-Gromov-Wasserstein Distance
Bauer, Martin, Mémoli, Facundo, Needham, Tom, Nishino, Mao
The Gromov-Wasserstein (GW) distance is a powerful tool for comparing metric measure spaces which has found broad applications in data science and machine learning. Driven by the need to analyze datasets whose objects have increasingly complex structure (such as node and edge-attributed graphs), several variants of GW distance have been introduced in the recent literature. With a view toward establishing a general framework for the theory of GW-like distances, this paper considers a vast generalization of the notion of a metric measure space: for an arbitrary metric space $Z$, we define a $Z$-network to be a measure space endowed with a kernel valued in $Z$. We introduce a method for comparing $Z$-networks by defining a generalization of GW distance, which we refer to as $Z$-Gromov-Wasserstein ($Z$-GW) distance. This construction subsumes many previously known metrics and offers a unified approach to understanding their shared properties. The paper demonstrates that the $Z$-GW distance defines a metric on the space of $Z$-networks which retains desirable properties of $Z$, such as separability, completeness, and geodesicity. Many of these properties were unknown for existing variants of GW distance that fall under our framework. Our focus is on foundational theory, but our results also include computable lower bounds and approximations of the distance which will be useful for practical applications.
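For reference, the classical GW distance between metric measure spaces $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ is recalled first. The second display is one natural reading of the generalization described in the abstract, with kernels $\omega_X, \omega_Y$ valued in $(Z, d_Z)$; the paper's exact normalization and conventions are not given here and may differ.

```latex
% Classical Gromov-Wasserstein distance (p \ge 1), minimizing over
% couplings \pi of \mu_X and \mu_Y:
\mathrm{GW}_p(X, Y) = \frac{1}{2} \inf_{\pi \in \Pi(\mu_X, \mu_Y)}
  \left( \iint \bigl| d_X(x, x') - d_Y(y, y') \bigr|^p
  \, \mathrm{d}\pi(x, y) \, \mathrm{d}\pi(x', y') \right)^{1/p} .

% Assumed form of the Z-valued generalization: the absolute difference
% of real-valued distances is replaced by the distance in Z between
% kernel values.
\mathrm{GW}^Z_p(X, Y) = \frac{1}{2} \inf_{\pi \in \Pi(\mu_X, \mu_Y)}
  \left( \iint d_Z\bigl( \omega_X(x, x'), \omega_Y(y, y') \bigr)^p
  \, \mathrm{d}\pi(x, y) \, \mathrm{d}\pi(x', y') \right)^{1/p} .
```

Taking $Z = \mathbb{R}$ with the usual distance and $\omega_X = d_X$, $\omega_Y = d_Y$ recovers the classical definition.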
CarbonClipper: Optimal Algorithms for Carbon-Aware Spatiotemporal Workload Management
Lechowicz, Adam, Christianson, Nicolas, Sun, Bo, Bashir, Noman, Hajiesmaili, Mohammad, Wierman, Adam, Shenoy, Prashant
We study carbon-aware spatiotemporal workload management, which seeks to address the growing environmental impact of data centers. We formalize this as an online problem called spatiotemporal online allocation with deadline constraints ($\mathsf{SOAD}$), in which an online player completes a workload (e.g., a batch compute job) by moving and scheduling the workload across a network subject to a deadline $T$. At each time step, a service cost function is revealed, representing, e.g., the carbon intensity of servicing a workload at each location, and the player must irrevocably decide the current allocation. Furthermore, whenever the player moves the allocation, it incurs a movement cost defined by a metric space $(X,d)$ that captures, e.g., the overhead of migrating a compute job. $\mathsf{SOAD}$ formalizes the open problem of combining general metrics and deadline constraints in the online algorithms literature, unifying problems such as metrical task systems and online search. We propose a competitive algorithm for $\mathsf{SOAD}$ along with a matching lower bound that proves it is optimal. Our main algorithm, ${\rm C{\scriptsize ARBON}C{\scriptsize LIPPER}}$, is a learning-augmented algorithm that takes advantage of predictions (e.g., carbon intensity forecasts) and achieves an optimal consistency-robustness trade-off. We evaluate our proposed algorithms for carbon-aware spatiotemporal workload management on a simulated global data center network, showing that ${\rm C{\scriptsize ARBON}C{\scriptsize LIPPER}}$ significantly improves performance compared to baseline methods and delivers meaningful carbon reductions.
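The cost structure described above (a service cost per step, a movement cost when the allocation changes location, and a deadline) can be sketched with a simplified evaluator for a fixed schedule. This is not the CarbonClipper algorithm or the exact $\mathsf{SOAD}$ formulation: the single-location-per-step schedule, the unit workload, and all names are assumptions made for illustration.

```python
def schedule_cost(schedule, carbon, dist, start):
    """Total cost of a fixed schedule for one unit of workload.

    schedule: list of (location, fraction_processed), one entry per time step
    carbon:   list of dicts, carbon[t][location] = service cost at step t
    dist:     movement cost function on locations (a metric)
    start:    initial location of the workload
    """
    cost = 0.0
    done = 0.0
    prev = start
    for t, (loc, fraction) in enumerate(schedule):
        cost += dist(prev, loc)            # movement (migration) cost
        cost += fraction * carbon[t][loc]  # service (carbon) cost
        done += fraction
        prev = loc
    if done < 1.0 - 1e-9:
        raise ValueError("workload not completed by the deadline")
    return cost
```

An online algorithm must choose each schedule entry knowing only the cost functions revealed so far; an evaluator like this measures a finished schedule against the offline optimum, which is how a competitive ratio is defined.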
A Structural Feature-Based Approach for Comprehensive Graph Classification
Islam, Saiful, Hasan, Md. Nahid, Khanra, Pitambar
The increasing prevalence of graph-structured data across various domains has intensified interest in graph classification tasks. While numerous sophisticated graph learning methods have emerged, their complexity often hinders practical implementation. In this article, we address this challenge by proposing a method that constructs feature vectors based on fundamental graph structural properties. We demonstrate that these features, despite their simplicity, are powerful enough to capture the intrinsic characteristics of graphs within the same class. We explore the efficacy of our approach using three distinct machine learning methods, highlighting how our feature-based classification leverages the inherent structural similarities of graphs within the same class to achieve accurate classification. A key advantage of our approach is its simplicity, which makes it accessible and adaptable to a broad range of applications, including social network analysis, bioinformatics, and cybersecurity. Furthermore, we conduct extensive experiments to validate the performance of our method, showing that it not only achieves competitive performance but in some cases surpasses the accuracy of more complex, state-of-the-art techniques. Our findings suggest that a focus on fundamental graph features can provide a robust and efficient alternative for graph classification, offering significant potential for both research and practical applications.
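As an illustration of what a structural feature vector might look like, the sketch below computes a few basic properties of a simple undirected graph. The abstract does not list the features the authors use, so this particular selection (node count, edge count, density, average degree, maximum degree, triangle count) is an assumption.

```python
def structural_features(n_nodes, edges):
    """Feature vector for an undirected simple graph on nodes 0..n_nodes-1.

    Returns [nodes, edges, density, average degree, max degree, triangles].
    Self-loops and duplicate edges are ignored.
    """
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degrees = [len(adj[v]) for v in adj]
    n_edges = sum(degrees) // 2
    density = 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    avg_degree = sum(degrees) / n_nodes if n_nodes else 0.0
    # Common neighbors of each edge's endpoints; every triangle is
    # counted once per edge, hence the division by 3.
    triangles = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3
    return [n_nodes, n_edges, density, avg_degree, max(degrees, default=0), triangles]
```

Vectors of this kind have a fixed length regardless of graph size, so they can be passed directly to any standard classifier.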