### Small Sample Spaces for Gaussian Processes

It is known that the membership in a given reproducing kernel Hilbert space (RKHS) of the samples of a Gaussian process $X$ is controlled by a certain nuclear dominance condition. However, it is less clear how to identify a "small" set of functions (not necessarily a vector space) that contains the samples. This article presents a general approach for identifying such sets. We use scaled RKHSs, which can be viewed as a generalisation of Hilbert scales, to define the sample support set as the largest set which is contained in every element of full measure under the law of $X$ in the $\sigma$-algebra induced by the collection of scaled RKHSs. This potentially non-measurable set is then shown to consist of those functions that can be expanded in terms of an orthonormal basis of the RKHS of the covariance kernel of $X$ and have their squared basis coefficients bounded away from zero and infinity, a result suggested by the Karhunen-Loève theorem.

### Foundations of Population-Based SHM, Part IV: The Geometry of Spaces of Structures and their Feature Spaces

One of the requirements of the population-based approach to Structural Health Monitoring (SHM) proposed in the earlier papers in this sequence is that structures be represented by points in an abstract space. Furthermore, these spaces should be metric spaces in a loose sense; i.e. there should be some measure of distance applicable to pairs of points; similar structures should then be close in the metric. However, this geometrical construction is not enough for the framing of problems in data-based SHM, as it leaves undefined the notion of feature spaces. Interpreting the feature values on a structure-by-structure basis as a type of field over the space of structures, it seems sensible to borrow an idea from modern theoretical physics, and define feature assignments as sections in a vector bundle over the structure space. With this idea in place, one can interpret the effect of environmental and operational variations as gauge degrees of freedom, as in modern gauge field theories. This paper will discuss the various geometrical structures required for an abstract theory of feature spaces in SHM, and will draw analogies with how these structures have shown their power in modern physics. In the second part of the paper, the problem of determining the normal condition cross section of a feature bundle is addressed. The solution is provided by the application of Graph Neural Networks (GNN), a versatile non-Euclidean machine learning algorithm which is not restricted to inputs and outputs from vector spaces. In particular, the algorithm is well suited to operating directly on the sort of graph structures which are an important part of the proposed framework for PBSHM. The solution of the normal section problem is demonstrated for a heterogeneous population of truss structures for which the feature of interest is the first natural frequency.

### Moreau-Yosida $f$-divergences

[...] central to many machine learning algorithms, with Lipschitz constrained variants recently gaining attention. Inspired by this, we generalize the so-called tight variational representation of $f$-divergences in the case of probability measures on compact metric spaces to be taken over the space of Lipschitz functions vanishing at an arbitrary base point, characterize functions achieving the supremum in the variational representation, propose a practical algorithm to calculate the tight convex conjugate of $f$-divergences compatible with automatic differentiation frameworks, define the Moreau-Yosida approximation of $f$-divergences with respect to the Wasserstein-1 metric, and derive the corresponding variational formulas, providing a generalization of a number of recent results, novel special cases of interest [...]

Another is the family of optimal transport distances (Villani, 2008), including the Wasserstein-1 metric. In general, variational representations are supremums of integral formulas taken over sets of functions, such as the Donsker-Varadhan formula (Donsker & Varadhan, 1976) for the Kullback-Leibler divergence or the Kantorovich-Rubinstein formula (Villani, 2008) for the Wasserstein-1 metric. Informally speaking, one can implement (Nowozin et al., 2016; Arjovsky et al., 2017) such a formula by constructing a real-valued neural network taking samples from the two probability measures as inputs, which is then trained to maximize the integral formula in order to approximate the supremum, resulting in a learned proxy to the actual divergence of said probability measures. Implementing the Kantorovich-Rubinstein formula in such a way involves restricting the Lipschitz constant of the neural network (Gulrajani et al., 2017; Petzka et al., 2018; Miyato et al., 2018), which effectively stabilizes the approximation procedure.
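For reference, the two variational formulas named in this section are commonly stated as below. The notation is an assumption made here for illustration and is not taken from the source: $P$ and $Q$ are probability measures on the same space, $f$ is a real-valued test function, and $\|f\|_{L}$ is its Lipschitz constant.

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{f} \left\{ \mathbb{E}_{x \sim P}[f(x)] - \log \mathbb{E}_{x \sim Q}\left[e^{f(x)}\right] \right\}
$$

$$
W_1(P, Q) = \sup_{\|f\|_{L} \le 1} \left\{ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right\}
$$

The first (Donsker-Varadhan) supremum runs over bounded measurable functions; the second (Kantorovich-Rubinstein) runs over 1-Lipschitz functions. In both cases the expectations can be estimated from samples, which is what makes the neural-network implementation described above possible.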

### Feature Stores need an HTAP Database

A Feature Store is a collection of organized and curated features used for training and serving Machine Learning models. Keeping them up to date, serving feature vectors, and creating training data sets requires a combination of transactional (OLTP) and analytical (OLAP) database processing. This kind of mixed workload database is called HTAP for hybrid transactional analytical processing. The most useful Feature Stores incorporate data pipelines that continuously keep their features up to date through either batch or real-time processing that matches the cadence of the source data. Since these features are always up to date, they provide an ideal source of feature vectors used for inferencing.

### Hard negative examples are hard, but useful

Triplet loss is an extremely common approach to distance metric learning. Representations of images from the same class are optimized to be mapped closer together in an embedding space than representations of images from different classes. Much work on triplet losses focuses on selecting the most useful triplets of images to consider, with strategies that select dissimilar examples from the same class or similar examples from different classes. The consensus of previous research is that optimizing with the *hardest* negative examples leads to bad training behavior. That's a problem -- these hardest negatives are literally the cases where the distance metric fails to capture semantic similarity. In this paper, we characterize the space of triplets and derive why hard negatives make triplet loss training fail. We offer a simple fix to the loss function and show that, with this fix, optimizing with hard negative examples becomes feasible. This leads to more generalizable features, and image retrieval results that outperform state of the art for datasets with high intra-class variance.
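To make the setting concrete, here is a minimal NumPy sketch of the standard margin-based triplet loss with hardest-negative selection inside a batch. It illustrates the baseline that the abstract describes as problematic and is not the fix proposed in the paper. The function names, the squared Euclidean distance, and the margin value are all assumptions chosen for illustration.

```python
import numpy as np


def pairwise_sq_dists(emb):
    """Squared Euclidean distance between every pair of rows in emb."""
    sq_norms = np.sum(emb ** 2, axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off


def hardest_negative_triplet_loss(emb, labels, margin=0.2):
    """Mean hinge loss over all (anchor, positive) pairs, where each anchor
    is paired with its closest different-class example (hardest negative)."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    d = pairwise_sq_dists(emb)
    same = labels[:, None] == labels[None, :]

    # Hardest negative per anchor: smallest distance to another class.
    neg_d = np.where(same, np.inf, d).min(axis=1)

    losses = []
    n = len(labels)
    for a in range(n):
        if not np.isfinite(neg_d[a]):
            continue  # no negative available for this anchor
        for p in range(n):
            if p != a and same[a, p]:
                losses.append(max(d[a, p] - neg_d[a] + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0


# Hypothetical toy batch: two classes with two 2-D embeddings each.
toy_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
toy_labels = [0, 0, 1, 1]
toy_loss = hardest_negative_triplet_loss(toy_emb, toy_labels)
```

In the toy batch only the anchor at `[0, 2]` violates the margin: its positive at `[5, 5]` is much farther away than its hardest negative at `[0, 0]`, so it alone contributes to the loss.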

### Linear Classifiers in Mixed Constant Curvature Spaces

Embedding methods for mixed-curvature spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Nevertheless, little is known regarding downstream learning and optimization in the embedding space. Here, we address for the first time the problem of linear classification in a product space form -- a mix of Euclidean, spherical, and hyperbolic spaces with different dimensions. First, we revisit the definition of a linear classifier on a Riemannian manifold by using geodesics and Riemannian metrics which generalize the notions of straight lines and inner products in vector spaces, respectively. Second, we prove that linear classifiers in $d$-dimensional constant curvature spaces can shatter exactly $d+1$ points; hence, Euclidean, hyperbolic and spherical classifiers have the same expressive power. Third, we formalize linear classifiers in product space forms, describe a novel perceptron classification algorithm, and establish rigorous convergence results. We support our theoretical findings with simulation results on several datasets, including synthetic data, MNIST and Omniglot. Our results reveal that learning methods applied to small-dimensional embeddings in product space forms significantly outperform their algorithmic counterparts in Euclidean spaces.

### Rethinking Ranking-based Loss Functions: Only Penalizing Negative Instances before Positive Ones is Enough

Optimising an approximation of Average Precision (AP) has been widely studied for retrieval. Such methods consider both the negative and the positive instances ranked before each target positive one, following the definition of AP. However, we argue that penalizing only the negative instances ranked before positive ones is enough, because the loss comes only from them. To this end, instead of following the AP-based loss, we propose a new loss, namely Penalizing Negative instances before Positive ones (PNP), which directly minimizes the number of negative instances before each positive one. Meanwhile, limited by the definition of AP, AP-based methods adopt only one specific gradient assignment strategy, and we ask whether better ones exist. We therefore systematically investigate different gradient assignment solutions by constructing derivative functions of the loss, resulting in PNP-I with increasing derivative functions and PNP-D with decreasing ones. Because of their gradient assignment strategies, PNP-I tries to bring all the relevant instances together, while PNP-D quickly corrects only those positive instances that have few negative instances before them. Thus, PNP-D may be more suitable for real-world data, which usually contains several local clusters for one class. Extensive evaluations on three standard retrieval datasets also show that PNP-D achieves state-of-the-art performance.
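The quantity that PNP targets, the number of negative instances ranked before each positive one, can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the sigmoid relaxation with a `temperature` parameter is one common way to make such a count differentiable, not something the abstract specifies.

```python
import numpy as np


def negatives_before_positives(scores, is_positive):
    """For each positive item, count the negatives with a strictly higher
    score, i.e. the negatives ranked before it."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    # diff[i, j] = score of negative j minus score of positive i
    diff = scores[~pos][None, :] - scores[pos][:, None]
    return (diff > 0).sum(axis=1)


def smooth_pnp(scores, is_positive, temperature=0.01):
    """Sigmoid-relaxed mean count of negatives before each positive.
    The relaxation and the temperature value are assumptions."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    diff = scores[~pos][None, :] - scores[pos][:, None]
    # Numerically stable sigmoid: sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    soft = 0.5 * (1.0 + np.tanh(0.5 * diff / temperature))
    return float(soft.sum(axis=1).mean())


# Hypothetical ranking: similarity scores and relevance flags for one query.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_flags = [True, False, True, False, True]
toy_counts = negatives_before_positives(toy_scores, toy_flags)
toy_smooth = smooth_pnp(toy_scores, toy_flags)
```

With these toy scores the three positives have 0, 1, and 2 negatives ranked before them, and the relaxed value is close to their mean. PNP-I and PNP-D would then differ in how a per-positive count is turned into a gradient, which this sketch does not model.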

### Apple: Original 45-year-old computer with wooden case set to sell for £1.1 MILLION on eBay

An original Apple computer with a wooden case is up for sale for £1.1 million (\$1.5 million) on eBay -- some 2,250 times more than its original price tag in 1976. The 'Apple-1' was the first product to be developed under the Apple name by company co-founders Steve Jobs and Steve Wozniak and launched in 1976. Around 175 of 200 Apple-1 machines were sold in total, each carrying a price tag of \$666.66 (equivalent to some \$3,126 today). The fully-functional model is being sold by Krishna Blake of the US, who purchased the machine in 1978, and comes with its manuals and a cassette interface. Also included in the sale is a period Sony TV-115, which was the monitor model originally recommended by Mr Jobs to use to display the computer's output.

### Generalized Zero-shot Intent Detection via Commonsense Knowledge

Identifying user intents from natural language utterances is a crucial step in conversational systems that has been extensively studied as a supervised classification problem. However, in practice, new intents emerge after deploying an intent detection model. Thus, these models should seamlessly adapt and classify utterances with both seen and unseen intents -- unseen intents emerge after deployment and they do not have training data. The few existing models that target this setting rely heavily on the scarcely available training data and overfit to seen intents data, resulting in a bias to misclassify utterances with unseen intents into seen ones. We propose RIDE: an intent detection model that leverages commonsense knowledge in an unsupervised fashion to overcome the issue of training data scarcity. RIDE computes robust and generalizable relationship meta-features that capture deep semantic relationships between utterances and intent labels; these features are computed by considering how the concepts in an utterance are linked to those in an intent label via commonsense knowledge. Our extensive experimental analysis on three widely-used intent detection benchmarks shows that relationship meta-features significantly increase the accuracy of detecting both seen and unseen intents and that RIDE outperforms the state-of-the-art model for unseen intents.

### Disambiguation of weak supervision with exponential convergence rates

In many applications of machine learning, such as recommender systems, where an input $x$ characterizing a user should be matched with a target $y$ representing an ordering of a large number of items, accessing fully supervised data $(x, y)$ is not an option. Instead, one should expect weak information on the target $y$, which could be a list of items previously taken (if items are online courses), watched (if items are plays), etc., by a user characterized by the feature vector $x$. This motivates weakly supervised learning, which aims at learning a mapping from inputs to targets in a setting where tools from supervised learning cannot be applied off the shelf. Recent applications of weakly supervised learning showcase impressive results in solving complex tasks such as action retrieval on instructional videos (Miech et al., 2019), image semantic segmentation (Papandreou et al., 2015), salient object detection (Wang et al., 2017), 3D pose estimation (Dabral et al., 2018), and text-to-speech synthesis (Jia et al., 2018), to name a few. However, those applications of weakly supervised learning are usually based on clever heuristics, and theoretical foundations of learning from weakly supervised data are scarce, especially when compared to the statistical learning literature on supervised learning (Vapnik, 1995; Boucheron et al., 2005; Steinwart and Christmann, 2008). We aim to provide a step in this direction. In this paper, we focus on partial labelling, a popular instance of weak supervision, approached from a structured prediction point of view (Ciliberto et al., 2020). We detail this setup in Section 2. Our contributions are organized as follows.