# Supervised Learning

### How do machine learning professionals use structured prediction?

Justin Stoltzfus is a freelance writer for various Web and print publications. His work has appeared in online magazines including Preservation Online, a project of the National Historic Trust, and many other venues.

### Talk to Me: Nvidia Claims NLP Inference, Training Records

Nvidia says it's achieved significant advances in conversation natural language processing (NLP) training and inference, enabling more complex, immediate-response interchanges between customers and chatbots. And the company says it has a new language training model in the works that dwarfs existing ones. Nvidia said its DGX-2 AI platform trained the BERT-Large AI language model in less than an hour and performed AI inference in 2 milliseconds making "it possible for developers to use state-of-the-art language understanding for large-scale applications…." Training: Running the largest version of Bidirectional Encoder Representations from Transformers (BERT-Large) language model, an Nvidia DGX SuperPOD with 92 Nvidia DGX-2H systems running 1,472 V100 GPUs cut training from several days to 53 minutes. A single DGX-2 system trained BERT-Large in 2.8 days.

### Nonparametric Contextual Bandits in an Unknown Metric Space

Consider a nonparametric contextual multi-arm bandit problem where each arm $a \in [K]$ is associated to a nonparametric reward function $f_a: [0,1] \to \mathbb{R}$ mapping from contexts to the expected reward. Suppose that there is a large set of arms, yet there is a simple but unknown structure amongst the arm reward functions, e.g. finite types or smooth with respect to an unknown metric space. We present a novel algorithm which learns data-driven similarities amongst the arms, in order to implement adaptive partitioning of the context-arm space for more efficient learning. We provide regret bounds along with simulations that highlight the algorithm's dependence on the local geometry of the reward functions.

### Hottest day records set across Europe this year will soon be broken

If you suffered during the recent record-smashing heatwave across Europe, there's bad news. Such heatwaves are the new normal for countries like the UK, and we can expect even more extreme ones in the next few years, say climate scientists who have studied the event. "People say, oh this is historic and this will make history," says Friederike Otto of Oxford University in the UK. "It will probably not make history because we should expect that these records will be broken in the next few years."

### A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors

This paper introduces a la carte embed-ding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transfor-mation that is efficiently learnable using pretrained word vectors and linear regression. This transform is applicable on the fly in the future when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the a la carte method requires fewer examples of words in con-text to learn high-quality embeddings and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks.

### Cloud TPU Pods break AI training records Google Cloud Blog

Google Cloud's AI-optimized infrastructure makes it possible for businesses to train state-of-the-art machine learning models faster, at greater scale, and at lower cost. These advantages enabled Google Cloud Platform (GCP) to set three new performance records in the latest round of the MLPerf benchmark competition, the industry-wide standard for measuring ML performance. All three record-setting results ran on Cloud TPU v3 Pods, the latest generation of supercomputers that Google has built specifically for machine learning. These results showcased the speed of Cloud TPU Pods-- with each of the winning runs using less than two minutes of compute time. With these latest MLPerf benchmark results, Google Cloud is the first public cloud provider to outperform on-premise systems when running large-scale, industry-standard ML training workloads of Transformer, Single Shot Detector (SSD), and ResNet-50.

### Universal Bayes consistency in metric spaces

We show that a recently proposed 1-nearest-neighbor-based multiclass learning algorithm is universally strongly Bayes consistent in all metric spaces where such Bayes consistency is possible, making it an optimistically universal Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, $k$-NN and its variants are not generally universally Bayes consistent, except under additional structural assumptions, such as an inner product, a norm, finite doubling dimension, or a Besicovitch-type property. The metric spaces in which universal Bayes consistency is possible are the essentially separable ones --- a new notion that we define, which is more general than standard separability. The existence of metric spaces that are not essentially separable is independent of the ZFC axioms of set theory. We prove that essential separability exactly characterizes the existence of a universal Bayes-consistent learner for the given metric space. In particular, this yields the first impossibility result for universal Bayes consistency. Taken together, these positive and negative results resolve the open problems posed in Kontorovich, Sabato, Weiss (2017).

### Persistent homology detects curvature

In topological data analysis, persistent homology is used to study the "shape of data". Persistent homology computations are completely characterized by a set of intervals called a bar code. It is often said that the long intervals represent the "topological signal" and the short intervals represent "noise". We give evidence to dispute this thesis, showing that the short intervals encode geometric information. Specifically, we prove that persistent homology detects the curvature of disks from which points have been sampled. We describe a general computational framework for solving inverse problems using the average persistence landscape, a continuous mapping from metric spaces with a probability measure to a Hilbert space. In the present application, the average persistence landscapes of points sampled from disks of constant curvature results in a path in this Hilbert space which may be learned using standard tools from statistical and machine learning.

### Structured Output Learning with Conditional Generative Flows

Traditional structured prediction models try to learn the conditional likelihood, i.e., p(y x), to capture the relationship between the structured output y and the input features x. For many models, computing the likelihood is intractable. These models are therefore hard to train, requiring the use of surrogate objectives or variational inference to approximate likelihood. In this paper, we propose conditional Glow (c-Glow), a conditional generative flow for structured output learning. C-Glow benefits from the ability of flow-based models to compute p(y x) exactly and efficiently. Learning with c-Glow does not require a surrogate objective or performing inference during training. Once trained, we can directly and efficiently generate conditional samples to do structured prediction. We evaluate this approach on different structured prediction tasks and find c-Glow's structured outputs comparable in quality with state-of-the-art deep structured prediction approaches.

### Online Learning to Rank with Features

We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.