Supervised Learning
Why Do We Need a Validation Set in Addition to Training and Test Sets?
You may already be familiar with training and test sets. This is because you need a separate test set to evaluate your model on unseen data to increase the generalizing capability of the model. We do not test our model on the same data used for training. If we do so, the model will try to memorize data and will not generalize on new unseen data. The validation set is also a part of the original dataset.
Automated Surface Texture Analysis via Discrete Cosine Transform and Discrete Wavelet Transform
Yesilli, Melih C., Chen, Jisheng, Khasawneh, Firas A., Guo, Yang
Surface roughness and texture are critical to the functional performance of engineering components. The ability to analyze roughness and texture effectively and efficiently is much needed to ensure surface quality in many surface generation processes, such as machining, surface mechanical treatment, etc. Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) are two commonly used signal decomposition tools for surface roughness and texture analysis. Both methods require selecting a threshold to decompose a given surface into its three main components: form, waviness, and roughness. However, although DWT and DCT are part of the ISO surface finish standards, there exists no systematic guidance on how to compute these thresholds, and they are often manually selected on case by case basis. This makes utilizing these methods for studying surfaces dependent on the user's judgment and limits their automation potential. Therefore, we present two automatic threshold selection algorithms based on information theory and signal energy. We use machine learning to validate the success of our algorithms both using simulated surfaces as well as digital microscopy images of machined surfaces. Specifically, we generate feature vectors for each surface area or profile and apply supervised classification. Comparing our results with the heuristic threshold selection approach shows good agreement with mean accuracies as high as 95\%. We also compare our results with Gaussian filtering (GF) and show that while GF results for areas can yield slightly higher accuracies, our results outperform GF for surface profiles. We further show that our automatic threshold selection has significant advantages in terms of computational time as evidenced by decreasing the number of mode computations by an order of magnitude compared to the heuristic thresholding for DCT.
Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings
Wang, Dongsheng, Guo, Dandan, Zhao, He, Zheng, Huangjie, Tanwisuth, Korawat, Chen, Bo, Zhou, Mingyuan
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document and hence often suffers from poor performance in analyzing short documents. In addition, its parameter estimation often relies on approximate posterior inference that is either not scalable or suffers from large approximation error. This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space. Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and these of the topics, and optimize the topic embeddings to minimize the expected difference over all documents. Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations.
On a linear fused Gromov-Wasserstein distance for graph structured data
We present a framework for embedding graph structured data into a vector space, taking into account node features and topology of a graph into the optimal transport (OT) problem. Then we propose a novel distance between two graphs, named linearFGW, defined as the Euclidean distance between their embeddings. The advantages of the proposed distance are twofold: 1) it can take into account node feature and structure of graphs for measuring the similarity between graphs in a kernel-based framework, 2) it can be much faster for computing kernel matrix than pairwise OT-based distances, particularly fused Gromov-Wasserstein, making it possible to deal with large-scale data sets. After discussing theoretical properties of linearFGW, we demonstrate experimental results on classification and clustering tasks, showing the effectiveness of the proposed linearFGW.
Resolving label uncertainty with implicit posterior models
We propose a method for jointly inferring labels across a collection of data samples, where each sample consists of an observation and a prior belief about the label. By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs. This formulation unifies various machine learning settings; the weak beliefs can come in the form of noisy or incomplete labels, likelihoods given by a different prediction mechanism on auxiliary input, or common-sense priors reflecting knowledge about the structure of the problem at hand. We demonstrate the proposed algorithms on diverse problems: classification with negative training examples, learning from rankings, weakly and self-supervised aerial imagery segmentation, co-segmentation of video frames, and coarsely supervised text classification.
Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics
Le, Tam, Nguyen, Truyen, Phung, Dinh, Nguyen, Viet Anh
However, evaluating the OT incurs a high computational complexity in Optimal transport (OT) is a popular measure general (Peyrรฉ and Cuturi, 2019) which leads to several to compare probability distributions. However, proposals in the recent literature to address this drawback OT suffers a few drawbacks such as (i) of OT, e.g., approximate using entropic regularization a high complexity for computation, (ii) indefiniteness (Cuturi, 2013), or exploit geometric structure which limits its applicability to of supports (Rabin et al., 2011; Le et al., 2019; Le and kernel machines. In this work, we consider Nguyen, 2021). Among them, tree-Wasserstein (Evans probability measures supported on a graph and Matsen, 2012; Le et al., 2019) (TW) leverages the metric space and propose a novel Sobolev tree structure over supports to obtain a closed-form transport metric. We show that the Sobolev for fast computation. However, the requirement about transport metric yields a closed-form formula tree structure for supports may be restricted in applications.
Minimax in Geodesic Metric Spaces: Sion's Theorem and Algorithms
Zhang, Peiyuan, Zhang, Jingzhao, Sra, Suvrit
Determining whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. We take a step towards understanding certain nonconvex-nonconcave minimax problems that do remain tractable. Specifically, we study minimax problems cast in geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and transparent, as it relies on Helly's theorem only. In our second main result, we specialize to geodesically complete Riemannian manifolds: we devise and analyze the complexity of first-order methods for smooth minimax problems.
Structured Prediction Problem Archive
Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier.
Zhang
In multi-label learning, each object is represented by a single instance while associated with a set of class labels. Due to the huge (exponential) number of possible label sets for prediction, existing approaches mainly focus on how to exploit label correlations to facilitate the learning process. Nevertheless, an intrinsic characteristic of learning from multi-label data, i.e. the widely-existing class-imbalance among labels, has not been well investigated. Generally, the number of positive training instances w.r.t. each class label is far less than its negative counterparts, which may lead to performance degradation for most multi-label learning techniques. In this paper, a new multi-label learning approach named Cross-Coupling Aggregation (COCOA) is proposed, which aims at leveraging the exploitation of label correlations as well as the exploration of class-imbalance. Briefly, to induce the predictive model on each class label, one binary-class imbalance learner corresponding to the current label and several multi-class imbalance learners coupling with other labels are aggregated for prediction. Extensive experiments clearly validate the effectiveness of the proposed approach, especially in terms of imbalance-specific evaluation metrics such as F-measure and area under the ROC curve.
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Cai, Deng, Mansimov, Elman, Lai, Yi-An, Su, Yixuan, Shu, Lei, Zhang, Yi
Recent advance in deep learning has led to rapid adoption of machine learning based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP. First, we measure and analyze model update regression in different model update settings. Next, we explore and benchmark existing techniques for reducing model update regression including model ensemble and knowledge distillation. We further propose a simple and effective method, Backward-Congruent Re-ranking (BCR), by taking into account the characteristics of structured output. Experiments show that BCR can better mitigate model update regression than model ensemble and knowledge distillation approaches.