Supervised Learning
An Exploration of State-of-the-art Methods for Offensive Language Detection
Uglow, Harrison, Zlocha, Martin, Zmyślony, Szymon
We provide a comprehensive investigation of different custom and off-the-shelf architectures as well as different approaches to generating feature vectors for offensive language detection. We also show that these approaches work well on small and noisy datasets such as the Offensive Language Identification Dataset (OLID), so it should be possible to use them for other applications.
Sentiment Analysis on IMDB Movie Comments and Twitter Data by Machine Learning and Vector Space Techniques
Tarımer, İlhan, Çoban, Adil, Kocaman, Arif Emre
This study's goal is to create a sentiment analysis model on 2000 rows of IMDB movie comments and 3200 Twitter data items by using machine learning and vector space techniques; the aim is to provide preliminary information on whether a text is positive or negative. In the study, a vector space was created in the KNIME Analytics platform, and a classification study was performed on this vector space with the Decision Tree, Naive Bayes and Support Vector Machine (SVM) classification algorithms. The results obtained were compared across the algorithms. The classification results for the IMDB movie comments are 94.00%, 73.20%, and 85.50% for the Decision Tree, Naive Bayes and SVM algorithms, respectively. The classification results for the Twitter data set are 82.76%, 75.44% and 72.50% for the Decision Tree, Naive Bayes and SVM algorithms, respectively. It is seen that the best classification results on both data sets are those calculated by the SVM algorithm.
Sentence Similarity in Python using Doc2Vec – Kanoki
Numeric representation of text documents is a challenging task in machine learning, and there are different ways to create numerical features for texts, such as vector representations using Bag of Words, TF-IDF, etc. I am not going into detail about the advantages of one over the other or which is the best one to use in which case. There are a lot of good reads available to explain this. Word2Vec is a model to create word embeddings: it takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions. The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance, "bank", "money" and "accounts" are often used in similar situations, with similar surrounding words like "dollar", "loan" or "credit", and according to Word2Vec they will therefore share a similar vector representation.
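The "similar contexts, similar vectors" assumption can be illustrated without training any model. The sketch below is not Word2Vec or Doc2Vec: it builds raw co-occurrence count vectors from a tiny made-up corpus and compares them with cosine similarity. The corpus, the window size and the function names are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (made up): "bank" and "money" share the context words "loan" and "credit".
corpus = [
    "bank approved loan",
    "bank issued credit",
    "money repaid loan",
    "money covered credit",
    "dog chased ball",
]
vecs = cooccurrence_vectors(corpus)
sim_related = cosine(vecs["bank"], vecs["money"])    # shared contexts
sim_unrelated = cosine(vecs["bank"], vecs["ball"])   # no shared contexts
print(sim_related, sim_unrelated)
```

Words that occur next to the same neighbours end up with a higher cosine similarity than words that never share a neighbour. Learned embeddings compress the same kind of signal into dense vectors.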
Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction
Luise, Giulia, Stamos, Dimitris, Pontil, Massimiliano, Ciliberto, Carlo
We study the interplay between surrogate methods for structured prediction and techniques from multitask learning designed to leverage relationships between surrogate outputs. We propose an efficient algorithm based on trace norm regularization which, differently from previous methods, does not require explicit knowledge of the coding/decoding functions of the surrogate framework. As a result, our algorithm can be applied to the broad class of problems in which the surrogate space is large or even infinite dimensional. We study excess risk bounds for trace norm regularized structured prediction, implying the consistency and learning rates for our estimator. We also identify relevant regimes in which our approach can enjoy better generalization performance than previous methods. Numerical experiments on ranking problems indicate that enforcing low-rank relations among surrogate outputs may indeed provide a significant advantage in practice.
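For readers unfamiliar with the regularizer, the trace norm of a matrix is the sum of its singular values. The second line below is only the generic shape of a trace-norm-regularized empirical risk problem, with a loss $\ell$, a sample size $n$ and a weight $\lambda$ introduced as assumptions for illustration; it is not the authors' estimator, which the abstract does not spell out.

```latex
\|W\|_{*} \;=\; \sum_{i=1}^{\operatorname{rank}(W)} \sigma_i(W),
\qquad
\min_{W} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(W x_i,\, y_i\bigr) \;+\; \lambda \, \|W\|_{*}
```

Because the trace norm is a convex surrogate for the rank, penalizing it encourages low-rank solutions, which is the sense in which low-rank relations among surrogate outputs are enforced.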
Scaling Matters in Deep Structured-Prediction Models
Shevchenko, Aleksandr, Osokin, Anton
Deep structured-prediction energy-based models combine the expressive power of learned representations and the ability of embedding knowledge about the task at hand into the system. A common way to learn parameters of such models consists in a multistage procedure where different combinations of components are trained at different stages. The joint end-to-end training of the whole system is then done as the last fine-tuning stage. This multistage approach is time-consuming and cumbersome as it requires multiple runs until convergence and multiple rounds of hyperparameter tuning. From this point of view, it is beneficial to start the joint training procedure from the beginning. However, such approaches often unexpectedly fail and deliver results worse than the multistage ones. In this paper, we hypothesize that one reason for joint training of deep energy-based models to fail is the incorrect relative normalization of different components in the energy function. We propose online and offline scaling algorithms that fix the joint training and demonstrate their efficacy on three different tasks.
Measuring Compositionality in Representation Learning
Many machine learning algorithms represent input data with vector embeddings or discrete codes. When inputs exhibit compositional structure (e.g. objects built from parts or procedures from subroutines), it is natural to ask whether this compositional structure is reflected in the inputs' learned representations. While the assessment of compositionality in languages has received significant attention in linguistics and adjacent fields, the machine learning literature lacks general-purpose tools for producing graded measurements of compositional structure in more general (e.g. vector-valued) representation spaces. We describe a procedure for evaluating compositionality by measuring how well the true representation-producing model can be approximated by a model that explicitly composes a collection of inferred representational primitives. We use the procedure to provide formal and empirical characterizations of compositional structure in a variety of settings, exploring the relationship between compositionality and learning dynamics, human judgments, representational similarity, and generalization.
Approximating Continuous Functions on Persistence Diagrams Using Template Functions
Perea, Jose A., Munch, Elizabeth, Khasawneh, Firas A.
The persistence diagram is an increasingly useful tool arising from the field of Topological Data Analysis. However, using these diagrams in conjunction with machine learning techniques requires some mathematical finesse. The most success to date has come from finding methods for turning persistence diagrams into vectors in $\mathbb{R}^n$ in a way which preserves as much of the space of persistence diagrams as possible, commonly referred to as featurization. In this paper, we describe a mathematical framework for featurizing the persistence diagram space using template functions. These functions are general as they are only required to be continuous, have a compact support, and separate points. We discuss two example realizations of these functions: tent functions and Chebyshev interpolating polynomials. Both of these functions are defined on a grid superposed on the birth-lifetime plane. We then combine the resulting features with machine learning algorithms to perform supervised classification and regression on several example data sets, including manifold data, shape data, and an embedded time series from a Rössler system. Our results show that the template function approach yields high accuracy rates that match and often exceed the results of existing methods for featurizing persistence diagrams. One counter-intuitive observation is that in most cases using interpolating polynomials, where each point contributes globally to the feature vector, yields significantly better results than using tent functions, where the contribution of each point is localized to its grid cell. Along the way, we also provide a complete characterization of compact sets in persistence diagram space endowed with the bottleneck distance.
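A minimal sketch of tent-function featurization may make the idea concrete. It assumes a tent of the form max(0, 1 - max(|x - a|, |y - b|) / delta) centred at a grid point (a, b) in birth-lifetime coordinates, and a feature equal to the sum of the tent over all diagram points. The abstract does not state this formula, so the exact form, the grid and the toy diagram are assumptions.

```python
def tent(x, y, a, b, delta):
    """Tent centred at (a, b): 1 at the centre, falling linearly to 0 at sup-norm distance delta."""
    return max(0.0, 1.0 - max(abs(x - a), abs(y - b)) / delta)


def tent_features(diagram, centers, delta):
    """One feature per grid centre: the tent summed over all points of the diagram.

    `diagram` is a list of (birth, death) pairs; each is first mapped to (birth, lifetime).
    """
    points = [(birth, death - birth) for birth, death in diagram]
    return [sum(tent(x, y, a, b, delta) for x, y in points) for a, b in centers]


# Toy persistence diagram (made up) and a 3 x 3 grid with lifetimes starting at delta.
diagram = [(0.0, 1.0), (0.5, 1.0), (1.0, 3.0)]
delta = 1.0
centers = [(i * delta, j * delta) for i in range(3) for j in range(1, 4)]

features = tent_features(diagram, centers, delta)
print(features)
```

Each diagram point only affects the centres within distance delta of it, which is the localized behaviour the abstract contrasts with interpolating polynomials. The resulting fixed-length list can be passed to any standard classifier or regressor.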
Classification with unknown class conditional label noise on non-compact feature spaces
We investigate the problem of classification in the presence of unknown class conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. In order to obtain finite sample rates, previous approaches to classification with unknown class conditional label noise have required that the regression function attains its extrema uniformly on sets of positive measure. We shall consider this problem in the setting of non-compact metric spaces, where the regression function need not attain its extrema. In this setting we determine the minimax optimal learning rates (up to logarithmic factors). The rate displays interesting threshold behaviour: When the regression function approaches its extrema at a sufficient rate, the optimal learning rates are of the same order as those obtained in the label-noise free setting. If the regression function approaches its extrema more gradually then classification performance necessarily degrades. In addition, we present an algorithm which attains these rates without prior knowledge of either the distributional parameters or the local density. This identifies for the first time a scenario in which finite sample rates are achievable in the label noise setting, but they differ from the optimal rates without label noise.
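The corruption model itself is easy to simulate. The sketch below flips each binary label with a probability that depends only on its true class; the flip probabilities, sample sizes and function name are illustrative assumptions, and the code shows only the noise process, not the estimator studied in the paper.

```python
import random


def corrupt_labels(labels, flip_prob, rng):
    """Flip each binary label y with probability flip_prob[y] (class-conditional noise)."""
    return [1 - y if rng.random() < flip_prob[y] else y for y in labels]


rng = random.Random(0)                 # fixed seed for reproducibility
clean = [0] * 5000 + [1] * 5000
flip_prob = {0: 0.1, 1: 0.3}           # assumed noise rates, unknown to the learner in the paper's setting

noisy = corrupt_labels(clean, flip_prob, rng)

# Empirical flip rate within each true class.
rate0 = sum(n != c for n, c in zip(noisy[:5000], clean[:5000])) / 5000
rate1 = sum(n != c for n, c in zip(noisy[5000:], clean[5000:])) / 5000
print(rate0, rate1)
```

The learner observes only `noisy` and must recover a good classifier for the clean labels without being told the two flip probabilities.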
A General Theory for Structured Prediction with Smooth Convex Surrogates
Nowak-Vila, Alex, Bach, Francis, Rudi, Alessandro
In this work we provide a theoretical framework for structured prediction that generalizes the existing theory of surrogate methods for binary and multiclass classification based on estimating conditional probabilities with smooth convex surrogates (e.g. logistic regression). The theory relies on a natural characterization of structural properties of the task loss and allows us to derive statistical guarantees for many widely used methods in the context of multilabeling, ranking, ordinal regression and graph matching. In particular, we characterize the smooth convex surrogates compatible with a given task loss in terms of a suitable Bregman divergence composed with a link function. This allows us to derive tight bounds for the calibration function and to obtain novel results on existing surrogate frameworks for structured prediction such as conditional random fields and quadratic surrogates.
Hyperbolic Disk Embeddings for Directed Acyclic Graphs
Suzuki, Ryota, Takahama, Ryusuke, Onoda, Shun
Obtaining continuous representations of structural data such as directed acyclic graphs (DAGs) has gained attention in machine learning and artificial intelligence. However, embedding complex DAGs in which the numbers of both ancestors and descendants of nodes grow exponentially is difficult. To tackle this problem, we develop Disk Embeddings, a framework for embedding DAGs into quasi-metric spaces. Existing state-of-the-art methods, Order Embeddings and Hyperbolic Entailment Cones, are instances of Disk Embeddings in Euclidean space and on spheres, respectively. Furthermore, we propose a novel method, Hyperbolic Disk Embeddings, to handle exponential growth of relations. The results of our experiments show that our Disk Embedding models outperform existing methods, especially on complex DAGs other than trees.
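The basic idea of representing a partial order by disk inclusion can be sketched in the Euclidean plane. The example assumes each node is a disk (center, radius) and that an ancestor's disk contains its descendants' disks, using the test dist(x, y) <= r - s for a disk (x, r) containing a disk (y, s). The direction convention, the node names and the hand-picked coordinates are assumptions; nothing is learned here and no hyperbolic geometry is involved.

```python
import math


def contains(outer, inner):
    """True if the closed disk `outer` = (center, radius) contains the disk `inner`."""
    (x, r), (y, s) = outer, inner
    return math.dist(x, y) <= r - s


# Hand-placed disks for a toy hierarchy (illustrative only).
disks = {
    "animal": ((0.0, 0.0), 4.0),
    "mammal": ((1.0, 0.0), 2.0),
    "dog": ((1.5, 0.0), 1.0),
    "bird": ((-2.0, 0.0), 1.0),
}

print(contains(disks["animal"], disks["mammal"]))  # animal is an ancestor of mammal
print(contains(disks["mammal"], disks["dog"]))     # mammal is an ancestor of dog
print(contains(disks["animal"], disks["dog"]))     # transitivity follows from inclusion
print(contains(disks["mammal"], disks["bird"]))    # unrelated nodes: no inclusion
```

Inclusion is transitive and antisymmetric for distinct disks, so it can encode the reachability relation of a DAG. Fitting the centers and radii to a given graph is the learning problem the paper addresses.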