Nearest Neighbor Methods
Decouple Non-parametric Knowledge Distillation For End-to-end Speech Translation
Zhang, Hao, Si, Nianwen, Chen, Yaqi, Zhang, Wenlin, Yang, Xukui, Qu, Dan, Li, Zhen
Existing techniques often attempt to make knowledge transfer from a powerful machine translation (MT) to speech translation (ST) model with some elaborate techniques, which often requires transcription as extra input during training. However, transcriptions are not always available, and how to improve the ST model performance without transcription, i.e., data efficiency, has rarely been studied in the literature. In this paper, we propose Decoupled Non-parametric Knowledge Distillation (DNKD) from data perspective to improve the data efficiency. Our method follows the knowledge distillation paradigm. However, instead of obtaining the teacher distribution from a sophisticated MT model, we construct it from a non-parametric datastore via k-Nearest-Neighbor (kNN) retrieval, which removes the dependence on transcription and MT model. Then we decouple the classic knowledge distillation loss into target and non-target distillation to enhance the effect of the knowledge among non-target logits, which is the prominent "dark knowledge". Experiments on MuST-C corpus show that, the proposed method can achieve consistent improvement over the strong baseline without requiring any transcription.
How Sentiment Classification works part1(Machine Learning)
Abstract: With the rapid growth of the use of social media websites, obtaining the users' feedback automatically became a crucial task to evaluate their tendencies and behaviors online. Despite this great availability of information, and the increasing number of Arabic users only few research has managed to treat Arabic dialects. The purpose of this paper is to study the opinion and emotion expressed in real Moroccan texts precisely in the YouTube comments using some well-known and commonly used methods for sentiment analysis. In this paper, we present our work of Moroccan dialect comments classification using Machine Learning (ML) models and based on our collected and manually annotated YouTube Moroccan dialect dataset. By employing many text preprocessing and data representation techniques we aim to compare our classification results utilizing the most commonly used supervised classifiers: k-nearest neighbors (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and deep learning (DL) classifiers such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LTSM).
Clustering-based Imputation for Dropout Buyers in Large-scale Online Experimentation
Shen, Sumin, Mao, Huiying, Zhang, Zezhong, Chen, Zili, Nie, Keyu, Deng, Xinwei
In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process. However, incomplete metrics are frequently occurred in the online experimentation, making the available data to be much fewer than the planned online experiments (e.g., A/B testing). In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a clustering-based imputation method using $k$-nearest neighbors. Our proposed imputation method considers both the experiment-specific features and users' activities along their shopping paths, allowing different imputation values for different users. To facilitate efficient imputation of large-scale data sets in online experimentation, the proposed method uses a combination of stratification and clustering. The performance of the proposed method is compared to several conventional methods in both simulation studies and a real online experiment at eBay.
Neural Net and Traditional Classifiers
Previous work on nets with continuous-valued inputs led to generative procedures to construct convex decision regions with two-layer perceptrons (one hidden layer) and arbitrary decision regions with three-layer perceptrons (two hidden layers). Here we demonstrate that two-layer perceptron classifiers trained with back propagation can form both convex and disjoint decision regions. Such classifiers are robust, train rapidly, and provide good performance with simple decision regions. When complex decision regions are required, however, convergence time can be excessively long and performance is often no better than that of k-nearest neighbor classifiers. Three neural net classifiers are presented that provide more rapid training under such situations. Two use fixed weights in the first one or two layers and are similar to classifiers that estimate probability density functions using histograms. A third "feature map classifier" uses both unsupervised and supervised training. It provides good performance with little supervised training in situations such as speech recognition where much unlabeled training data is available. The architecture of this classifier can be used to implement a neural net k-nearest neighbor classifier.
Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems
Eight neural net and conventional pattern classifiers (Bayesian(cid:173) unimodal Gaussian, k-nearest neighbor, standard back-propagation, adaptive-stepsize back-propagation, hypersphere, feature-map, learn(cid:173) ing vector quantizer, and binary decision tree) were implemented on a serial computer and compared using two speech recognition and two artificial tasks. Error rates were statistically equivalent on almost all tasks, but classifiers differed by orders of magnitude in memory requirements, training time, classification time, and ease of adaptivity. Nearest-neighbor classifiers trained rapidly but re(cid:173) quired the most memory. Tree classifiers provided rapid classifica(cid:173) tion but were complex to adapt. Back-propagation classifiers typ(cid:173) ically required long training times and had intermediate memory requirements.
Asymptotic slowing down of the nearest-neighbor classifier
Although the value of the coefficient a depends upon the underlying probability distributions, the exponent of M is largely distri(cid:173) bution free. We thus obtain a concise relation between a classifier's ability to generalize from a finite reference sample and the dimensionality of the feature space, as well as an analytic validation of Bellman's well known "curse of dimensionality."
A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers
Seven different pattern classifiers were implemented on a serial computer and compared using artificial and speech recognition tasks. Two neural network (radial basis function and high order polynomial GMDH network) and five conventional classifiers (Gaussian mixture, linear tree, K nearest neighbor, KD-tree, and condensed K nearest neighbor) were evaluated. Classifiers were chosen to be representative of different approaches to pat(cid:173) tern classification and to complement and extend those evaluated in a previous study (Lee and Lippmann, 1989). This and the previous study both demonstrate that classification error rates can be equivalent across different classifiers when they are powerful enough to form minimum er(cid:173) ror decision regions, when they are properly tuned, and when sufficient training data is available. Practical characteristics such as training time, classification time, and memory requirements, however, can differ by or(cid:173) ders of magnitude.
Using Genetic Algorithms to Improve Pattern Classification Performance
Genetic algorithms were used to select and create features and to select reference exemplar patterns for machine vision and speech pattern classi(cid:173) fication tasks. For a complex speech recognition task, genetic algorithms required no more computation time than traditional approaches to feature selection but reduced the number of input features required by a factor of five (from 153 to 33 features). On a difficult artificial machine-vision task, genetic algorithms were able to create new features (polynomial functions of the original features) which reduced classification error rates from 19% to almost 0%. Neural net and k nearest neighbor (KNN) classifiers were unable to provide such low error rates using only the original features. Ge(cid:173) netic algorithms were also used to reduce the number of reference exemplar patterns for a KNN classifier.
Efficient Pattern Recognition Using a New Transformation Distance
Memory-based classification algorithms such as radial basis func(cid:173) tions or K-nearest neighbors typically rely on simple distances (Eu(cid:173) clidean, dot product ...), which are not particularly meaningful on pattern vectors. More complex, better suited distance measures are often expensive and rather ad-hoc (elastic matching, deformable templates). We propose a new distance measure which (a) can be made locally invariant to any set of transformations of the input and (b) can be computed efficiently. We tested the method on large handwritten character databases provided by the Post Office and the NIST. Using invariances with respect to translation, rota(cid:173) tion, scaling, shearing and line thickness, the method consistently outperformed all other systems tested on the same databases.
Locally Adaptive Nearest Neighbor Algorithms
Four versions of a k-nearest neighbor algorithm with locally adap(cid:173) tive k are introduced and compared to the basic k-nearest neigh(cid:173) bor algorithm (kNN). Locally adaptive kNN algorithms choose the value of k that should be used to classify a query by consulting the results of cross-validation computations in the local neighborhood of the query. Local kNN methods are shown to perform similar to kNN in experiments with twelve commonly used data sets. Encour(cid:173) aging results in three constructed tasks show that local methods can significantly outperform kNN in specific applications. Local methods can be recommended for on-line learning and for appli(cid:173) cations where different regions of the input space are covered by patterns solving different sub-tasks.