AITopics

doi: 10.1145/3148011.3148019

2012.08206

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > Middle East > Jordan (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

arXiv.org Artificial Intelligence Dec-15-2020

Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework

Zhang, Han, Zheng, Wenhao, Chen, Charley, Gao, Kevin, Hu, Yao, Huang, Ling, Xu, Wei

Because collecting labels is prohibitively expensive and time-consuming, unsupervised methods are preferred in applications such as fraud detection. Such applications usually require modeling the intrinsic clusters in high-dimensional data, which typically display heterogeneous statistical patterns because the patterns of different clusters may appear in different dimensions. Existing methods model the data clusters on selected dimensions, yet globally omitting any dimension may damage the pattern of certain clusters. To address these issues, we propose a novel unsupervised generative framework called FIRD, which uses adversarial distributions to fit and disentangle the heterogeneous statistical patterns. When applied to discrete spaces, FIRD effectively distinguishes synchronized fraudsters from normal users. FIRD also outperforms SOTA anomaly detection methods on anomaly detection datasets (over 5% average AUC improvement). The experimental results on various datasets verify that the proposed method better models the heterogeneous statistical patterns in high-dimensional data and benefits downstream applications.

dataset, dimension, fird

doi: 10.1145/3366423.3380213

2012.08153

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology (0.68)
Law Enforcement & Public Safety > Fraud (0.67)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine Learning Dec-15-2020

Spectral Methods for Data Science: A Statistical Perspective

Chen, Yuxin, Chi, Yuejie, Fan, Jianqing, Ma, Cong

Spectral methods have emerged as a simple yet surprisingly effective approach for extracting information from massive, noisy and incomplete data. In a nutshell, spectral methods refer to a collection of algorithms built upon the eigenvalues (resp. singular values) and eigenvectors (resp. singular vectors) of some properly designed matrices constructed from data. A diverse array of applications have been found in machine learning, data science, and signal processing. Due to their simplicity and effectiveness, spectral methods are not only used as a stand-alone estimator, but also frequently employed to initialize other more sophisticated algorithms to improve performance. While the studies of spectral methods can be traced back to classical matrix perturbation theory and methods of moments, the past decade has witnessed tremendous theoretical advances in demystifying their efficacy through the lens of statistical modeling, with the aid of non-asymptotic random matrix theory. This monograph aims to present a systematic, comprehensive, yet accessible introduction to spectral methods from a modern statistical perspective, highlighting their algorithmic implications in diverse large-scale applications. In particular, our exposition gravitates around several central questions that span various applications: how to characterize the sample efficiency of spectral methods in reaching a target level of statistical accuracy, and how to assess their stability in the face of random noise, missing data, and adversarial corruptions? In addition to conventional $\ell_2$ perturbation analysis, we present a systematic $\ell_{\infty}$ and $\ell_{2,\infty}$ perturbation theory for eigenspace and singular subspaces, which has only recently become available owing to a powerful "leave-one-out" analysis framework.
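
As a small illustration of the idea described in this abstract (an estimator built from the eigenvectors of a matrix constructed from data), the following sketch recovers a planted unit vector from a noisy rank-one symmetric matrix. It is not taken from the monograph: the matrix size, signal strength, noise scaling, and all names are assumptions chosen for the example.

```python
import numpy as np

def leading_eigenvector(matrix):
    """Return a unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    # eigh returns eigenvalues in ascending order, so the last column is the leading one.
    _, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, -1]

rng = np.random.default_rng(0)
n = 200                                                  # assumed problem size
signal = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)    # hypothetical planted unit vector
strength = 10.0                                          # assumed signal strength

# Symmetric Gaussian noise, scaled so that its spectral norm stays roughly around 2.
noise = rng.normal(size=(n, n))
noise = (noise + noise.T) / np.sqrt(2 * n)

observed = strength * np.outer(signal, signal) + noise

estimate = leading_eigenvector(observed)
# The eigenvector sign is arbitrary, so compare up to sign.
overlap = abs(estimate @ signal)
print(f"overlap with planted vector: {overlap:.3f}")
```

With the signal well above the noise level, the leading eigenvector should align closely with the planted vector; lowering `strength` toward the noise level is one way to see the estimate degrade.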

classical spectral analysis, stationary distribution, statistical guarantee

2012.08496

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre:

Research Report (0.81)
Instructional Material (0.67)

Industry:

Health & Medicine (1.00)
Information Technology (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.92)

Naumov, Stanislav, Yaroslavtsev, Grigory, Avdiukhin, Dmitrii

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

arXiv.org Machine Learning Dec-15-2020

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
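
To make the notion of an objective-based comparison concrete, here is a minimal sketch that scores a binary tree under a dissimilarity-based objective of the CKMM form, assumed here to be the sum over pairs of d(i, j) times the number of leaves under the pair's lowest common ancestor (higher is better). It is not the B++&C or B2SAT&C algorithm from the abstract; the nested-tuple tree encoding and the toy dissimilarities are assumptions for illustration.

```python
def leaves(tree):
    """List the leaf indices of a tree given as an int (leaf) or a (left, right) tuple."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def ckmm_objective(tree, d):
    """Sum over pairs (i, j) of d[i][j] * (number of leaves under their lowest common ancestor)."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    # Pairs split at this node have this node as their lowest common ancestor.
    size = len(left_leaves) + len(right_leaves)
    cross = sum(d[i][j] for i in left_leaves for j in right_leaves)
    return size * cross + ckmm_objective(left, d) + ckmm_objective(right, d)

# Toy dissimilarities: points {0, 1} are close, points {2, 3} are close, cross pairs are far.
d = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]

good_tree = ((0, 1), (2, 3))   # separates the two groups at the root
bad_tree = ((0, 2), (1, 3))    # mixes the groups

print(ckmm_objective(good_tree, d))  # 164.0
print(ckmm_objective(bad_tree, d))   # 128.0
```

The tree that splits dissimilar points near the root scores higher, which is the behavior this kind of objective is meant to reward.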

algorithm, dataset, objective

2012.08466

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.25)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Sellman, Meinolf, Shah, Tapan

Cost-sensitive Hierarchical Clustering for Dynamic Classifier Selection

arXiv.org Artificial Intelligence Dec-14-2020

We consider the dynamic classifier selection (DCS) problem: Given an ensemble of classifiers, we are to choose which classifier to use depending on the particular input vector that we get to classify. The problem is a special case of the general algorithm selection problem where we have multiple different algorithms we can employ to process a given input. We investigate if a method developed for general algorithm selection named cost-sensitive hierarchical clustering (CSHC) is suited for DCS. We introduce some additions to the original CSHC method for the special case of choosing a classification algorithm and evaluate their impact on performance. We then compare with a number of state-of-the-art dynamic classifier selection methods. Our experimental results show that our modified CSHC algorithm compares favorably

base classifier, classifier, selection

2012.09608

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)

arXiv.org Artificial Intelligence Dec-14-2020

REDAT: Accent-Invariant Representation for End-to-End ASR by Domain Adversarial Training with Relabeling

Hu, Hu, Yang, Xuesong, Raeesy, Zeynab, Guo, Jinxi, Keskin, Gokce, Arsikere, Harish, Rastrow, Ariya, Stolcke, Andreas, Maas, Roland

Accent mismatch is a critical problem for end-to-end ASR. This paper aims to address this problem by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by the proof of equivalence, we introduce reDAT, a novel technique based on DAT, which relabels data using either unsupervised clustering or soft labels. Experiments on 23K hours of multi-accent data show that DAT achieves competitive results over accent-specific baselines on both native and non-native English accents, and up to 13% relative WER reduction on unseen accents; our reDAT yields further relative improvements over DAT of 3% and 8% on non-native accents of American and British English.
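
For readers unfamiliar with the quantity named in this abstract, the sketch below computes the Jensen-Shannon divergence between two discrete distributions. It only illustrates the divergence itself, not the DAT or reDAT training procedure; the function names and the example distributions are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint distribution."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # m is positive wherever p or q is, so the KL terms stay finite
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = js_divergence([0.5, 0.5], [0.5, 0.5])       # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # non-overlapping distributions
print(same, disjoint, np.log(2))
```

The divergence is zero for identical distributions and reaches its maximum of ln 2 for distributions with disjoint support, so driving it toward zero makes the two distributions indistinguishable.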

acoustic, soft label, speech recognition

2012.07353

Country: North America > United States (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Golovkine, Steven, Klutchnikoff, Nicolas, Patilea, Valentin

Clustering multivariate functional data using unsupervised binary trees

arXiv.org Machine Learning Dec-14-2020

Motivated by a large number of applications ranging from sports to the automotive industry and healthcare, there is a great interest in modeling observation entities in the form of a sequence of possibly vector-valued measurements, recorded intermittently at several discrete points in time. Functional data analysis (FDA) considers such data as being values on the realizations of a stochastic process, recorded with some error, at discrete random times. The purpose of FDA is to study such trajectories, also called curves or functions. See, e.g., [37, 49, 21, 54, 19] for some recent references. The amount of such data collected grows rapidly as does the cost of their labeling. Thus, there is an increasing interest in methods that aim to identify homogeneous groups within functional datasets.

algorithm, functional data, multivariate functional data

2012.05973

Country:

Europe > France > Brittany > Ille-et-Vilaine > Rennes (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

#artificialintelligence Dec-12-2020, 17:02:44 GMT

UNSUPERVISED LEARNING

Unsupervised learning applies when only input data is available, with no corresponding output variable. It has many potential applications, ranging from fraud detection to stock trading. Clustering: A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior. Association: An association rule learning problem is one where you want to discover rules that describe a large portion of your data. Association rule mining is used to identify new and interesting relationships between different objects in a set, or frequent patterns in transactional data or any sort of relational database.
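
The association-rule idea can be made concrete with the two standard measures, support and confidence. The sketch below is a minimal illustration on a made-up set of shopping baskets; the data and function names are assumptions, not part of the article.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactional data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
print(bread_milk_support, bread_to_milk_confidence)
```

Here two of the four baskets contain both bread and milk, and two of the three baskets with bread also contain milk, so the rule "bread implies milk" has support 0.5 and confidence 2/3.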

Stephanou, Michael, Varughese, Melvin

Sequential Estimation of Nonparametric Correlation using Hermite Series Estimators

arXiv.org Machine Learning Dec-11-2020

In this article we describe a new Hermite series based sequential estimator for the Spearman's rank correlation coefficient and provide algorithms applicable in both the stationary and non-stationary settings. To treat the non-stationary setting, we introduce a novel, exponentially weighted estimator for the Spearman's rank correlation, which allows the local nonparametric correlation of a bivariate data stream to be tracked. To the best of our knowledge this is the first algorithm to be proposed for estimating a time-varying Spearman's rank correlation that does not rely on a moving window approach. We explore the practical effectiveness of the Hermite series based estimators through real data and simulation studies demonstrating good practical performance. The simulation studies in particular reveal competitive performance compared to an existing algorithm. The potential applications of this work are manifold. The Hermite series based Spearman's rank correlation estimator can be applied to fast and robust online calculation of correlation which may vary over time. Possible machine learning applications include, amongst others, fast feature selection and hierarchical clustering on massive data sets.
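
For context, the quantity being estimated is the Spearman rank correlation, which in the batch setting is the Pearson correlation of the ranks. The sketch below computes that standard batch version; it is not the Hermite series based sequential estimator described in the abstract, it assumes there are no tied values, and all names are illustrative.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman(x, y):
    """Batch Spearman rank correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

x = np.array([0.1, 0.4, 0.2, 0.9, 0.5])
increasing = spearman(x, np.exp(x))    # monotone increasing transform of x
decreasing = spearman(x, -np.exp(x))   # monotone decreasing transform of x
print(increasing, decreasing)
```

Because only ranks matter, any strictly increasing transform of the data gives a correlation of 1 and any strictly decreasing transform gives -1. This batch version needs all data at once, which is the limitation a sequential estimator is meant to avoid.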

estimator, hermite series, spearman

2012.06287

Country:

Africa > South Africa > Western Cape > Cape Town (0.04)
Oceania > Australia > Western Australia > Perth (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)

Genre: Research Report (0.81)

Industry: Banking & Finance (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

#artificialintelligence Dec-10-2020, 04:40:42 GMT

Clustering 101: How to Choose the Right Algorithm for Your Application

Clustering algorithms are perhaps among the first machine learning algorithms anyone encounters during their data science journey. They are well known among data scientists regardless of application scope or research topic. Whenever you're working on a data science project, chances are you will, at some point and to some extent, use various clustering techniques either to prepare your data for further analysis or to gather initial insights from the data. Clustering techniques can be convenient for preparing and organizing unstructured and unclassified data for further analysis. These algorithms are popular because they provide a simple and fast way to perform an initial analysis of the data and gain valuable insights into its nature.

algorithm, application, dataset

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)