AITopics

1905.04191

Country:

Asia > Middle East > Jordan (0.05)
Asia > China > Beijing > Beijing (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine LearningMay-10-2019

Proportionally Fair Clustering

Chen, Xingyu, Fain, Brandon, Lyu, Charles, Munagala, Kamesh

The data points in machine learning are often real human beings. There is legitimate concern that traditional machine learning algorithms that are blind to this fact may inadvertently exacerbate problems of bias and injustice in society [25]. Motivated by concerns ranging from the granting of bail in the legal system to the quality of recommender systems, researchers have devoted considerable effort to developing fair algorithms for the canonical supervised learning tasks of classification and regression [13, 28, 20, 27, 34, 11, 30, 35, 26, 18, 21]. We extend this work to a canonical problem in unsupervised learning: centroid clustering. In centroid clustering, we want to partition data into k clusters by choosing k "centers" and then matching points to one of the centers.

artificial intelligence, machine learning, objective, (17 more...)

1905.03674

Genre: Research Report (0.64)

Industry: Law (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Khan, Shujaat, Naseem, Imran, Togneri, Roberto, Bennamoun, Mohammed

A Novel Adaptive Kernel for the RBF Neural Networks

Abstract--In this paper, we propose a novel adaptive kernel for the radial basis function (RBF) neural networks. In [12] a novel RBF network with the multi-kernel is proposed to obtain an optimized and I. INTRODUCTION The unknown centres of the multikernels The RBF neural networks have shown excellent performance are determined by an improved k-means clustering in a number of problems of practical interest. An orthogonal least squares (OLS) algorithm is reservoirs of brine are analyzed for physicochemical properties used to determine the remaining parameters. The convergence of the ACA is analyzed by the [3] the RBF kernel is used to predict the pressure gradient Lyapunov criterion. In the context of nuclear physics, RBF Cognitive Radial Basis Function network (McRBFN) and its has been effectively used to model the stopping power data Projection based Learning (PBL) referred to as PBL-McRBFN of materials as in [4].

kernel, neural network, upstream oil & gas, (20 more...)

doi: 10.1007/s00034-016-0375-7

1905.03546

Country:

North America > United States (0.28)
Oceania > Australia (0.15)
Asia > Pakistan (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.69)

Integrating Tensor Similarity to Enhance Clustering Performance

Peng, Hong, Chen, Jiazhou, Wang, Haiyan, Hu, Yu, Cai, Hongmin

Clustering aims to separate observed data into different categories. The performance of popular clustering models relies on the sample-to-sample similarity. However, the pairwise similarity is prone to be corrupted by noise or outliers and thus deteriorates the subsequent clustering. A high-order relationship among samples-to-samples may elaborate the local manifold of the data and thus provide complementary information to guide the clustering. However, few studies have investigated the connection between high-order similarity and usual pairwise similarity. To fill this gap, we first define a high-order tensor similarity to exploit the samples-to-samples affinity relationship. We then establish the connection between tensor similarity and pairwise similarity, proving that the decomposable tensor similarity is the Kronecker product of the usual pairwise similarity and the non-decomposable tensor similarity is generalized to provide complementary information, which pairwise similarity fails to regard. Finally, the high-order tensor similarity and pairwise similarity (IPS2) were integrated collaboratively to enhance clustering performance by enjoying their merits. The proposed IPS2 is shown to perform superior or competitive to state-of-the-art methods on synthetic and real-world datasets. Extensive experiments demonstrated that tensor similarity is capable to boost the performance of the classical clustering method.

artificial intelligence, machine learning, similarity, (17 more...)

1905.0392

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Fujiwara, Takanori, Kwon, Oh-Hyun, Ma, Kwan-Liu

Supporting Analysis of Dimensionality Reduction Results with Contrastive Learning

Dimensionality reduction (DR) is frequently used for analyzing and visualizing high-dimensional data as it provides a good first glance of the data. However, to interpret the DR result for gaining useful insights from the data, it would take additional analysis effort such as identifying clusters and understanding their characteristics. While there are many automatic methods (e.g., density-based clustering methods) to identify clusters, effective methods for understanding a cluster's characteristics are still lacking. A cluster can be mostly characterized by its distribution of feature values. Reviewing the original feature values is not a straightforward task when the number of features is large. To address this challenge, we present a visual analytics method that effectively highlights the essential features of a cluster in a DR result. To extract the essential features, we introduce an enhanced usage of contrastive principal component analysis (cPCA). Our method can calculate each feature's relative contribution to the contrast between one cluster and other clusters. With our cPCA-based method, we have created an interactive system including a scalable visualization of clusters' feature contributions. We demonstrate the effectiveness of our method and system with case studies using several publicly available datasets.

artificial intelligence, dataset, machine learning, (16 more...)

1905.03911

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Tennis (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Wang, Shu, Yabes, Jonathan G., Chang, Chung-Chou H.

A Bayesian Finite Mixture Model with Variable Selection for Data with Mixed-type Variables

Finite mixture model is an important branch of clustering methods and can be applied on data sets with mixed types of variables. However, challenges exist in its applications. First, it typically relies on the EM algorithm which could be sensitive to the choice of initial values. Second, biomarkers subject to limits of detection (LOD) are common to encounter in clinical data, which brings censored variables into finite mixture model. Additionally, researchers are recently getting more interest in variable importance due to the increasing number of variables that become available for clustering. To address these challenges, we propose a Bayesian finite mixture model to simultaneously conduct variable selection, account for biomarker LOD and obtain clustering results. We took a Bayesian approach to obtain parameter estimates and the cluster membership to bypass the limitation of the EM algorithm. To account for LOD, we added one more step in Gibbs sampling to iteratively fill in biomarker values below or above LODs. In addition, we put a spike-and-slab type of prior on each variable to obtain variable importance. Simulations across various scenarios were conducted to examine the performance of this method. Real data application on electronic health records was also conducted.

artificial intelligence, machine learning, variable selection, (18 more...)

1905.0368

Genre: Research Report > Experimental Study (0.48)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.54)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.86)

Bidirectional RNN-based Few-shot Training for Detecting Multi-stage Attack

Zhao, Di, Liu, Jiqiang, Wang, Jialin, Niu, Wenjia, Tong, Endong, Chen, Tong, Li, Gang

"Feint Attack", as a new type of APT attack, has become the focus of attention. It adopts a multi-stage attacks mode which can be concluded as a combination of virtual attacks and real attacks. Under the cover of virtual attacks, real attacks can achieve the real purpose of the attacker, as a result, it often caused huge losses inadvertently. However, to our knowledge, all previous works use common methods such as Causal-Correlation or Cased-based to detect outdated multi-stage attacks. Few attentions have been paid to detect the "Feint Attack", because the difficulty of detection lies in the diversification of the concept of "Feint Attack" and the lack of professional datasets, many detection methods ignore the semantic relationship in the attack. Aiming at the existing challenge, this paper explores a new method to solve the problem. In the attack scenario, the fuzzy clustering method based on attribute similarity is used to mine multi-stage attack chains. Then we use a few-shot deep learning algorithm (SMOTE&CNN-SVM) and bidirectional Recurrent Neural Network model (Bi-RNN) to obtain the "Feint Attack" chains. "Feint Attack" is simulated by the real attack inserted in the normal causal attack chain, and the addition of the real attack destroys the causal relationship of the original attack chain. So we used Bi-RNN coding to obtain the hidden feature of "Feint Attack" chain. In the end, our method achieved the goal to detect the "Feint Attack" accurately by using the LLDoS1.0 and LLDoS2.0 of DARPA2000 and CICIDS2017 of Canadian Institute for Cybersecurity.

artificial intelligence, feint attack, machine learning, (18 more...)

1905.03454

Country: Asia > China (0.16)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Wang, Shu, Yabes, Jonathan G., Chang, Chung-Chou H.

Hybrid Density- and Partition-based Clustering Algorithm for Data with Mixed-type Variables

arXiv.org Machine LearningMay-6-2019

Clustering is an essential technique for discovering patterns in data. The steady increase in amount and complexity of data over the years led to improvements and development of new clustering algorithms. However, algorithms that can cluster data with mixed variable types (continuous and categorical) remain limited, despite the abundance of data with mixed types particularly in the medical field. Among existing methods for mixed data, some posit unverifiable distributional assumptions or that the contributions of different variable types are not well balanced. We propose a two-step hybrid density- and partition-based algorithm (HyDaP) that can detect clusters after variables selection. The first step involves both density-based and partition-based algorithms to identify the data structure formed by continuous variables and recognize the important variables for clustering; the second step involves partition-based algorithm together with a novel dissimilarity measure we designed for mixed data to obtain clustering results. Simulations across various scenarios and data structures were conducted to examine the performance of the HyDaP algorithm compared to commonly used methods. We also applied the HyDaP algorithm on electronic health records to identify sepsis phenotypes.

artificial intelligence, data mining, machine learning, (18 more...)

1905.02257

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
Health & Medicine > Health Care Technology > Medical Record (0.54)
Health & Medicine > Therapeutic Area > Oncology (0.46)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Epasto, Alessandro, Perozzi, Bryan

Is a Single Embedding Enough? Learning Node Representations that Capture Multiple Social Contexts

arXiv.org Machine LearningMay-6-2019

Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph -- a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to $90\%$. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.

artificial intelligence, data mining, machine learning, (17 more...)

doi: 10.1145/3308558.3313660

1905.02138

Country:

North America > United States > New York (0.28)
North America > United States > California > San Francisco County > San Francisco (0.16)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.46)
Information Technology > Services (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

arXiv.org Machine LearningMay-5-2019

Fast communication-efficient spectral clustering over distributed data

Yan, Donghui, Wang, Yingjie, Wang, Jin, Wu, Guodong, Wang, Honggang

The last decades have seen a surge of interests in distributed computing thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume {\it all the data are already in one place}, and divide the data and conquer on multiple machines. However, it is increasingly often that the data are located at a number of distributed sites, and one wishes to compute over all the data with low communication overhead. For spectral clustering, we propose a novel framework that enables its computation over such distributed data, with "minimal" communications while a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. Our approach allows local parallel computing at where the data are located, thus turns the distributed nature of the data into a blessing; the speedup is most substantial when the data are evenly distributed across sites. Experiments on synthetic and large UC Irvine datasets show almost no loss in accuracy with our approach while about 2x speedup under various settings with two distributed sites. As the transmitted data need not be in their original form, our framework readily addresses the privacy concern for data sharing in distributed computing.

data mining, machine learning, spectral, (21 more...)

doi: 10.1109/TBDATA.2019.2907985

1905.01596

Country:

North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
Asia > Middle East > Jordan (0.05)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Communications (1.00)
(4 more...)