Clustering
Convex Relaxation Methods for Community Detection
Li, Xiaodong, Chen, Yudong, Xu, Jiaming
This paper surveys recent theoretical advances in convex optimization approaches for community detection. We introduce some important theoretical techniques and results for establishing the consistency of convex community detection under various statistical models. In particular, we discuss the basic techniques based on the primal and dual analysis. We also present results that demonstrate several distinctive advantages of convex community detection, including robustness against outlier nodes, consistency under weak assortativity, and adaptivity to heterogeneous degrees. This survey is not intended to be a complete overview of the vast literature on this fast-growing topic. Instead, we aim to provide a big picture of the remarkable recent development in this area and to make the survey accessible to a broad audience. We hope that this expository article can serve as an introductory guide for readers who are interested in using, designing, and analyzing convex relaxation methods in network analysis.
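As a point of reference, one commonly studied semidefinite relaxation for community detection can be written as below. This is a sketch under stated assumptions, not necessarily the exact program analyzed in the survey: $A$ is assumed to be the adjacency matrix of an $n$-node graph, $J$ the all-ones matrix, $\lambda$ a tuning parameter, and $X$ a relaxed cluster-membership matrix whose entry $X_{ij}$ indicates whether nodes $i$ and $j$ share a community.

```latex
\begin{aligned}
\max_{X \in \mathbb{R}^{n \times n}} \quad & \langle A - \lambda J,\, X \rangle \\
\text{subject to} \quad & X \succeq 0, \\
& 0 \le X_{ij} \le 1 \quad \text{for all } i, j, \\
& X_{ii} = 1 \quad \text{for all } i.
\end{aligned}
```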
Instant Data Science
If you have skimmed through the high tech job listings lately, you've surely noticed the demand for data scientists. Although there is no consensus on the exact job definition, many companies want data experts and are willing to pay a high price for the right candidates. This article explores why good data scientists are hard to find, and explains how the shortage may soon be alleviated thanks to… data science! Let's start with a working definition: A data scientist is someone who knows how to extract insights from data. The insights could take the form of descriptive analytics for past events, predictions of future events, or persuasive analytics – a plan that can persuade people to exhibit a desired behavior such as buying your product.
How can Artificial Intelligence support your Big Data architecture? Packt Hub
Getting a big data project in place is a tough challenge. But making it deliver results is even harder. That's where artificial intelligence comes in. By integrating artificial intelligence into your big data architecture, you'll be able to better manage and analyze data in a way that has a substantial impact on your organization. With big data getting even bigger over the next couple of years, AI won't simply be an optional extra; it will be essential.
Causal Inference and Mechanism Clustering of a Mixture of Additive Noise Models
Hu, Shoubo, Chen, Zhitang, Nia, Vahid Partovi, Chan, Laiwan, Geng, Yanhui
The inference of the causal relationship between a pair of observed variables is a fundamental problem in science, and most existing approaches are based on a single causal model. In practice, however, observations are often collected from multiple sources with heterogeneous causal models due to certain uncontrollable factors, which renders causal analysis results obtained by a single model questionable. In this paper, we generalize the Additive Noise Model (ANM) to a mixture model, which consists of a finite number of ANMs, and provide the condition for its causal identifiability. To conduct model estimation, we propose the Gaussian Process Partially Observable Model (GPPOM), and incorporate independence enforcement into it to learn the latent parameter associated with each observation. Causal inference and clustering according to the underlying generating mechanisms of the mixture model are addressed in this work. Experiments on synthetic and real data demonstrate the effectiveness of our proposed approach.
Non-linear Attributed Graph Clustering by Symmetric NMF with PU Learning
Maekawa, Seiji, Takeuchi, Koh, Onizuka, Makoto
We consider the clustering problem of attributed graphs. The challenge is to design an effective and efficient clustering method that precisely captures the hidden relationship between the topology and the attributes in real-world graphs. We propose Non-linear Attributed Graph Clustering by Symmetric Non-negative Matrix Factorization with Positive Unlabeled Learning. Our method has three main features: 1) it learns a non-linear projection function between the different cluster assignments of the topology and the attributes of graphs so as to capture the complicated relationship between the topology and the attributes in real-world graphs, 2) it leverages positive unlabeled learning to account for the effect of partially observed positive edges on the cluster assignment, and 3) it achieves efficient computational complexity, $O((n^2+mn)kt)$, where $n$ is the number of vertices, $m$ is the number of attributes, $k$ is the number of clusters, and $t$ is the number of iterations for learning the cluster assignment. We conducted extensive experiments against various clustering methods on various real datasets to validate that our method outperforms existing clustering methods in clustering quality.
Using Eigencentrality to Estimate Joint, Conditional and Marginal Probabilities from Mixed-Variable Data: Method and Applications
The ability to estimate joint, conditional and marginal probability distributions over some set of variables is of great utility for many common machine learning tasks. However, estimating these distributions can be challenging, particularly in the case of data containing a mix of discrete and continuous variables. This paper presents a nonparametric method for estimating these distributions directly from a dataset. The data are first represented as a graph consisting of object nodes and attribute value nodes. Depending on the distribution to be estimated, an appropriate eigenvector equation is then constructed. This equation is then solved to find the corresponding stationary distribution of the graph, from which the required distributions can then be estimated and sampled from. The paper demonstrates how the method can be applied to many common machine learning tasks including classification, regression, missing value imputation, outlier detection, random vector generation, and clustering. Being able to estimate joint, conditional and marginal probabilities from some dataset allows a broad range of useful tasks to be performed. For example, classification and regression involve predicting the value of some target variable conditional on the values of the other variables. If we can sample values from the estimated distributions, we could perform random vector generation by generating full random vectors that display the same correlations as the vectors (i.e., data points) in the original data [4], [5]. If we can estimate the joint distribution for the full dataset, then we should also be able to do this for subsets of data, leading to the use of Expectation-Maximization [6] to cluster the data [7]. Taken together, these activities form a large chunk of the tasks commonly used in machine learning.
All of this depends, of course, on being able to estimate the various probabilities, and this is particularly challenging on datasets containing a complex mix of continuous and discrete variables.
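The idea of representing data as a graph of object nodes and attribute value nodes and then solving for a stationary distribution can be illustrated with a small power iteration. This is a minimal sketch, not the paper's eigenvector equation: the toy graph, the function names, and the `laziness` parameter (added so the walk converges on a bipartite graph) are all assumptions made for illustration.

```python
def build_adjacency(n_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adjacency = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u][v] = 1.0
        adjacency[v][u] = 1.0
    return adjacency


def stationary_distribution(adjacency, laziness=0.5, tol=1e-12, max_iter=10_000):
    """Power iteration for the stationary distribution of a lazy random walk.

    With probability `laziness` the walker stays put; otherwise it moves to a
    neighbor chosen in proportion to edge weight. Staying put sometimes avoids
    the oscillation a plain walk shows on bipartite graphs.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    if any(d == 0 for d in degrees):
        raise ValueError("every node needs at least one edge")
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the plain random walk: pi * P.
        stepped = [
            sum(pi[i] * adjacency[i][j] / degrees[i] for i in range(n))
            for j in range(n)
        ]
        new_pi = [laziness * p + (1.0 - laziness) * s for p, s in zip(pi, stepped)]
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical toy data: nodes 0-2 are objects, 3 = "color=red",
# 4 = "color=blue", 5 = "size=small". An edge links an object to each
# attribute value it has.
EDGES = [(0, 3), (1, 3), (2, 4), (1, 5), (2, 5)]
PI = stationary_distribution(build_adjacency(6, EDGES))
```

For a connected undirected graph such as this one, the stationary probability of each node is its degree divided by the total degree, which gives a simple check on the result.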
Scientists determine four personality types based on new data
The new study, led by Luís Amaral of the McCormick School of Engineering, will be published Sept. 17 by the journal Nature Human Behaviour. The findings potentially could be of interest to hiring managers and mental health care providers. "People have tried to classify personality types since Hippocrates' time, but previous scientific literature has found that to be nonsense," said co-author William Revelle, professor of psychology in the Weinberg College of Arts and Sciences. "Now, these data show there are higher densities of certain personality types," said Revelle, who specializes in personality measurement, theory and research. Initially, however, Revelle was skeptical of the study's premise. The concept of personality types remains controversial in psychology, with hard scientific proof difficult to find. Previous attempts based on small research groups created results that often were not replicable. "Personality types only existed in self-help literature and did not have a place in scientific journals," said Amaral, the Erastus Otis Haven Professor of Chemical and Biological Engineering at Northwestern Engineering. "Now, we think this will change because of this study." The new research combined an alternative computational approach with data from four questionnaires, with more than 1.5 million respondents from around the world, obtained from John Johnson's IPIP-NEO (with 120 and 300 items, respectively), the myPersonality project, and the BBC Big Personality Test datasets. The questionnaires, developed by the research community over the decades, have between 44 and 300 questions. People voluntarily take the online quizzes, attracted by the opportunity to receive feedback about their own personality.
These data are now being made available to other researchers for independent analyses. "The thing that is really, really cool is that a study with a dataset this large would not have been possible before the web," Amaral said. "Previously, maybe researchers would recruit undergrads on campus, and maybe get a few hundred people.
Data-Driven Clustering via Parameterized Lloyd's Families
Balcan, Maria-Florina, Dick, Travis, White, Colin
Clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means clustering, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a "typical instance" may vary depending on the application, and different clustering heuristics perform differently depending on the instance. In this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the initialization procedure, and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.
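One natural way a single initialization parameter could interpolate between k-means++ and farthest-first traversal is to sample each new center with probability proportional to its distance to the nearest chosen center raised to a power. The sketch below assumes that form; the name `d_alpha_seeding` and the parameter `alpha` are illustrative and not taken from the abstract. Under this assumption, `alpha = 2` gives k-means++ seeding, `alpha = 0` gives uniform seeding, and `alpha = inf` gives farthest-first traversal.

```python
import math
import random


def d_alpha_seeding(points, k, alpha, rng=None):
    """Pick k initial centers by sampling proportionally to distance**alpha.

    `points` is a list of coordinate tuples. The first center is drawn
    uniformly; each later center is drawn with weight d(p)**alpha, where d(p)
    is the distance from p to its nearest already-chosen center.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    indices = range(len(points))
    while len(centers) < k:
        dists = [min(math.dist(p, c) for c in centers) for p in points]
        if math.isinf(alpha):
            # Limit case: deterministically take the farthest point.
            next_index = max(indices, key=dists.__getitem__)
        else:
            # Note: with alpha == 0 every point has weight 1, so an existing
            # center may be drawn again (uniform sampling with replacement).
            weights = [d ** alpha for d in dists]
            if sum(weights) == 0:
                # Every point coincides with a center; fall back to uniform.
                next_index = rng.randrange(len(points))
            else:
                next_index = rng.choices(indices, weights=weights, k=1)[0]
        centers.append(points[next_index])
    return centers


POINTS = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0)]
FARTHEST = d_alpha_seeding(POINTS, k=2, alpha=math.inf, rng=random.Random(1))
KMEANSPP = d_alpha_seeding(POINTS, k=3, alpha=2.0, rng=random.Random(1))
```

The chosen centers would then be handed to the usual Lloyd's iterations; the second parameter mentioned in the abstract, which controls local search, is not modeled here.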
Scientists determine four personality types based on new data
IMAGE: The four newly determined personality types are based on five widely-recognized character traits. The findings challenge existing paradigms in psychology. The new study, led by Luís Amaral of the McCormick School of Engineering, will be published Sept. 17 by the journal Nature Human Behaviour. The findings potentially could be of interest to hiring managers and mental health care providers. "People have tried to classify personality types since Hippocrates' time, but previous scientific literature has found that to be nonsense," said co-author William Revelle, professor of psychology in the Weinberg College of Arts and Sciences.
ADBSCAN: Adaptive Density-Based Spatial Clustering of Applications with Noise for Identifying Clusters with Varying Densities
Khan, Mohammad Mahmudur Rahman, Siddique, Md. Abu Bakr, Arif, Rezoana Bente, Oishe, Mahjabin Rahman
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm that performs well on datasets whose clusters have a constant density of data points. One of the significant attributes of this algorithm is noise cancellation. However, DBSCAN performs worse on clusters with different densities. Therefore, in this paper, an adaptive DBSCAN is proposed that works well for identifying clusters with varying densities. Keywords: Data mining, Clustering algorithms, Adaptive DBSCAN, Spatial clustering, Density-based methods, Eps, MinPts, Core point, Border point, Eps-neighborhood, density connected.
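For context, the baseline that the adaptive variant builds on can be sketched in a few lines. This is a minimal version of standard DBSCAN, not the adaptive algorithm proposed in the paper; `eps` and `min_pts` correspond to the Eps and MinPts keywords, and the brute-force neighbor search is an assumption made to keep the example short.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Return indices of all points within eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


POINTS = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # first dense group
    (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
    (50, 50),                                # isolated point
]
LABELS = dbscan(POINTS, eps=1.5, min_pts=3)
```

Because a single `eps` and `min_pts` pair applies to the whole dataset, clusters that are much sparser than others can be split up or labeled as noise, which is the limitation the abstract describes.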