AITopics

doi: 10.1109/HPEC.2017.8091045

1708.07481

Country: North America > United States > Colorado (0.28)

Genre: Research Report (0.41)

Industry: Leisure & Entertainment > Sports (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Software > Programming Languages (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

#artificialintelligence Aug-18-2017, 22:55:30 GMT

Clustering and Dimensionality Reduction: Understanding the "Magic" Behind Machine Learning – Blog Imperva

These days we hear about machine learning and artificial intelligence (AI) in all aspects of life. We see machines that learn and imitate the human brain in order to automate human processes. There are autonomous cars that learn the road conditions to drive, personal assistants we can converse with and machines that can predict what stock markets will do. In some respects, it can appear as "magic." Behind machine learning there are some fundamental, well-studied and understood techniques.

algorithm, artificial intelligence, machine learning, (12 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.44)

Data-Driven Tree Transforms and Metrics

Mishne, Gal, Talmon, Ronen, Cohen, Israel, Coifman, Ronald R., Kluger, Yuval

We consider the analysis of high dimensional data given in the form of a matrix with columns consisting of observations and rows consisting of features. Often the data is such that the observations do not reside on a regular grid, and the given order of the features is arbitrary and does not convey a notion of locality. Therefore, traditional transforms and metrics cannot be used for data organization and analysis. In this paper, our goal is to organize the data by defining an appropriate representation and metric such that they respect the smoothness and structure underlying the data. We also aim to generalize the joint clustering of observations and features to the case where the data does not fall into clear disjoint groups. For this purpose, we propose multiscale data-driven transforms and metrics based on trees. Their construction is implemented in an iterative refinement procedure that exploits the co-dependencies between features and observations. Beyond the organization of a single dataset, our approach enables us to transfer the organization learned from one dataset to another and to integrate several datasets together. We present an application to breast cancer gene expression analysis: learning metrics on the genes to cluster the tumor samples into cancer sub-types and validating the joint organization of both the genes and the samples. We demonstrate that using our approach to combine information from multiple gene expression cohorts, acquired by different profiling technologies, improves the clustering of tumor samples.

artificial intelligence, data mining, machine learning, (19 more...)

1708.05768

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Data Science > Data Mining (0.93)

Mukherjee, Soumendu Sundar, Sarkar, Purnamrita, Bickel, Peter J.

Two provably consistent divide and conquer clustering algorithms for large networks

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms which perform clustering on a number of small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly bring down the computational cost of traditional algorithms, including spectral clustering, semi-definite programs, modularity based methods, likelihood based methods etc., without losing accuracy and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Thus, exploiting the facts that most traditional algorithms are accurate and the corresponding optimization problems are much simpler in small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.
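The divide-and-conquer idea in the abstract can be illustrated with a toy sketch. This is not the authors' pair of algorithms: the base clusterer here is plain connected components (a stand-in for spectral clustering or similar), the patching rule (link two nodes if they were co-clustered in more than half of the subgraphs containing both) is an assumption chosen for simplicity, and the names `components` and `divide_and_conquer` are hypothetical.

```python
import random
from itertools import combinations

def components(nodes, edges):
    """Connected components of the subgraph induced by `nodes`.
    Used here as a stand-in base clusterer (an assumption, not the paper's choice)."""
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def divide_and_conquer(n, edges, sub_size, rounds, seed=0):
    """Cluster many small random subgraphs, then patch the local results together."""
    rng = random.Random(seed)
    together = {}  # pair -> number of subgraphs in which the pair was co-clustered
    sampled = {}   # pair -> number of subgraphs containing both nodes
    for _ in range(rounds):
        sub = rng.sample(range(n), sub_size)
        for pair in combinations(sorted(sub), 2):
            sampled[pair] = sampled.get(pair, 0) + 1
        for comp in components(sub, edges):
            for pair in combinations(sorted(comp), 2):
                together[pair] = together.get(pair, 0) + 1
    # Patching step: keep pairs co-clustered in more than half of their joint appearances.
    links = [p for p, c in sampled.items() if together.get(p, 0) / c > 0.5]
    return components(range(n), links)

# Toy network: two disjoint cliques on nodes 0-4 and 5-9.
toy_edges = list(combinations(range(5), 2)) + list(combinations(range(5, 10), 2))
toy_clusters = divide_and_conquer(10, toy_edges, sub_size=6, rounds=200)
```

Each round touches only `sub_size` nodes, so the rounds are independent and could run in parallel, which is the property the abstract highlights.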

data mining, machine learning, subgraph, (18 more...)

1708.05573

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

A probabilistic approach to emission-line galaxy classification

de Souza, R. S., Dantas, M. L. L., Costa-Duarte, M. V., Feigelson, E. D., Killedar, M., Lablanche, P. -Y., Vilalta, R., Krone-Martins, A., Beck, R., Gieseke, F.

We invoke a Gaussian mixture model (GMM) to jointly analyse two traditional emission-line classification schemes of galaxy ionization sources: the Baldwin-Phillips-Terlevich (BPT) and $\rm W_{H\alpha}$ vs. [NII]/H$\alpha$ (WHAN) diagrams, using spectroscopic data from the Sloan Digital Sky Survey Data Release 7 and SEAGal/STARLIGHT datasets. We apply a GMM to empirically define classes of galaxies in a three-dimensional space spanned by the $\log$ [OIII]/H$\beta$, $\log$ [NII]/H$\alpha$, and $\log$ EW(H${\alpha}$) optical parameters. The best-fit GMM based on several statistical criteria suggests a solution around four Gaussian components (GCs), which are capable of explaining up to 97 per cent of the data variance. Using elements of information theory, we compare each GC to its respective astronomical counterpart. GC1 and GC4 are associated with star-forming galaxies, suggesting the need to define a new starburst subgroup. GC2 is associated with BPT's Active Galactic Nuclei (AGN) class and WHAN's weak AGN class. GC3 is associated with BPT's composite class and WHAN's strong AGN class. Conversely, there is no statistical evidence -- based on four GCs -- for the existence of a Seyfert/LINER dichotomy in our sample. Notwithstanding, the inclusion of an additional GC5 unravels it. The GC5 appears associated with the LINER and Passive galaxies on the BPT and WHAN diagrams respectively. Subtleties aside, we demonstrate the potential of our methodology to recover/unravel different objects inside the wilderness of astronomical datasets, without lacking the ability to convey physically interpretable results. The probabilistic classifications from the GMM analysis are publicly available within the COINtoolbox (https://cointoolbox.github.io/GMM_Catalogue/).
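For readers unfamiliar with GMMs, the following is a minimal sketch of how a mixture is fitted by expectation-maximisation and then used for probabilistic classification. It is deliberately simplified to one dimension and two components, whereas the abstract describes a three-dimensional space and four or five components; the function names, the quantile-based initialisation, and the fixed iteration count are all assumptions made for illustration, not the authors' pipeline.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, weights, means, variances):
    """Posterior probability that x belongs to each mixture component."""
    p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(p) or 1.0  # guard against numerical underflow
    return [pj / total for pj in p]

def fit_gmm_1d(data, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    srt = sorted(data)
    # Initialise means at evenly spaced quantiles (an illustrative choice).
    means = [srt[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(data) / n
    var0 = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft assignment of every point to every component.
        resp = [posterior(x, weights, means, variances) for x in data]
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps the density finite
    return weights, means, variances
```

The `posterior` function is what makes the classification probabilistic: each object receives a membership probability per component instead of a hard label.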

artificial intelligence, data mining, machine learning, (20 more...)

doi: 10.1093/mnras/stx2156

1703.07607

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Schlueter-Kuck, Kristy L., Dabiri, John O.

Identification of individual coherent sets associated with flow trajectories using Coherent Structure Coloring

We present a method for identifying the coherent structures associated with individual Lagrangian flow trajectories even where only sparse particle trajectory data is available. The method, based on techniques in spectral graph theory, uses the Coherent Structure Coloring vector and associated eigenvectors to analyze the distance in higher-dimensional eigenspace between a selected reference trajectory and other tracer trajectories in the flow. By analyzing this distance metric in a hierarchical clustering, the coherent structure of which the reference particle is a member can be identified. This algorithm is proven successful in identifying coherent structures of varying complexities in canonical unsteady flows. Additionally, the method is able to assess the relative coherence of the associated structure in comparison to the surrounding flow. Although the method is demonstrated here in the context of fluid flow kinematics, the generality of the approach allows for its potential application to other unsupervised clustering problems in dynamical systems such as neuronal activity, gene expression, or social networks.

artificial intelligence, machine learning, particle, (18 more...)

doi: 10.1063/1.4993862

1708.05757

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Industry:

Government > Regional Government (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Steinwart, Ingo, Sriperumbudur, Bharath K., Thomann, Philipp

Adaptive Clustering Using Kernel Density Estimators

arXiv.org Machine Learning Aug-17-2017

We investigate statistical properties of a clustering algorithm that receives level set estimates from a kernel density estimator and then estimates the first split in the density level cluster tree if such a split is present or detects the absence of such a split. Key aspects of our analysis include finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For the rates and the adaptivity we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature.
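The notion of a split in the density level sets can be made concrete with a small one-dimensional sketch. It assumes a Gaussian kernel and takes the bandwidth `h`, the density `level`, and the connectivity threshold `gap` as hand-picked inputs, whereas the paper's contribution is precisely an adaptive, data-driven choice; the function names are hypothetical and this is not the authors' algorithm.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / norm

def level_set_clusters(data, h, level, gap):
    """Keep the samples whose estimated density is at least `level`,
    then group kept samples whose consecutive spacing is at most `gap`."""
    kept = sorted(d for d in data if kde(d, data, h) >= level)
    clusters = []
    for d in kept:
        if clusters and d - clusters[-1][-1] <= gap:
            clusters[-1].append(d)
        else:
            clusters.append([d])
    return clusters

# Two well-separated groups of samples: the level set splits into two components.
samples = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
split = level_set_clusters(samples, h=0.3, level=0.1, gap=1.0)
```

A result with more than one group indicates a split at that level; a single group indicates that no split was detected.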

algorithm 1, artificial intelligence, machine learning, (16 more...)

1708.05254

Country:

North America > United States (0.93)
Europe (0.67)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Osting, Braxton, Reeb, Todd Harry

Consistency of Dirichlet Partitions

arXiv.org Machine Learning Aug-17-2017

A Dirichlet $k$-partition of a domain $U \subseteq \mathbb{R}^d$ is a collection of $k$ pairwise disjoint open subsets such that the sum of their first Laplace-Dirichlet eigenvalues is minimal. A discrete version of Dirichlet partitions has been posed on graphs with applications in data analysis. Both versions admit variational formulations: solutions are characterized by minimizers of the Dirichlet energy of mappings from $U$ into a singular space $\Sigma_k \subseteq \mathbb{R}^k$. In this paper, we extend results of N. García Trillos and D. Slepčev to show that there exist solutions of the continuum problem arising as limits to solutions of a sequence of discrete problems. Specifically, a sequence of points $\{x_i\}_{i \in \mathbb{N}}$ from $U$ is sampled i.i.d. with respect to a given probability measure $\nu$ on $U$ and for all $n \in \mathbb{N}$, a geometric graph $G_n$ is constructed from the first $n$ points $x_1, x_2, \ldots, x_n$ and the pairwise distances between the points. With probability one with respect to the choice of points $\{x_i\}_{i \in \mathbb{N}}$, we show that as $n \to \infty$ the discrete Dirichlet energies for functions $G_n \to \Sigma_k$ $\Gamma$-converge to (a scalar multiple of) the continuum Dirichlet energy for functions $U \to \Sigma_k$ with respect to a metric coming from the theory of optimal transport. This, along with a compactness property for the aforementioned energies that we prove, implies the convergence of minimizers. When $\nu$ is the uniform distribution, our results also imply the statistical consistency statement that Dirichlet partitions of geometric graphs converge to partitions of the sampled space in the Hausdorff sense.
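The definition in the first sentence of the abstract can be restated as an optimization problem. The symbols $U_1, \ldots, U_k$ for the subsets and $\lambda_1(U_i)$ for the first Laplace-Dirichlet eigenvalue of $U_i$ are notation assumed here for illustration; they do not appear in the abstract.

$$\min_{U_1, \ldots, U_k} \; \sum_{i=1}^{k} \lambda_1(U_i) \quad \text{subject to} \quad U_i \subseteq U \text{ open}, \quad U_i \cap U_j = \emptyset \ \text{for } i \neq j.$$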

artificial intelligence, machine learning, partition, (18 more...)

1708.05472

Country:

North America > United States (0.46)
Europe (0.28)

Genre:

Research Report (0.70)
Instructional Material > Course Syllabus & Notes (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

arXiv.org Machine Learning Aug-17-2017

Towards life cycle identification of malaria parasites using machine learning and Riemannian geometry

Mehrjou, Arash

Malaria is a serious infectious disease that is responsible for over half a million deaths yearly worldwide. The major cause of these mortalities is late or inaccurate diagnosis. Manual microscopy is currently considered the dominant diagnostic method for malaria. However, it is time consuming and prone to human errors. The aim of this paper is to automate the diagnosis process and minimize human intervention. We have developed the hardware and software for a cost-efficient malaria diagnostic system. This paper describes the manufactured hardware and also proposes novel software to handle parasite detection and life-stage identification. A motorized microscope is developed to take images from Giemsa-stained blood smears. A patch-based unsupervised statistical clustering algorithm is proposed which offers a novel method for classification of different regions within blood images. The proposed method provides better robustness against different imaging settings. The core of the proposed algorithm is a model called Mixture of Independent Component Analysis. A manifold based optimization method is proposed that facilitates the application of the model for high dimensional data usually acquired in medical microscopy. The method was tested on 600 blood slides with various imaging conditions. The speed of the method is higher than current supervised systems while its accuracy is comparable to or better than them.

artificial intelligence, machine learning, parasite, (18 more...)

1708.052

Country:

Europe > Germany (0.46)
North America > United States (0.29)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)

@machinelearnbot Aug-16-2017, 00:05:27 GMT

Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

You don't need to know K-NN to understand this article, and you don't need a background in statistical science either. Let's describe this new algorithm and its various components in simple English. We are dealing here with a supervised learning problem, and more specifically, clustering (also called supervised classification). In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking out individual points (the nearest neighbors) and using a majority (voting) rule to assign the new observation to a cluster based on nearest neighbor counts, we check out cliques of points and focus on the nearest cliques rather than on the nearest points.
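As a point of reference, the standard K-NN majority-vote rule that the article sets out to generalize can be sketched as follows. Only the baseline is shown: the excerpt does not specify how cliques are built or compared, so the clique-based variant is not attempted here, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classic K-NN: label the query by majority vote among its k nearest training points.
    `train` is a list of (point, label) pairs, where each point is a tuple of floats."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two labelled groups.
toy_train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
```

Calling `knn_predict(toy_train, (0.05, 0.05))` assigns the new observation to the group whose points dominate its three nearest neighbors.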

artificial intelligence, clique, machine learning, (6 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.43)