AITopics | hartigan

Collaborating Authors

hartigan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An Axiomatic Definition of Hierarchical Clustering

Arias-Castro, Ery, Coda, Elizabeth

arXiv.org Machine LearningJul-3-2024

Clustering, informally understood as the grouping of data, is a central task in statistics and computer science with broad applications. Modern clustering algorithms originated in the work of numerical taxonomists, who developed methods to identify hierarchical structures in the classification of plant and animal species. Since then clustering has been used in disciplines such as medicine, astronomy, anthropology, economics, etc., with aims such as exploratory analysis, data summarization, the identification of salient structures in data, and information organization. The notion of a "good" or "accurate" clustering varies between fields and applications. For example, to some computer scientists, the correct clustering of a dataset is often defined as the solution to an optimization problem (think K-means) and a good algorithm either solves or approximates a solution to this problem, ideally with some guarantees [14, 31].

algorithm, axiom, cluster tree, (15 more...)

arXiv.org Machine Learning

2407.03574

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Level Sets or Gradient Lines? A Unifying View of Modal Clustering

Arias-Castro, Ery, Qiao, Wanli

arXiv.org Machine LearningSep-17-2021

Up until the 1970's there were two main ways of clustering points in space. One of them, perhaps pioneered by Pearson [44], was to fit a (usually Gaussian) mixture to the data, and that being done, classify each data point -- as well as any other point available at a later date -- according to the most likely component in the mixture. The other one was based on a direct partitioning of the space, most notably by minimization of the average minimum squared distance to a center: the K-means problem, whose computational difficulty led to a number of famous algorithms [22, 31, 36, 37, 39] and likely played a role in motivating the development of hierarchical clustering [21, 25, 54, 63]. In the 1970's, two decidedly nonparametric approaches to clustering were proposed, both based on the topography given by the population density. Of course, in practice, the density is estimated, often by some form of kernel density estimation.

cluster tree, gradient flow, gradient line, (16 more...)

arXiv.org Machine Learning

2109.08362

Country:

North America > United States > Virginia > Fairfax County > Fairfax (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

CAC: A Clustering Based Framework for Classification

Srivastava, Shivin, Bhatia, Siddharth, Huang, Lingxiao, Heng, Lim Jun, Kawaguchi, Kenji, Rajan, Vaibhav

arXiv.org Artificial IntelligenceFeb-23-2021

In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either are classifier-specific and not generic or independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature despite its importance in several real-world applications. In this paper, we theoretically analyze when and how clustering may help in obtaining accurate classifiers. We design a simple, efficient, and generic framework called Classification Aware Clustering (CAC), to find clusters that are well suited for being used as training datasets by classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.

classification, classifier, dataset, (16 more...)

arXiv.org Artificial Intelligence

2102.11872

Country:

North America > United States > California (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

k-sums: another side of k-means

Zhao, Wan-Lei, Chen, Run-Qing, Ye, Hui, Ngo, Chong-Wah

arXiv.org Machine LearningMay-19-2020

In this paper, the decades-old clustering method k-means is revisited. The original distortion minimization model of k-means is addressed by a pure stochastic minimization procedure. In each step of the iteration, one sample is tentatively reallocated from one cluster to another. It is moved to another cluster as long as the reallocation allows the sample to be closer to the new centroid. This optimization procedure converges faster to a better local minimum over k-means and many of its variants. This fundamental modification over the k-means loop leads to the redefinition of a family of k-means variants. Moreover, a new target function that minimizes the summation of pairwise distances within clusters is presented. We show that it could be solved under the same stochastic optimization procedure. This minimization procedure built upon two minimization models outperforms k-means and its variants considerably with different settings and on different datasets.

artificial intelligence, centroid, machine learning, (19 more...)

arXiv.org Machine Learning

2005.09485

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Fujian Province > Xiamen (0.04)
North America > United States > New York (0.04)
North America > United States > New Jersey > Hudson County > Secaucus (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Energy Clustering

França, Guilherme, Vogelstein, Joshua T.

arXiv.org Machine LearningOct-26-2017

Energy statistics was proposed by Sz\'{e}kely in the 80's inspired by the Newtonian gravitational potential from classical mechanics, and it provides a hypothesis test for equality of distributions. It was further generalized from Euclidean spaces to metric spaces of strong negative type, and more recently, a connection with reproducing kernel Hilbert spaces (RKHS) was established. Here we consider the clustering problem from an energy statistics theory perspective, providing a precise mathematical formulation yielding a quadratically constrained quadratic program (QCQP) in the associated RKHS, thus establishing the connection with kernel methods. We show that this QCQP is equivalent to kernel $k$-means optimization problem once the kernel is fixed. These results imply a first principles derivation of kernel $k$-means from energy statistics. However, energy statistics fixes a family of standard kernels. Furthermore, we also consider a weighted version of energy statistics, making connection to graph partitioning problems. To find local optimizers of such QCQP we propose an iterative algorithm based on Hartigan's method, which in this case has the same computational cost as kernel $k$-means algorithm, based on Lloyd's heuristic, but usually with better clustering quality. We provide carefully designed numerical experiments showing the superiority of the proposed method compared to kernel $k$-means, spectral clustering, standard $k$-means, and Gaussian mixture models in a variety of settings.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1710.09859

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

Nielsen, Frank, Nock, Richard

arXiv.org Machine LearningJun-22-2014

Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow to partially re-seed some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over the Hartigan's heuristic which is the {\em de facto} method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.

artificial intelligence, hartigan, machine learning, (17 more...)

arXiv.org Machine Learning

1406.6314

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Hartigan's K-Means Versus Lloyd's K-Means — Is It Time for a Change?

Slonim, Noam (IBM Haifa Research Lab) | Aharoni, Ehud (IBM Haifa Research Lab) | Crammer, Koby (The Technion Department of Electrical Engineering)

AAAI ConferencesAug-3-2013

Hartigan's method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd's algorithm. E.g., it was recently shown that the set of local minima of Hartigan's algorithm is a subset of those of Lloyd's method. We develop a closed-form expression that allows to establish Hartigan's method for k-means clustering with any Bregman divergence, and further strengthen the case of preferring Hartigan's algorithm over Lloyd's algorithm. Specifically, we characterize a range of problems with various noise levels of the inputs, for which any random partition represents a local minimum for Lloyd's algorithm, while Hartigan's algorithm easily converges to the correct solution. Extensive experiments on synthetic and real-world data further support our theoretical analysis.

hartigan, k-means versus lloyd

AAAI Conferences

Twenty-Third International Joint Conference on Artificial Intelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback