AITopics

Country:

Asia > China > Guangdong Province > Guangzhou (0.05)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Aditya Bhaskara, Sharvaree Vadgama, Hong Xu

Greedy Sampling for Approximate Clustering in the Presence of Outliers

Neural Information Processing Systems, Oct-3-2025, 00:07:58 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (19 more...)

Country: North America > United States (0.69)

Technology:

Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.69)

Neural Information Processing Systems, Oct-2-2025, 04:02:54 GMT

Learning Representations for Time Series Clustering

Qianli Ma, Jiawei Zheng, Sen Li, Gary W. Cottrell


artificial intelligence, data mining, machine learning, (15 more...)

Country:

North America > United States (0.28)
Asia > China > Guangdong Province (0.14)

Genre: Research Report (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Jan-25-2025, 02:17:31 GMT

Reviews: Coresets for Archetypal Analysis

This paper looks at the problem of archetypal analysis, which is effectively a low-rank representation of the data that is perhaps more interpretable. Instead of finding a low-rank subspace to represent the data, we represent each data point by its projection onto the convex hull of k points, where the k points themselves are convex combinations of the original data. The authors present a sampling-based method to create coresets for this problem. The main intuition is that the objective function is close to (in fact upper bounded by) a k-means objective; only the "query set" has changed. Given the strong coreset guarantee, restricting the query set means that the existing guarantees carry over.

archetypal analysis, coreset, k-means objective, (4 more...)

Technology: Information Technology > Artificial Intelligence (0.45)

Bamas, Etienne, Nagarajan, Sai Ganesh, Svensson, Ola

An Analysis of $D^\alpha$ seeding for $k$-means

arXiv.org Artificial Intelligence, Oct-20-2023

One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the ($k$,$\alpha$)-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e. the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $$ O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot (\min\{\ell,\log k\})^{2/\alpha}\right)$$ with respect to the standard $k$-means cost of any underlying clustering; where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviation of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results by some lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
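
For readers unfamiliar with the seeding step, here is a minimal sketch of $D^\alpha$ sampling in plain Python. It is an illustration, not the authors' code: the function names (`d_alpha_seeding`, `kmeans_cost`), the tuple-based point representation, and the uniform fallback when all remaining weights are zero are assumptions made for this example. Each new center is drawn with probability proportional to the Euclidean distance to the nearest chosen center raised to the power `alpha`, and the result is scored with the standard $(k,2)$-means cost.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_alpha_seeding(points, k, alpha=2.0, rng=None):
    """Pick k centers by D^alpha sampling (alpha=2 gives k-means++ seeding)."""
    rng = rng or random.Random()
    # First center: uniform over the data.
    centers = [rng.choice(points)]
    # nearest[i] = Euclidean distance from points[i] to its closest center so far.
    nearest = [math.sqrt(sq_dist(p, centers[0])) for p in points]
    while len(centers) < k:
        weights = [d ** alpha for d in nearest]
        if sum(weights) == 0:
            # Assumption for this sketch: if every point coincides with a
            # chosen center, fall back to a uniform draw.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
        nearest = [min(d, math.sqrt(sq_dist(p, nxt)))
                   for p, d in zip(points, nearest)]
    return centers


def kmeans_cost(points, centers):
    """Standard (k,2)-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Calling `d_alpha_seeding(points, k, alpha=4.0)` and comparing `kmeans_cost` against the `alpha=2.0` result on the same data is one simple way to explore the effect the abstract describes.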

algorithm, gaussian, potential function, (15 more...)

2310.13474

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California > Orange County > Irvine (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Barwey, Shivam, Raman, Venkat

Jacobian-Scaled K-means Clustering for Physics-Informed Segmentation of Reacting Flows

arXiv.org Artificial Intelligence, May-2-2023

This work introduces Jacobian-scaled K-means (JSK-means) clustering, which is a physics-informed clustering strategy centered on the K-means framework. The method allows for the injection of underlying physical knowledge into the clustering procedure through a distance function modification: instead of leveraging conventional Euclidean distance vectors, the JSK-means procedure operates on distance vectors scaled by matrices obtained from dynamical system Jacobians evaluated at the cluster centroids. The goal of this work is to show how the JSK-means algorithm -- without modifying the input dataset -- produces clusters that capture regions of dynamical similarity, in that the clusters are redistributed towards high-sensitivity regions in phase space and are described by similarity in the source terms of samples instead of the samples themselves. The algorithm is demonstrated on a complex reacting flow simulation dataset (a channel detonation configuration), where the dynamics in the thermochemical composition space are known through the highly nonlinear and stiff Arrhenius-based chemical source terms. Interpretations of cluster partitions in both physical space and composition space reveal how JSK-means shifts clusters produced by standard K-means towards regions of high chemical sensitivity (e.g., towards regions of peak heat release rate near the detonation reaction zone). The findings presented here illustrate the benefits of utilizing Jacobian-scaled distances in clustering techniques, and the JSK-means method in particular displays promising potential for improving former partition-based modeling strategies in reacting flow (and other multi-physics) applications.
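
The distance modification can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the abstract only says that distance vectors are scaled by matrices obtained from Jacobians evaluated at the centroids, so the exact form used here (the squared norm of `J(c) @ (x - c)`) and the names `scaled_sq_dist`, `assign`, and `jacobian_at` are hypothetical.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def scaled_sq_dist(x, centroid, scale):
    """Squared norm of the distance vector (x - centroid) after scaling by `scale`."""
    diff = [a - b for a, b in zip(x, centroid)]
    scaled = mat_vec(scale, diff)
    return sum(t * t for t in scaled)


def assign(points, centroids, jacobian_at):
    """Assignment step: each point goes to the centroid with the smallest scaled distance.

    `jacobian_at` is a hypothetical user-supplied callable that returns the
    scaling matrix for a given centroid.
    """
    scales = [jacobian_at(c) for c in centroids]
    return [
        min(range(len(centroids)),
            key=lambda j: scaled_sq_dist(x, centroids[j], scales[j]))
        for x in points
    ]
```

With an identity matrix for every centroid this reduces to the usual Euclidean assignment step of K-means, which makes the effect of the scaling easy to compare.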

algorithm, artificial intelligence, machine learning, (16 more...)

2305.01539

Country: North America > United States (0.67)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Zhong, Ying, Huang, Dong, Wang, Chang-Dong

Deep Temporal Contrastive Clustering

arXiv.org Artificial Intelligence, Dec-29-2022

Recently, deep learning has shown its advantage in representation learning and clustering for time series data. Despite the considerable progress, the existing deep time series clustering approaches mostly seek to train the deep neural network by some instance reconstruction based or cluster distribution based objective, which, however, lack the ability to exploit the sample-wise (or augmentation-wise) contrastive information or even the higher-level (e.g., cluster-level) contrastiveness for learning discriminative and clustering-friendly representations. In light of this, this paper presents a deep temporal contrastive clustering (DTCC) approach, which for the first time, to our knowledge, incorporates the contrastive learning paradigm into the deep time series clustering research. Specifically, with two parallel views generated from the original time series and their augmentations, we utilize two identical auto-encoders to learn the corresponding representations, and in the meantime perform the cluster distribution learning by incorporating a k-means objective. Further, two levels of contrastive learning are simultaneously enforced to capture the instance-level and cluster-level contrastive information, respectively. With the reconstruction loss of the auto-encoder, the cluster distribution loss, and the two levels of contrastive losses jointly optimized, the network architecture is trained in a self-supervised manner and the clustering result can thereby be obtained. Experiments on a variety of time series datasets demonstrate the superiority of our DTCC approach over the state-of-the-art.

artificial intelligence, deep learning, machine learning, (16 more...)

2212.14366

Country:

North America > United States > Arizona (0.04)
Asia > China > Guangdong Province (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)

Ghadiri, Mehrdad, Samadi, Samira, Vempala, Santosh

Fair k-Means Clustering

arXiv.org Artificial Intelligence, Jun-17-2020

We show that the popular $k$-means clustering algorithm (Lloyd's heuristic), used for a variety of scientific data, can result in outcomes that are unfavorable to subgroups of data (e.g., demographic groups). Such biased clusterings can have deleterious implications for human-centric applications such as resource allocation. We present a fair $k$-means objective and algorithm to choose cluster centers that provide equitable costs for different groups. The algorithm, Fair-Lloyd, is a modification of Lloyd's heuristic for $k$-means, inheriting its simplicity, efficiency, and stability. In comparison with standard Lloyd's, we find that on benchmark data sets, Fair-Lloyd exhibits unbiased performance by ensuring that all groups have balanced costs in the output $k$-clustering, while incurring a negligible increase in running time, thus making it a viable fair option wherever $k$-means is currently used.
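
To make "balanced costs" concrete, the sketch below computes the average $k$-means cost separately for each group under a fixed set of centers. The abstract does not spell out the fair objective, so treating the largest per-group average as the quantity to watch is an assumption of this example, as are the names `group_costs` and `worst_group_cost`.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def group_costs(points, groups, centers):
    """Average k-means cost per group label, for a fixed set of centers."""
    totals, counts = {}, {}
    for p, g in zip(points, groups):
        cost = min(sq_dist(p, c) for c in centers)
        totals[g] = totals.get(g, 0.0) + cost
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def worst_group_cost(points, groups, centers):
    """Largest per-group average cost (assumed fairness measure for this sketch)."""
    return max(group_costs(points, groups, centers).values())
```

A large gap between the entries of `group_costs` for centers returned by standard Lloyd's is the kind of imbalance the abstract refers to.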

algorithm, artificial intelligence, machine learning, (17 more...)

2006.10085

Country: North America > United States > California (0.46)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine Learning, May-10-2019

Proportionally Fair Clustering

Chen, Xingyu, Fain, Brandon, Lyu, Charles, Munagala, Kamesh

The data points in machine learning are often real human beings. There is legitimate concern that traditional machine learning algorithms that are blind to this fact may inadvertently exacerbate problems of bias and injustice in society [25]. Motivated by concerns ranging from the granting of bail in the legal system to the quality of recommender systems, researchers have devoted considerable effort to developing fair algorithms for the canonical supervised learning tasks of classification and regression [13, 28, 20, 27, 34, 11, 30, 35, 26, 18, 21]. We extend this work to a canonical problem in unsupervised learning: centroid clustering. In centroid clustering, we want to partition data into k clusters by choosing k "centers" and then matching points to one of the centers.

artificial intelligence, machine learning, objective, (17 more...)


1905.03674

Genre: Research Report (0.64)

Industry: Law (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Ailon, Nir, Jaiswal, Ragesh, Monteleoni, Claire

Streaming k-means approximation

Neural Information Processing Systems, Dec-31-2009

We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means, in which the algorithm is allowed to output more than k centers (based on the recent k-means++), and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.
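
The "cluster small batches, then combine" idea can be sketched as below. This is a simplified two-level illustration and not the paper's algorithm: it uses weighted $D^2$ sampling as the batch step, keeps exactly k centers per chunk rather than more than k, and merges all chunk summaries in a single pass instead of a deeper hierarchy. All function names and the `chunk_size` parameter are assumptions for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_seeding(points, weights, k, rng):
    """Weighted D^2 sampling: choose up to k centers from weighted points."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        scores = [w * d for w, d in zip(weights, nearest)]
        if sum(scores) == 0:
            break  # every remaining point coincides with a chosen center
        nxt = rng.choices(points, weights=scores, k=1)[0]
        centers.append(nxt)
        nearest = [min(d, sq_dist(p, nxt)) for p, d in zip(points, nearest)]
    return centers


def summarize(points, weights, k, rng):
    """Cluster one batch; return its centers and the total weight assigned to each."""
    centers = weighted_seeding(points, weights, k, rng)
    mass = [0.0] * len(centers)
    for p, w in zip(points, weights):
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        mass[j] += w
    return centers, mass


def streaming_centers(stream, k, chunk_size, rng=None):
    """One pass over `stream`: summarize each chunk, then cluster the summaries."""
    rng = rng or random.Random()
    summary_pts, summary_wts, chunk = [], [], []
    for p in stream:
        chunk.append(p)
        if len(chunk) == chunk_size:
            c, m = summarize(chunk, [1.0] * len(chunk), k, rng)
            summary_pts += c
            summary_wts += m
            chunk = []
    if chunk:  # leftover partial chunk
        c, m = summarize(chunk, [1.0] * len(chunk), k, rng)
        summary_pts += c
        summary_wts += m
    if not summary_pts:
        return []
    final, _ = summarize(summary_pts, summary_wts, k, rng)
    return final
```

Only one chunk of raw points is held at a time, but the list of summaries in this sketch still grows with the stream length; merging summaries hierarchically, as the abstract describes, is what keeps memory small.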

algorithm, approximation algorithm, k-means objective, (16 more...)

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Pennsylvania (0.04)
(3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)