 Clustering


Review for NeurIPS paper: Higher-Order Spectral Clustering of Directed Graphs

Neural Information Processing Systems

Summary and Contributions: The paper considers graph clustering on directed graphs. The authors introduce a new clustering objective called the flow ratio. For any ordered partition of the vertex set V into k pairwise disjoint subsets (S0, ..., Sk-1), the flow ratio of the partition is the sum of the average flow (i.e. ...). The optimal clustering is the partition of V that maximizes the flow ratio among all possible partitions. The authors represent the directed graph using the Hermitian adjacency matrix.
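As a rough illustration of the Hermitian adjacency representation mentioned in the summary, the sketch below uses one common convention: an edge u -> v contributes +i at entry (u, v) and -i at entry (v, u). The summary does not state which entries the paper uses, so this convention and the helper names `hermitian_adjacency` and `is_hermitian` are assumptions made for illustration only.

```python
def hermitian_adjacency(n, edges):
    """Build an n x n Hermitian adjacency matrix from directed edges (u, v).

    Assumed convention (not taken from the paper): edge u -> v adds +1j at
    (u, v) and -1j at (v, u), so a pair of opposite edges cancels out.
    """
    H = [[0j] * n for _ in range(n)]
    for u, v in edges:
        H[u][v] += 1j  # direction u -> v
        H[v][u] -= 1j  # conjugate entry keeps the matrix Hermitian
    return H


def is_hermitian(H):
    """Check that H equals its own conjugate transpose."""
    n = len(H)
    return all(H[u][v] == H[v][u].conjugate()
               for u in range(n) for v in range(n))


# Directed 3-cycle: 0 -> 1 -> 2 -> 0
H = hermitian_adjacency(3, [(0, 1), (1, 2), (2, 0)])
```

Because the matrix is Hermitian, its eigenvalues are real, which is what makes spectral methods applicable even though the graph is directed.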


Reviews: Flattening a Hierarchical Clustering through Active Learning

Neural Information Processing Systems

This paper derives complexity results for active learning queries to hierarchical clustering. The result is a partition or "cut", c, of the cluster tree, where the "flat" clustering is defined by the clusters at the leaves of a subtree of nodes AB(c) that have the same root as the original cluster tree. Learning occurs by making pairwise judgments on items (leaf nodes). All pairwise judgments form a "ground truth" matrix \Sigma. Given consistency conditions, this is an equivalent way to represent a clustering.


Reviews: Flattening a Hierarchical Clustering through Active Learning

Neural Information Processing Systems

The reviewers appreciate the fact that the algorithm can achieve sharp query complexity guarantees under challenging noisy settings. The only weakness of the paper is motivation - what is a practical scenario where we have these two sources of data?


Fuel Efficiency Analysis of the Public Transportation System Based on the Gaussian Mixture Model Clustering

arXiv.org Artificial Intelligence

Public transportation is a major source of greenhouse gas emissions, highlighting the need to improve bus fuel efficiency. Clustering algorithms assist in analyzing fuel efficiency by grouping data into clusters, but irrelevant features may complicate the analysis, and choosing the optimal number of clusters remains a challenging task. Therefore, this paper employs Gaussian mixture models to cluster the solo fuel-efficiency dataset. Moreover, an integration method that combines the Silhouette index, Calinski-Harabasz index, and Davies-Bouldin index is developed to select the optimal number of clusters. A dataset of 4006 bus trips in North Jutland, Denmark is used as the case study. Trips are first split into three groups, then one group is divided further, resulting in four categories: extreme, normal, low, and extremely low fuel efficiency. A preliminary study using visualization analysis is conducted to investigate how driving behaviors and route conditions affect fuel efficiency. The results indicate that both individual driving habits and route characteristics have a significant influence on fuel efficiency.
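The abstract does not say how the three indices are combined. One plausible scheme, sketched below under that caveat, is to min-max normalize each index across the candidate cluster counts, flip the Davies-Bouldin index (where lower is better), and average the three scores. The function names and the toy index values are hypothetical and are not results from the paper.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; constant lists map to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_num_clusters(index_values):
    """Pick k from a dict k -> (silhouette, calinski_harabasz, davies_bouldin).

    Silhouette and Calinski-Harabasz are higher-is-better; Davies-Bouldin is
    lower-is-better, so its normalized score is inverted before averaging.
    """
    ks = sorted(index_values)
    sil = min_max([index_values[k][0] for k in ks])
    ch = min_max([index_values[k][1] for k in ks])
    db = [1.0 - v for v in min_max([index_values[k][2] for k in ks])]
    combined = {k: (s + c + d) / 3.0 for k, s, c, d in zip(ks, sil, ch, db)}
    best_k = max(combined, key=combined.get)
    return best_k, combined


# Made-up index values for k = 2, 3, 4 (illustration only).
toy_values = {2: (0.40, 100.0, 1.2), 3: (0.55, 180.0, 0.8), 4: (0.50, 150.0, 0.9)}
best_k, combined = select_num_clusters(toy_values)
```

In practice the index values would come from fitting a mixture model for each candidate k and scoring the resulting labels; averaging normalized scores is only one of several reasonable ways to reconcile indices that disagree.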


CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently achieved impressive results in complex reasoning tasks through Chain of Thought (CoT) prompting. However, most existing CoT methods rely on using the same prompts, whether manually designed or automatically generated, to handle the entire dataset. This one-size-fits-all approach may fail to meet the specific needs arising from the diversity within a single dataset. To solve this problem, we propose the Clustered Distance-Weighted Chain of Thought (CDW-CoT) method, which dynamically constructs prompts tailored to the characteristics of each data instance by integrating clustering and prompt optimization techniques. Our method employs clustering algorithms to categorize the dataset into distinct groups, from which a candidate pool of prompts is selected to reflect the inherent diversity within the dataset. For each cluster, CDW-CoT trains the optimal prompt probability distribution tailored to its specific characteristics. Finally, it dynamically constructs a unique prompt probability distribution for each test instance, based on its proximity to cluster centers, from which prompts are selected for reasoning. CDW-CoT consistently outperforms traditional CoT methods across six datasets, including commonsense, symbolic, and mathematical reasoning tasks. Specifically, when compared to manual CoT, CDW-CoT achieves an average accuracy improvement of 25.34% on LLaMA2 (13B) and 15.72% on LLaMA3 (8B).
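The final step, building a per-instance prompt distribution from proximity to cluster centers, can be sketched as a distance-weighted mixture of the per-cluster distributions. The abstract does not give the weighting rule, so the softmax over negative Euclidean distances, the `temperature` parameter, and the function names below are assumptions, and the numbers are toy values.

```python
import math


def distance_weights(point, centers, temperature=1.0):
    """Turn distances to cluster centers into weights that sum to 1.

    Closer centers get larger weights via a softmax over negative distances
    (an assumed rule, not necessarily the one used by CDW-CoT).
    """
    dists = [math.dist(point, c) for c in centers]
    d0 = min(dists)  # shift by the smallest distance for numerical stability
    scores = [math.exp(-(d - d0) / temperature) for d in dists]
    total = sum(scores)
    return [s / total for s in scores]


def mix_distributions(weights, cluster_distributions):
    """Weighted average of per-cluster prompt probability distributions."""
    n_prompts = len(cluster_distributions[0])
    return [sum(w * dist[j] for w, dist in zip(weights, cluster_distributions))
            for j in range(n_prompts)]


# Two clusters, three candidate prompts (toy numbers).
centers = [(0.0, 0.0), (10.0, 0.0)]
cluster_distributions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
weights = distance_weights((1.0, 0.0), centers)
mixed = mix_distributions(weights, cluster_distributions)
```

A test instance near the first center inherits almost all of that cluster's prompt preferences, while an instance midway between centers gets a blend of both.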


Learning segmentation from point trajectories

arXiv.org Artificial Intelligence

We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model -- any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.


Optimizing Portfolio Performance through Clustering and Sharpe Ratio-Based Optimization: A Comparative Backtesting Approach

arXiv.org Artificial Intelligence

Optimizing portfolio performance is a fundamental challenge in financial modeling, requiring the integration of advanced clustering techniques and data-driven optimization strategies. This paper introduces a comparative backtesting approach that combines clustering-based portfolio segmentation and Sharpe ratio-based optimization to enhance investment decision-making. First, we segment a diverse set of financial assets into clusters based on their historical log-returns using K-Means clustering. This segmentation enables the grouping of assets with similar return characteristics, facilitating targeted portfolio construction. Next, for each cluster, we apply a Sharpe ratio-based optimization model to derive optimal weights that maximize risk-adjusted returns. Unlike traditional mean-variance optimization, this approach directly incorporates the trade-off between returns and volatility, resulting in a more balanced allocation of resources within each cluster. The proposed framework is evaluated through a backtesting study using historical data spanning multiple asset classes. Optimized portfolios for each cluster are constructed and their cumulative returns are compared over time against a traditional equal-weighted benchmark portfolio.
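For reference, the quantity being maximized within each cluster can be sketched with the standard Sharpe ratio definition: excess portfolio return divided by portfolio volatility. The abstract does not give the exact formulation or the optimizer, so the function below only evaluates the ratio for given weights, and the two-asset inputs are toy values.

```python
import math


def sharpe_ratio(weights, mean_returns, cov, risk_free=0.0):
    """Sharpe ratio of a portfolio: (w . mu - r_f) / sqrt(w^T Sigma w)."""
    n = len(weights)
    port_return = sum(w * m for w, m in zip(weights, mean_returns))
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return (port_return - risk_free) / math.sqrt(variance)


# Two uncorrelated assets with made-up mean returns and variances.
mean_returns = [0.10, 0.20]
cov = [[0.04, 0.00],
       [0.00, 0.09]]
equal_weight = sharpe_ratio([0.5, 0.5], mean_returns, cov)
first_only = sharpe_ratio([1.0, 0.0], mean_returns, cov)
```

An optimizer would search over weights (typically non-negative and summing to 1) for the allocation with the highest value of this ratio; the equal-weighted case corresponds to the benchmark portfolio the abstract mentions.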


Improving Fine-Tuning with Latent Cluster Correction

arXiv.org Artificial Intelligence

This paper proposes a novel fine-tuning method that boosts performance by optimising the formation of these latent clusters, using the Louvain community detection algorithm and a specifically designed clustering loss function. We present preliminary results that demonstrate the viability of this process on classical neural network architectures during fine-tuning on the CIFAR-100 dataset.


Reviews: Optimal Cluster Recovery in the Labeled Stochastic Block Model

Neural Information Processing Systems

Is this a fundamental bottleneck or an artifact of the proof technique? What happens if we tolerate p(i, j, l) that do not depend on n? 3) As a result of the assumption that all clusters grow linearly in n, Theorem 3 for L 2 gives a suboptimal result for the minimum cluster size (which is a bottleneck for clustering algorithms). In particular, the minimum cluster size has to be \Omega(n). In both cases (convex algorithms and spectral clustering), p and q in the SBM can be as small as \Omega(polylog(n) / n). Minor point: It would be better to have the algorithm in the paper, since part of the paper is about guarantees for it.


Reviews: Automated scalable segmentation of neurons from multispectral images

Neural Information Processing Systems

After reading the authors' rebuttal I have increased the technical quality to 2, and after reading the other reviews I increased the potential impact to 3. The authors replied to many questions but not to all; in particular, the answer to the question about the parameter K, which is one of the crucial parameters in any segmentation algorithm, was not satisfactory. Why did they not provide the results using the suggested automatic method in Fig4 instead of cycling over possible (wrong) numbers of clusters? I would have expected to see in the results the performance with at least one auto-tuning heuristic to assess its generality (at least the one suggested by the authors). The issues found in the paper are as follows: 1) In Eq(2), when constructing the adjacency matrix, are the ranges of the distances d(...) and \delta(...) the same? In line 114 d(s) is a measure of heterogeneity, in line 125 of distance, and in Eq(2) of color distance.