k-means
fbefa505c8e8bf6d46f38f5277fed8d6-AuthorFeedback.pdf
We would like to point out that K-means is used only once to initialize the representative sets and is not an intrinsic component of the online algorithm. What is important to observe is that N is kept constant throughout in order to reduce the storage footprint and to ensure low-complexity online processing. To maintain the list constant, for every added point another point is removed. Also, the reviewer is correct in observing that N does not feature in the convergence results, which are asymptotic and do not imply anything about the convergence rate. Clearly, if the point dimension m is large, it is beneficial to increase N.
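The bookkeeping described in this response (one-off K-means initialization, then a list of constant size N in which each added point displaces another) can be sketched as follows. This is only an illustration under stated assumptions: the class `RepresentativeSet`, the helper `kmeans`, and in particular the first-in-first-out eviction rule are hypothetical, since the excerpt does not say which point is removed or how the algorithm selects points to add.

```python
import random
from collections import deque


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids


class RepresentativeSet:
    """Hypothetical fixed-size list of N representatives."""

    def __init__(self, initial_points, n, seed=0):
        # K-means is used only once, here, to pick the initial representatives.
        self.items = deque(kmeans(initial_points, n, seed=seed), maxlen=n)

    def add(self, point):
        # ASSUMPTION: the oldest representative is evicted (FIFO). The source only
        # states that one point is removed for every point added.
        self.items.append(tuple(point))  # a full deque with maxlen drops its leftmost item
```

Whatever eviction rule is used, storage stays at N points, which is the property the response emphasizes.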
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- Europe > Italy > Lazio > Rome (0.04)
- North America > United States > Illinois (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > West Bengal > Kolkata (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
Learning Multi-type heterogeneous interacting particle systems
Lang, Quanjun, Wang, Xiong, Lu, Fei, Maggioni, Mauro
We propose a framework for the joint inference of network topology, multi-type interaction kernels, and latent type assignments in heterogeneous interacting particle systems from multi-trajectory data. This learning task is a challenging non-convex mixed-integer optimization problem, which we address through a novel three-stage approach. First, we leverage shared structure across agent interactions to recover a low-rank embedding of the system parameters via matrix sensing. Second, we identify discrete interaction types by clustering within the learned embedding. Third, we recover the network weight matrix and kernel coefficients through matrix factorization and a post-processing refinement. We provide theoretical guarantees with estimation error bounds under a Restricted Isometry Property (RIP) assumption and establish conditions for the exact recovery of interaction types based on cluster separability. Numerical experiments on synthetic datasets, including heterogeneous predator-prey systems, demonstrate that our method yields an accurate reconstruction of the underlying dynamics and is robust to noise.
- North America > United States (0.28)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Research Report (1.00)
- Workflow (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)
Perfect Clustering for Sparse Directed Stochastic Block Models
Exact recovery in stochastic block models (SBMs) is well understood in undirected settings, but remains considerably less developed for directed and sparse networks, particularly when the number of communities diverges. Spectral methods for directed SBMs often lack stability in asymmetric, low-degree regimes, and existing non-spectral approaches focus primarily on undirected or dense settings. We propose a fully non-spectral, two-stage procedure for community detection in sparse directed SBMs with potentially growing numbers of communities. The method first estimates the directed probability matrix using a neighborhood-smoothing scheme tailored to the asymmetric setting, and then applies $K$-means clustering to the estimated rows, thereby avoiding the limitations of eigen- or singular value decompositions in sparse, asymmetric networks. Our main theoretical contribution is a uniform row-wise concentration bound for the smoothed estimator, obtained through new arguments that control asymmetric neighborhoods and separate in- and out-degree effects. These results imply the exact recovery of all community labels with probability tending to one, under mild sparsity and separation conditions that allow both $γ_n \to 0$ and $K_n \to \infty$. Simulation studies, including highly directed, sparse, and non-symmetric block structures, demonstrate that the proposed procedure performs reliably in regimes where directed spectral and score-based methods deteriorate. To the best of our knowledge, this provides the first exact recovery guarantee for this class of non-spectral, neighborhood-smoothing methods in the sparse, directed setting.
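The two-stage shape of the procedure (smooth the adjacency matrix into a row-wise probability estimate, then cluster the estimated rows) can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's estimator: `smooth_rows` uses a simple nearest-profile average in place of the tailored neighborhood-smoothing scheme, the neighborhood size `h` is a hypothetical parameter, and `kmeans_labels` uses a deterministic farthest-first initialization chosen only to keep the example reproducible.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def smooth_rows(adj, h):
    """Stage 1 (simplified): estimate row i of the probability matrix by averaging
    the adjacency rows of the h nodes whose profiles are closest to node i's."""
    n = len(adj)
    cols = [list(c) for c in zip(*adj)]
    # Profile = out-edges (row) followed by in-edges (column), so that both
    # edge directions inform which nodes count as similar.
    profile = [list(adj[i]) + cols[i] for i in range(n)]
    est = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: sq_dist(profile[i], profile[j]))[:h]
        est.append([sum(adj[j][c] for j in nearest) / h for c in range(n)])
    return est


def kmeans_labels(rows, k, iters=50):
    """Stage 2: Lloyd's K-means on the estimated rows; returns one label per row."""
    # Farthest-first initialization (an assumption made for determinism).
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_directed_graph(adj, k, h):
    """Run both stages on a 0/1 adjacency matrix given as a list of lists."""
    return kmeans_labels(smooth_rows(adj, h), k)
```

Calling `cluster_directed_graph(adj, k=2, h=3)` on a small directed graph in which edges run only from one group of nodes to the other separates the two groups, since senders and receivers have very different rows even though the matrix is far from symmetric.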
Comparative Analysis of Hash-based Malware Clustering via K-Means
Thein, Aink Acrie Soe, Pitropakis, Nikolaos, Papadopoulos, Pavlos, Grierson, Sam, Jan, Sana Ullah
With the adoption of multiple digital devices in everyday life, the cyber-attack surface has increased. Adversaries are continuously exploring new avenues to exploit them and deploy malware. On the other hand, detection approaches typically employ hashing-based algorithms such as SSDeep, TLSH, and IMPHash to capture structural and behavioural similarities among binaries. This work focuses on the analysis and evaluation of these techniques for clustering malware samples using the K-means algorithm. More specifically, we experimented with established malware families and traits and found that TLSH and IMPHash produce more distinct, semantically meaningful clusters, whereas SSDeep is more efficient for broader classification tasks. The findings of this work can guide the development of more robust threat-detection mechanisms and adaptive security mechanisms.
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.34)