 Clustering


Review for NeurIPS paper: From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Neural Information Processing Systems

Additional Feedback: Q1. In the end-to-end training section, do the authors learn embeddings by clustering all points together? That is, are the train, test, and dev points all clustered together, or is each split clustered separately? If all the points are clustered together, this may not be reasonable in practice: test data is not available during training, and no test data should be used for any sort of training. If the authors perform some clustering on the test points as well, then it may not be reasonable to assume access to *all* test data at test time. Evaluation on test data should preferably be possible even when test data arrives in an online fashion.


Review for NeurIPS paper: From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Neural Information Processing Systems

Following the discussion among reviewers and the author response, all reviewers agree on the novelty and significance of the paper's theoretical contribution, which provides approximation guarantees for the proposed embedding. While reviewers raised concerns about empirical performance with respect to computational cost and parameter tuning, these problems are common to other clustering approaches and are not critical flaws of the proposal. Hence I recommend acceptance of this paper.


Reviews: Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

This paper proposes a new approach to approximating hierarchical agglomerative clustering (HAC) by requiring that, at each round, only a gamma-best merge be performed (gamma being the multiplicative approximation factor to the closest pair). Two algorithms are introduced to approximate HAC: one for Ward linkage and one for average linkage. In both cases, the algorithms use approximate nearest neighbor (ANN) search as a black box. In addition, a bucketing data structure is used in the Ward algorithm, and a subsampling procedure is used for average linkage, to guarantee the subquadratic runtime. This is a new contribution to the theoretical literature on HAC: a provably subquadratic algorithm for (an approximation to) HAC in cases other than single linkage.
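
To make the gamma-best merge condition concrete, here is a minimal brute-force sketch for Ward linkage. It is not the paper's algorithm: it checks every pair in quadratic time, whereas the paper relies on ANN and bucketing to avoid exactly that. The function names `ward_cost` and `is_gamma_best_merge` are hypothetical, and the sketch assumes the standard Ward merge cost.

```python
from itertools import combinations


def ward_cost(a, b):
    """Standard Ward cost: increase in within-cluster sum of squares
    caused by merging clusters a and b (lists of equal-length tuples)."""
    dim = len(a[0])
    mean_a = [sum(p[d] for p in a) / len(a) for d in range(dim)]
    mean_b = [sum(p[d] for p in b) / len(b) for d in range(dim)]
    sq_dist = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def is_gamma_best_merge(clusters, i, j, gamma):
    """True if merging clusters i and j costs at most gamma times the
    cheapest available merge (brute force over all pairs, for illustration)."""
    best = min(
        ward_cost(clusters[p], clusters[q])
        for p, q in combinations(range(len(clusters)), 2)
    )
    return ward_cost(clusters[i], clusters[j]) <= gamma * best


clusters = [[(0.0,)], [(1.0,)], [(2.5,)]]
exact_only = is_gamma_best_merge(clusters, 1, 2, gamma=1.0)   # exact HAC rejects this merge
relaxed = is_gamma_best_merge(clusters, 1, 2, gamma=2.5)      # allowed under a looser gamma
```

With gamma equal to 1 the condition reduces to exact HAC; larger values of gamma admit more candidate merges, which is what lets an approximate search stand in for the exact closest pair.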


Reviews: Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

This was very much a borderline paper, with two accept scores and one reject score. One of the concerns raised by the negative reviewer was that, while the algorithm can achieve an approximation to the best merge at each step, it is unclear how the final clustering results would compare to those of the standard algorithm. The authors addressed this in their rebuttal, which helped. There were also some issues raised about the experiments, as well as various minor suggestions (typos etc.). In general, the concerns seem mostly minor, and on the whole this paper makes an interesting and worthwhile contribution, so I recommend that the paper be accepted.


Fixed-sized clusters $k$-Means

arXiv.org Artificial Intelligence

We present a $k$-means-based clustering algorithm that optimizes the mean square error for given cluster sizes. A straightforward application is balanced clustering, where all clusters have equal size. In the $k$-means assignment phase, the algorithm solves an assignment problem using the Hungarian algorithm. This makes the time complexity of the assignment phase $O(n^3)$. This enables clustering of datasets of more than 5000 points.
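
The assignment phase can be sketched as follows, under stated assumptions. Each cluster is expanded into as many "slots" as its prescribed size, and points are matched to slots at minimum total squared distance. This is not the authors' code: the function name `fixed_size_assign` is hypothetical, and SciPy's `linear_sum_assignment` is used as a stand-in solver (it solves the same assignment problem, though its documentation describes a modified Jonker-Volgenant algorithm rather than the classical Hungarian method).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fixed_size_assign(points, centroids, sizes):
    """Assign each point to a cluster so that cluster c receives exactly
    sizes[c] points, minimizing the total squared distance to centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    if sum(sizes) != len(points):
        raise ValueError("cluster sizes must sum to the number of points")

    # slot_owner[s] is the cluster that slot s belongs to, e.g. [0, 0, 1, 1].
    slot_owner = np.repeat(np.arange(len(centroids)), sizes)

    # n x k squared distances, then expanded to an n x n point-to-slot cost.
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    cost = sq_dist[:, slot_owner]

    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(len(points), dtype=int)
    labels[rows] = slot_owner[cols]
    return labels


points = [[0.0], [0.1], [0.2], [10.0]]
centroids = [[0.0], [10.0]]
labels = fixed_size_assign(points, centroids, sizes=[2, 2])
```

In this toy input, plain $k$-means would place three points in the first cluster; the size constraint instead moves the point at 0.2 to the second cluster. A full algorithm would alternate this step with the usual centroid update.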


An FPGA-Based Neuro-Fuzzy Sensor for Personalized Driving Assistance

arXiv.org Artificial Intelligence

Depending on their level of sophistication, sensors range from simple sensors that directly measure single physical parameters (e.g., ambient light sensors and temperature sensors) to complex intelligent sensors, which determine parameters of the surrounding environment through wide-spectrum signals (e.g., radio frequency/radar and light/video); besides measuring, the latter perform data processing and can carry out actuations. Intelligent sensors draw on underlying data of diverse nature, in which complex and nonlinear behaviors are encoded; data-mining techniques used jointly with machine learning (ML) algorithms have shown adequate performance for modeling this hidden information. As intelligent sensors often rely on complex sensors and sensor fusion techniques, the data processing power they need can only be provided by high-performance computational platforms such as microprocessors, graphics-processing units (GPUs), or field-programmable gate arrays (FPGAs). In particular, FPGA-based implementations stand out due to the extremely high operational frequencies and low power consumption they can achieve, even for complex, multilayered algorithms [1]. In the automotive field, intelligent sensors are key components of current assistance systems.


Reviews: Foundations of Comparison-Based Hierarchical Clustering

Neural Information Processing Systems

In this work, the authors study hierarchical clustering under the quadruplet comparison framework. The authors show that single and complete linkage are inherently comparison-based and propose two variants of average linkage clustering that exploit quadruplet comparisons. An exact hierarchy recovery guarantee is provided under a planted hierarchical partition model, together with an empirical evaluation. The meanings of the variables $\mu$, $\delta$, etc. are hard to interpret from the description in the main text. They are nicely summarized (and explained) in Appendix A.1.
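
The observation that single linkage is inherently comparison-based can be illustrated with a small sketch that never reads a distance value and only asks an oracle whether one pair is closer than another. This is an illustrative brute-force version, not the paper's implementation; the function name `single_linkage_from_comparisons` and the oracle signature `is_closer(a, b, c, d)` are assumptions.

```python
from itertools import combinations


def single_linkage_from_comparisons(n, is_closer):
    """Single linkage over items 0..n-1 using only quadruplet comparisons.

    is_closer(a, b, c, d) must return True when d(a, b) < d(c, d).
    Returns the list of merges as pairs of sorted member lists.
    """
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None  # (cluster index p, cluster index q, item a, item b)
        for p, q in combinations(range(len(clusters)), 2):
            for a in clusters[p]:
                for b in clusters[q]:
                    # Keep the closest cross-cluster pair seen so far.
                    if best is None or is_closer(a, b, best[2], best[3]):
                        best = (p, q, a, b)
        p, q = best[0], best[1]
        merges.append((sorted(clusters[p]), sorted(clusters[q])))
        merged = clusters[p] + clusters[q]
        clusters = [c for i, c in enumerate(clusters) if i not in (p, q)]
        clusters.append(merged)
    return merges


# Toy oracle backed by hidden positions on a line; the clustering code
# itself only ever sees the True/False answers.
hidden = [0, 1, 5, 7, 20]
oracle = lambda a, b, c, d: abs(hidden[a] - hidden[b]) < abs(hidden[c] - hidden[d])
merges = single_linkage_from_comparisons(len(hidden), oracle)
```

Complete linkage admits the same treatment with the roles of closest and farthest cross-cluster pairs adjusted, whereas average linkage needs actual distance values, which is why the paper proposes comparison-based variants of it.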


Reviews: Foundations of Comparison-Based Hierarchical Clustering

Neural Information Processing Systems

The authors have proposed two variants of average linkage hierarchical clustering using the quadruplet comparison framework. Theoretical results on hierarchy recovery are established under a suitable model. The reviewers agree that the results are new and important. The authors should incorporate the suggestions made by the reviewers to further strengthen the paper.


ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis

arXiv.org Artificial Intelligence

Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e., sustainability disclosures), not least by the effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges (immateriality, complexity, and subjectivity) that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets, such as ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, captures relevant and actionable sustainability information from sustainability disclosures more effectively than state-of-the-art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31% respectively. These metrics describe the extent to which topic terms are related to ESG and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, which makes its simplicity a key advantage for non-technical stakeholders.


Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

arXiv.org Artificial Intelligence

Estimating the number of clusters (say k) is one of the major challenges in partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. For the clustering step, the algorithm uses a kernel density estimation approach to define cluster centers. In addition, it uses an information-theoretic dissimilarity to measure the distance between centers and objects in each cluster. A silhouette-analysis-based approach is then used to evaluate the quality of the clusterings obtained in the former step and to choose the best k. Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC with three other algorithms. Experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
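
The silhouette step can be sketched as below, as a rough illustration rather than the k-SCC implementation. Two assumptions are made: Hamming distance stands in for the paper's information-theoretic dissimilarity, and the clustering labels are taken as given instead of being produced by the kernel-density-based clustering step. The function names are hypothetical.

```python
def hamming(x, y):
    """Number of attributes on which two categorical records differ
    (a stand-in for the paper's information-theoretic dissimilarity)."""
    return sum(a != b for a, b in zip(x, y))


def silhouette_score(data, labels, dist=hamming):
    """Mean silhouette coefficient of a labelling with at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention, a singleton contributes s(i) = 0
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dist(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(
            sum(dist(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = silhouette_score(data, [0, 0, 1, 1])
bad = silhouette_score(data, [0, 1, 0, 1])
```

To choose k, one would compute this score for the clustering obtained at each candidate k and keep the k with the highest score.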