Clustering
Bioinspired Cortex-based Fast Codebook Generation
Yucel, Meric, Bagis, Serdar, Sertbas, Ahmet, Sarikaya, Mehmet, Ustundag, Burak Berk
A major archetype of artificial intelligence is developing algorithms facilitating temporal efficiency and accuracy while boosting the generalization performance. Even with the latest developments in machine learning, a key limitation has been the inefficient feature extraction from the initial data, which is essential in performance optimization. Here, we introduce a feature extraction method inspired by sensory cortical networks in the brain. Dubbed as bioinspired cortex, the algorithm provides convergence to orthogonal features from streaming signals with superior computational efficiency while processing data in compressed form. We demonstrate the performance of the new algorithm using artificially created complex data by comparing it with the commonly used traditional clustering algorithms, such as Birch, GMM, and K-means. While the data processing time is significantly reduced, seconds versus hours, encoding distortions remain essentially the same in the new algorithm providing a basis for better generalization. Although we show herein the superior performance of the cortex model in clustering and vector quantization, it also provides potent implementation opportunities for machine learning fundamental components, such as reasoning, anomaly detection and classification in large scope applications, e.g., finance, cybersecurity, and healthcare.
A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data
Mouret, Florian, Hippert-Ferrer, Alexandre, Pascal, Frédéric, Tourneret, Jean-Yves
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be not robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or come from a non-Gaussian distributions. To overcome this issue, a new expectation maximization algorithm is investigated for mixtures of elliptical distributions with the nice property of handling potential missing data. The complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework thanks to its conditional distribution, which is shown to be a Student distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
Mapping the Buried Cable by Ground Penetrating Radar and Gaussian-Process Regression
Zhou, Xiren, Chen, Qiuju, Lyu, Shengfei, Chen, Huanhuan
With the rapid expansion of urban areas and the increasingly use of electricity, the need for locating buried cables is becoming urgent. In this paper, a noval method to locate underground cables based on Ground Penetrating Radar (GPR) and Gaussian-process regression is proposed. Firstly, the coordinate system of the detected area is conducted, and the input and output of locating buried cables are determined. The GPR is moved along the established parallel detection lines, and the hyperbolic signatures generated by buried cables are identified and fitted, thus the positions and depths of some points on the cable could be derived. On the basis of the established coordinate system and the derived points on the cable, the clustering method and cable fitting algorithm based on Gaussian-process regression are proposed to find the most likely locations of the underground cables. Furthermore, the confidence intervals of the cable's locations are also obtained. Both the position and depth noises are taken into account in our method, ensuring the robustness and feasibility in different environments and equipments. Experiments on real-world datasets are conducted, and the obtained results demonstrate the effectiveness of the proposed method.
A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals
Santini, Cristian, Gesese, Genet Asefa, Peroni, Silvio, Gangemi, Aldo, Sack, Harald, Alam, Mehwish
Data available in scholarly knowledge graphs (SKGs) - i.e., "a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent potentially different relations between these entities" [14] - is growing continuously every day, leading to a plethora of challenges concerning, for instance, article exploration and visualization [17], article recommendation [3], citation recommendation [11], and Author Name Disambiguation (AND) [24], which is relevant for the purposes of the present article. In particular, AND refers to a specific task of entity resolution which aims at resolving author mentions in bibliographic references to real-world people. Author persistent identifiers, such as ORCIDs and VIAFs, simplify the AND activity since such identifiers can be used for reconciling entities defined as different objects and representing the same real-world person. However, the availability of such persistent identifiers in SKGs - such as OpenCitations (OC) [22], AMiner [27] and Microsoft Academic Knowledge Graph (MAKG) [10] - is characterized by very low coverage and, as such, additional and computationally-oriented techniques must be adopted to identify different authors as the same person. In the past, many automatic approaches have been developed to automatically address AND by using publications metadata (e.g., title, abstract, keywords, venue, affiliation, etc.) to extract some features which can be used in the disambiguation task. These methods vary widely from supervised learning methods to unsupervised learning including recently developed deep neural network-based architectures [31]. However, the existing SKGs do not provide all the relevant contextual information necessary to reuse effectively and efficiently such approaches, that often rely on pure textual data. In contrast with the approaches mentioned above, this study focuses on performing AND for scholarly data represented as linked data or included in SKGs by considering the multi-modal information available in such collections, i.e., the structural information consisting of entities and relations between them as well as text or numeric values associated with the authors and publications defined in the form of literals (family name, given name, publication title, venue title, year of publication, etc.). The proposed framework to address this task is named Literally Author Name Disambiguation (LAND), which focuses on tackling the following research questions: - Can Knowledge Graph Embeddings (KGEs) - i.e. a technique that enables the creation of a "dense representation of the graph in a continuous, low-dimensional vector space that can then be used for machine learning tasks"[13] - be used effectively for the downstream task of clustering, more specifically for author name disambiguation?
Topological data analysis and clustering
With the advent of Big Data, algorithms that try to extract information from them are ubiquitous. Clustering algorithms are a subcategory of Machine Learning algorithms with a wide range of applications. Notions like closeness, distance and shape are central to clustering. It is then natural to try to use ideas and techniques from topology to improve clustering algorithms. This chapter examines some ways on how this could happen. In Section 2 a brief introduction to the clustering task is presented. In Subsection 2.1 a definition is presented along with notation.
Learning with latent group sparsity via heat flow dynamics on networks
Ghosh, Subhroshekhar, Mukherjee, Soumendu Sundar
Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat flow-based local network dynamics. In fact, we demonstrate a procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the heat flow dynamics for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. We validate our approach by successful applications to real-world data from a wide array of application domains, including computer science, genetics, climatology and economics. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
Lead-lag detection and network clustering for multivariate time series with an application to the US equity market
Bennett, Stefanos, Cucuringu, Mihai, Reinert, Gesine
In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.
TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters
Lee, Dongha, Shen, Jiaming, Kang, SeongKu, Yoon, Susik, Han, Jiawei, Yu, Hwanjo
Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage the partial (or incomplete) information about the topic structure as guidance to find out the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely-linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms into either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines for a downstream task.
Superpixel Pre-Segmentation of HER2 Slides for Efficient Annotation
Öttl, Mathias, Mönius, Jana, Marzahl, Christian, Rübner, Matthias, Geppert, Carol I., Hartmann, Arndt, Beckmann, Matthias W., Fasching, Peter, Maier, Andreas, Erber, Ramona, Breininger, Katharina
Supervised deep learning has shown state-of-the-art performance for medical image segmentation across different applications, including histopathology and cancer research; however, the manual annotation of such data is extremely laborious. In this work, we explore the use of superpixel approaches to compute a pre-segmentation of HER2 stained images for breast cancer diagnosis that facilitates faster manual annotation and correction in a second step. Four methods are compared: Standard Simple Linear Iterative Clustering (SLIC) as a baseline, a domain adapted SLIC, and superpixels based on feature embeddings of a pretrained ResNet-50 and a denoising autoencoder. To tackle oversegmentation, we propose to hierarchically merge superpixels, based on their content in the respective feature space. When evaluating the approaches on fully manually annotated images, we observe that the autoencoder-based superpixels achieve a 23% increase in boundary F1 score compared to the baseline SLIC superpixels. Furthermore, the boundary F1 score increases by 73% when hierarchical clustering is applied on the adapted SLIC and the autoencoder-based superpixels. These evaluations show encouraging first results for a pre-segmentation for efficient manual refinement without the need for an initial set of annotated training data.
Multiway Spherical Clustering via Degree-Corrected Tensor Block Models
We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.