AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
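As a rough illustration of the definition above, the following minimal sketch groups a handful of 2-D points with Lloyd's k-means algorithm, one of many possible clustering methods and not one prescribed by the definition. The function name `kmeans`, the evenly spaced initialisation, and the sample points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    n = len(points)
    # Assumption: deterministic initialisation from evenly spaced input points,
    # chosen only to keep the example reproducible.
    centroids = [points[(i * n) // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical sample data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Points that share a label are closer to their own centroid than to the other one, which is the "more similar to each other than to those in other groups" property in its simplest Euclidean form.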

SMIXS: Novel efficient algorithm for non-parametric mixture regression-based clustering

Mlakar, Peter, Nummi, Tapio, Oblak, Polona, Pucer, Jana Faganeli

arXiv.org Artificial Intelligence, Sep-19-2022

We investigate a novel non-parametric regression-based clustering algorithm for longitudinal data analysis. Combining natural cubic splines with Gaussian mixture models (GMM), the algorithm can produce smooth cluster means that describe the underlying data well. However, there are some shortcomings in the algorithm: high computational complexity in the parameter estimation procedure and a numerically unstable variance estimator. Therefore, to further increase the usability of the method, we incorporated approaches to reduce its computational complexity, we developed a new, more stable variance estimator, and we developed a new smoothing parameter estimation procedure. We show that the developed algorithm, SMIXS, performs better than GMM on a synthetic dataset in terms of clustering and regression performance. We demonstrate the impact of the computational speed-ups, which we formally prove in the new framework. Finally, we perform a case study by using SMIXS to cluster vertical atmospheric measurements to determine different weather regimes.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2209.0903

Country:

Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.05)
North America > United States > Maryland > Baltimore (0.04)
Europe > Finland > Pirkanmaa > Tampere (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

HiPart: Hierarchical Divisive Clustering Toolbox

Anagnostou, Panagiotis, Tasoulis, Sotiris, Plagianakos, Vassilis, Tasoulis, Dimitris

arXiv.org Artificial Intelligence, Sep-18-2022

This paper presents the HiPart package, an open-source native Python library that provides efficient and interpretable implementations of divisive hierarchical clustering algorithms. HiPart supports interactive visualizations for the manipulation of the execution steps, allowing direct intervention in the clustering outcome. This package is highly suited for Big Data applications, as the focus has been given to the computational efficiency of the implemented clustering methodologies. The dependencies used are either Python built-in packages or highly maintained stable external packages. The software is provided under the MIT license.

data mining, machine learning, programming language, (17 more...)

doi: 10.21105/joss.05024

2209.0868

Country:

Europe > Greece (0.05)
Asia > Middle East > Republic of Türkiye > Erzurum Province > Erzurum (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Adaptive Dimension Reduction and Variational Inference for Transductive Few-Shot Classification

Hu, Yuqing, Pateux, Stéphane, Gripon, Vincent

arXiv.org Artificial Intelligence, Sep-18-2022

Transductive Few-Shot learning has gained increased attention, given the cost of data annotation and the increased accuracy provided by unlabelled samples in the few-shot domain. Especially in Few-Shot Classification (FSC), recent works explore the feature distributions, aiming at maximizing likelihoods or posteriors with respect to the unknown parameters. Following this vein, and considering the parallel between FSC and clustering, we seek to better take into account the uncertainty in estimation due to lack of data, as well as the statistical properties of the clusters associated with each class. Therefore, in this paper we propose a new clustering method based on Variational Bayesian inference, further improved by Adaptive Dimension Reduction based on Probabilistic Linear Discriminant Analysis. Our proposed method significantly improves accuracy in the realistic unbalanced transductive setting on various Few-Shot benchmarks when applied to features used in previous studies, with a gain of up to 6% in accuracy. In addition, when applied to the balanced setting, we obtain very competitive results without making use of the class-balance artefact, which is disputable for practical use cases. We also provide the performance of our method on a high-performing pretrained backbone, with the reported results further surpassing the current state-of-the-art accuracy, suggesting the genericity of the proposed method.

artificial intelligence, bayesian inference, machine learning, (13 more...)

2209.08527

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > France > Brittany > Finistère > Brest (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

ASAP: Adaptive Scheme for Asynchronous Processing of Event-based Vision Algorithms

Tapia, Raul, Eguíluz, Augusto Gómez, Dios, José Ramiro Martínez-de, Ollero, Anibal

arXiv.org Artificial Intelligence, Sep-18-2022

Event cameras can capture pixel-level illumination changes with very high temporal resolution and dynamic range. They have received increasing research interest due to their robustness to lighting conditions and motion blur. Two main approaches exist in the literature to feed event-based processing algorithms: packaging the triggered events in event packages, and sending them one by one as single events. These approaches are limited by either processing overflow or lack of responsivity. Processing overflow is caused by high event generation rates, when the algorithm cannot process all the events in real time. Conversely, lack of responsivity happens at low event generation rates, when the event packages are sent at too low a frequency. This paper presents ASAP, an adaptive scheme to manage the event stream through variable-size packages that adapt to the event package processing times. The experimental results show that ASAP is capable of feeding an asynchronous event-by-event clustering algorithm in a responsive and efficient manner while preventing overflow.

algorithm, artificial intelligence, machine learning, (18 more...)

2209.08597

Country: Europe > Spain (0.04)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

Azhir, Elham, Hosseinzadeh, Mehdi, Khan, Faheem, Mosavi, Amir

arXiv.org Artificial Intelligence, Sep-17-2022

Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this method, the query optimizer divides the query space into clusters. However, traditional clustering algorithms take a significant amount of execution time to cluster such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. Apache Spark and Apache Hadoop frameworks are used in the present investigation to cluster query datasets of different sizes in the MapReduce-based access plan recommendation method. The performance evaluation is based on execution time. The experimental results demonstrate the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.

data mining, machine learning, natural language, (18 more...)

2210.07143

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
South America > Peru > Cusco Department > Cusco Province > Cusco (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry: Telecommunications (0.46)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Enhanced Fairness Testing via Generating Effective Initial Individual Discriminatory Instances

Ma, Minghua, Tian, Zhao, Hort, Max, Sarro, Federica, Zhang, Hongyu, Lin, Qingwei, Zhang, Dongmei

arXiv.org Artificial Intelligence, Sep-17-2022

Fairness testing aims at mitigating unintended discrimination in the decision-making process of data-driven AI systems. Individual discrimination may occur when an AI model makes different decisions for two distinct individuals who are distinguishable solely according to protected attributes, such as age and race. Such instances reveal biased AI behaviour, and are called Individual Discriminatory Instances (IDIs). In this paper, we propose an approach for the selection of the initial seeds to generate IDIs for fairness testing. Previous studies mainly used random initial seeds to this end. However, this phase is crucial, as these seeds are the basis of the follow-up IDI generation. We dubbed our proposed seed selection approach I&D. It generates a large number of initial IDIs exhibiting great diversity, aiming at improving the overall performance of fairness testing. Our empirical study reveals that I&D is able to produce a larger number of IDIs than four state-of-the-art seed generation approaches, generating 1.68X more IDIs on average. Moreover, we compare the use of I&D to train machine learning models and find that using I&D reduces the number of remaining IDIs by 29% when compared to the state-of-the-art, thus indicating that I&D is effective for improving model fairness.

artificial intelligence, idis, machine learning, (16 more...)

2209.08321

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > China > Beijing > Beijing (0.04)
Oceania > Australia > New South Wales > Callaghan (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective

Yang, Li, Shami, Abdallah

arXiv.org Artificial Intelligence, Sep-16-2022

With the wide spread of sensors and smart devices in recent years, the data generation speed of the Internet of Things (IoT) systems has increased dramatically. In IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges, specifically, effective model selection, design/tuning, and updating, which have brought massive demand for experienced data scientists. Additionally, the dynamic nature of IoT data may introduce concept drift issues, causing model performance degradation. To reduce human efforts, Automated Machine Learning (AutoML) has become a popular field that aims to automatically select, construct, tune, and update machine learning models to achieve the best performance on specified tasks. In this paper, we conduct a review of existing methods in the model selection, tuning, and updating procedures in the area of AutoML in order to identify and summarize the optimal solutions for every step of applying ML algorithms to IoT data analytics. To justify our findings and help industrial users and researchers better implement AutoML approaches, a case study of applying AutoML to IoT anomaly detection problems is conducted in this work. Lastly, we discuss and classify the challenges and research directions for this domain.

evolutionary algorithm, machine learning, reinforcement learning, (18 more...)

doi: 10.1016/j.engappai.2022.105366

2209.08018

Country:

North America > Canada > Ontario > Middlesex County > London (0.27)
North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(10 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.87)

Industry:

Information Technology > Smart Houses & Appliances (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(4 more...)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(8 more...)

Interactions in Information Spread

Poux-Médard, Gaël

arXiv.org Artificial Intelligence, Sep-16-2022

Since the development of writing 5000 years ago, human-generated data gets produced at an ever-increasing pace. Classical archival methods aimed at easing information retrieval. Nowadays, archiving is not enough anymore. The amount of data that gets generated daily is beyond human comprehension, and appeals for new information retrieval strategies. Instead of referencing every single data piece as in traditional archival techniques, a more relevant approach consists in understanding the overall ideas conveyed in data flows. To spot such general tendencies, a precise comprehension of the underlying data generation mechanisms is required. In the rich literature tackling this problem, the question of information interaction remains nearly unexplored. First, we investigate the frequency of such interactions. Building on recent advances made in Stochastic Block Modelling, we explore the role of interactions in several social networks. We find that interactions are rare in these datasets. Then, we wonder how interactions evolve over time. Earlier data pieces should not have an everlasting influence on ulterior data generation mechanisms. We model this using dynamic network inference advances. We conclude that interactions are brief. Finally, we design a framework that jointly models rare and brief interactions based on Dirichlet-Hawkes Processes. We argue that this new class of models fits brief and sparse interaction modelling. We conduct a large-scale application on Reddit and find that interactions play a minor role in this dataset. From a broader perspective, our work results in a collection of highly flexible models and in a rethinking of core concepts of machine learning. Consequently, we open a range of novel perspectives both in terms of real-world applications and in terms of technical contributions to machine learning.

data mining, information retrieval, machine learning, (26 more...)

2209.08026

Country:

South America > Brazil (0.14)
Europe > France > Île-de-France > Paris > Paris (0.13)
Europe > Germany (0.04)
(42 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)

Industry:

Media > News (1.00)
Media > Music (1.00)
Leisure & Entertainment (1.00)
(3 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
(9 more...)

ProjB: An Improved Bilinear Biased ProjE model for Knowledge Graph Completion

Moattari, Mojtaba, Vahdati, Sahar, Zulkernine, Farhana

arXiv.org Artificial Intelligence, Sep-15-2022

Knowledge Graph Embedding (KGE) methods have gained enormous attention from a wide range of AI communities, including Natural Language Processing (NLP), for text generation, classification and context induction. Embedding a huge number of inter-relationships in terms of a small number of dimensions requires proper modeling in both cognitive and computational aspects. Recently, numerous objective functions regarding cognitive and computational aspects of natural languages have been developed, among which are the state-of-the-art methods of linearity, bilinearity, manifold-preserving kernels, projection-subspace, and analogical inference. However, the major challenge of such models lies in their loss functions, which associate the dimension of relation embeddings with the corresponding entity dimension. This leads to inaccurate prediction of corresponding relations among entities when counterparts are estimated wrongly. ProjE KGE, published by Bordes et al., due to low computational complexity and high potential for model improvement, is improved in this work regarding all translative and bilinear interactions while capturing entity nonlinearity. Experimental results on benchmark Knowledge Graphs (KGs) such as FB15K and WN18 show that the proposed approach outperforms the state-of-the-art models in the entity prediction task using linear and bilinear methods and other recent powerful ones. In addition, a parallel processing structure is proposed for the model in order to improve scalability on large KGs. The effects of different adaptive clustering and newly proposed sampling approaches are also explained, which prove to be effective in improving the accuracy of knowledge graph completion.

machine learning, natural language, relation, (20 more...)

2209.0239

Country:

Europe > Germany > Saxony > Leipzig (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > Canada > Ontario > Kingston (0.04)

Genre: Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

On Language Clustering: A Non-parametric Statistical Approach

Chattopadhyay, Anagh, Ghosh, Soumya Sankar, Karmakar, Samir

arXiv.org Artificial Intelligence, Sep-14-2022

Any approach aimed at pasteurizing and quantifying a particular phenomenon must include the use of robust statistical methodologies for data analysis. With this in mind, the purpose of this study is to present statistical approaches that may be employed in nonparametric nonhomogeneous data frameworks, as well as to examine their application in the field of natural language processing and language clustering. Furthermore, this paper discusses the many uses of nonparametric approaches in linguistic data mining and processing. The data depth idea allows for the centre-outward ordering of points in any dimension, resulting in a new nonparametric multivariate statistical analysis that does not require any distributional assumptions. The concept of hierarchy is used in historical language categorisation and structuring, and it aims to organise and cluster languages into subfamilies using the same premise. In this regard, the current study presents a novel approach to language family structuring based on non-parametric approaches produced from a typological structure of words in various languages, which is then converted into a Cartesian framework using MDS. This statistical-depth-based architecture allows for the use of data-depth-based methodologies for robust outlier detection, which is extremely useful in understanding the categorization of diverse borderline languages and allows for the re-evaluation of existing classification systems. Other depth-based approaches are also applied to processes such as unsupervised and supervised clustering. This paper therefore provides an overview of procedures that can be applied to nonhomogeneous language classification systems in a nonparametric framework.

data mining, machine learning, natural language, (18 more...)

doi: 10.1007/978-3-031-27609-5_4

2209.0672

Country:

North America > United States (0.14)
Asia > India > West Bengal > Kolkata (0.04)
Asia > India > Madhya Pradesh > Bhopal (0.04)
(2 more...)

Genre:

Research Report (1.00)
Overview (0.74)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.95)
