AITopics

2604.07143

Country: Europe > Poland (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

arXiv.org Machine LearningFeb-28-2025

DISCO: Internal Evaluation of Density-Based Clustering

Beer, Anna, Krieger, Lena, Weber, Pascal, Ritzert, Martin, Assent, Ira, Plant, Claudia

In density-based clustering, clusters are areas of high object density separated by lower object density areas. This notion supports arbitrarily shaped clusters and automatic detection of noise points that do not belong to any cluster. However, it is challenging to adequately evaluate the quality of density-based clustering results. Even though some existing cluster validity indices (CVIs) target arbitrarily shaped clusters, none of them captures the quality of the labeled noise. In this paper, we propose DISCO, a Density-based Internal Score for Clustering Outcomes, which is the first CVI that also evaluates the quality of noise labels. DISCO reliably evaluates density-based clusters of arbitrary shape by assessing compactness and separation. It also introduces a direct assessment of noise labels for any given clustering. Our experiments show that DISCO evaluates density-based clusterings more consistently than its competitors. It is additionally the first method to evaluate the complete labeling of density-based clustering methods, including noise labels.

dataset, disco, noise, (13 more...)

2503.00127

Country:

Europe > Austria > Vienna (0.14)
North America > United States (0.14)
Europe > Germany (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine LearningOct-25-2023

A framework for benchmarking clustering algorithms

Gagolewski, Marek

The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .

artificial intelligence, dataset, machine learning, (16 more...)

doi: 10.1016/j.softx.2022.101270

2209.09493

Country:

Europe > Poland > Masovia Province > Warsaw (0.05)
Oceania > Australia (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre:

Research Report (0.65)
Instructional Material > Course Syllabus & Notes (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Li, Weihao, Dodwell, Emily, Cook, Dianne

A Clustering Algorithm to Organize Satellite Hotspot Data for the Purpose of Tracking Bushfires Remotely

arXiv.org Artificial IntelligenceAug-21-2023

The 2019-2020 Australia bushfire season was catastrophic in the scale of damage caused to agricultural resources, property, infrastructure, and ecological systems. By the end of 2020, the devastation attributable to these Black Summer fires totalled 33 lives lost, almost 19 million hectares of land burned, over 3,000 homes destroyed and AUD $1.7 billion in insurance losses, as well as an estimated 1 billion animals killed, including half of Kangaroo Island's population of koalas (Filkov et al. 2020). According to the Australian Government Bureau of Meteorology (2021), 2019 was the warmest year on record in Australia, and the period from 2013-2020 represents eight of the ten warmest years in recorded history. There is concern and expectation that impacts of climate change - including more extreme temperatures, persistent drought, and changes in plant growth and landscape drying - will worsen conditions for extreme bushfires (CSIRO and Australian Government Bureau of Meteorology 2020; Deb et al. 2020). Contributing to the problem is that dry lightning represents the main source of natural ignition, and fires that start in remote areas deep in the temperate forests are difficult to access and monitor (Abram et al. 2021). Therefore, opportunities to detect fire ignitions, monitor bushfire spread, and understand movement patterns in remote areas are important for developing effective strategies to mitigate bushfire impact.

artificial intelligence, data mining, machine learning, (19 more...)

2308.10505

Country:

Europe > Austria > Vienna (0.14)
Oceania > Australia > Victoria (0.04)
South America > Brazil (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Government > Regional Government > Oceania Government > Australia Government (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial IntelligenceJul-31-2023

Detecting the Anomalies in LiDAR Pointcloud

Zhang, Chiyu, Han, Ji, Zou, Yao, Dong, Kexin, Li, Yujia, Ding, Junchun, Han, Xiaoling

LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR is generating anomalous pointcloud by analyzing the pointcloud characteristics. Specifically, we develop a pointcloud quality metric based on the LiDAR points' spatial and intensity distribution to characterize the noise level of the pointcloud, which relies on pure mathematical analysis and does not require any labeling or training as learning-based methods do. Therefore, the method is scalable and can be quickly deployed either online to improve the autonomy safety by monitoring anomalies in the LiDAR data or offline to perform in-depth study of the LiDAR behavior over large amount of data. The proposed approach is studied with extensive real public road data collected by LiDARs with different scanning mechanisms and laser spectrums, and is proven to be able to effectively handle various known and unknown sources of pointcloud anomaly.

artificial intelligence, machine learning, pointcloud, (14 more...)

2308.00187

Country: North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Road (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.88)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.49)

arXiv.org Artificial IntelligenceMay-4-2023

Influence of various text embeddings on clustering performance in NLP

Saha, Rohan

With the advent of e-commerce platforms, reviews are crucial for customers to assess the credibility of a product. The star ratings do not always match the review text written by the customer. For example, a three star rating (out of five) may be incongruous with the review text, which may be more suitable for a five star review. A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups. In this work, we explore the task of choosing different text embeddings to represent these reviews and also explore the impact the embedding choice has on the performance of various classes of clustering algorithms. We use contextual (BERT) and non-contextual (Word2Vec) text embeddings to represent the text and measure their impact of three classes on clustering algorithms - partitioning based (KMeans), single linkage agglomerative hierarchical, and density based (DBSCAN and HDBSCAN), each with various experimental settings. We use the silhouette score, adjusted rand index score, and cluster purity score metrics to evaluate the performance of the algorithms and discuss the impact of different embeddings on the clustering performance. Our results indicate that the type of embedding chosen drastically affects the performance of the algorithm, the performance varies greatly across different types of clustering algorithms, no embedding type is better than the other, and DBSCAN outperforms KMeans and single linkage agglomerative clustering but also labels more data points as outliers. We provide a thorough comparison of the performances of different algorithms and provide numerous ideas to foster further research in the domain of text clustering.

artificial intelligence, data mining, machine learning, (17 more...)

2305.03144

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Artificial IntelligenceJan-12-2023

Detecting Change Intervals with Isolation Distributional Kernel

Cao, Yang, Zhu, Ye, Ting, Kai Ming, Salim, Flora D., Li, Hong Xian, Li, Gang

Detecting abrupt changes in data distribution is one of the most significant tasks in streaming data analysis. Although many unsupervised Change-Point Detection (CPD) methods have been proposed recently to identify those changes, they still suffer from missing subtle changes, poor scalability, or/and sensitive to noise points. To meet these challenges, we are the first to generalise the CPD problem as a special case of the Change-Interval Detection (CID) problem. Then we propose a CID method, named iCID, based on a recent Isolation Distributional Kernel (IDK). iCID identifies the change interval if there is a high dissimilarity score between two non-homogeneous temporal adjacent intervals. The data-dependent property and finite feature map of IDK enabled iCID to efficiently identify various types of change points in data streams with the tolerance of noise points. Moreover, the proposed online and offline versions of iCID have the ability to optimise key parameter settings. The effectiveness and efficiency of iCID have been systematically verified on both synthetic and real-world datasets.

change point, data mining, machine learning, (20 more...)

2212.1463

Country:

Oceania > Australia > New South Wales (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligenceAug-17-2021, 00:10:08 GMT

Fully Explained OPTICS Clustering with Python Example

As we know that Clustering is a powerful unsupervised knowledge discovery tool used nowadays to segment our data points into groups of similar features types. However, each algorithm of clustering works according to the parameters. Similarity-based techniques (K-means clustering algorithm working is based on similarity of the data points and is tasked with designating how many clusters are available, while hierarchical clustering algorithms decide when to assign finished clusters manually. Generally used density-based clustering technique is DBSCAN which requires two parameters about how it defines its Core Points, but finding the parameters is an extremely difficult task. DBSCAN's relatively algorithm is called OPTICS (Ordering Points to Identify Cluster Structure).

dbscan, optic, reachability distance, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine LearningMar-23-2021

The Success of AdaBoost and Its Application in Portfolio Management

Chuan, Yijian, Zhao, Chaoyi, He, Zhenrui, Wu, Lan

Equal-weighted portfolios are one of the most important strategies in portfolio management. They are portfolios with weights equally distributed across the selected securities in the long and/or short positions. In academic research, numerous studies have suggested that equal-weighted portfolios have a better out-of-sample performance than other portfolios (e.g., Jobson and Korkie 1981; James 2003; DeMiguel et al. 2007). Michaud (1989) and DeMiguel et al. (2007) argued that, the equal-weighted strategies do not suffer from the estimation error of the covariance matrix, which is vulnerable to outliers (Tu and Zhou, 2011). In industry, equal-weighted portfolios are popular across portfolio management in practice, particularly in the hedge funds. The MSCI has issued many equal-weighted indexes, which are "some of the oldest and best-known factor strategies that have aimed to identify specific characteristics of stocks generating excess return"

adaboost, classifier, noise point, (13 more...)

2103.12345

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report > New Finding (0.87)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.68)

arXiv.org Machine LearningSep-17-2020

LAAT: Locally Aligned Ant Technique for detecting manifolds of varying density

Taghribi, Abolfazl, Bunte, Kerstin, Smith, Rory, Shin, Jihye, Mastropietro, Michele, Peletier, Reynier F., Tino, Peter

Dimensionality reduction and clustering are often used as preliminary steps for many complex machine learning tasks. The presence of noise and outliers can deteriorate the performance of such preprocessing and therefore impair the subsequent analysis tremendously. In manifold learning, several studies indicate solutions for removing background noise or noise close to the structure when the density is substantially higher than that exhibited by the noise. However, in many applications, including astronomical datasets, the density varies alongside manifolds that are buried in a noisy background. We propose a novel method to extract manifolds in the presence of noise based on the idea of Ant colony optimization. In contrast to the existing random walk solutions, our technique captures points which are locally aligned with major directions of the manifold. Moreover, we empirically show that the biologically inspired formulation of ant pheromone reinforces this behavior enabling it to recover multiple manifolds embedded in extremely noisy data clouds. The algorithm's performance is demonstrated in comparison to the state-of-the-art approaches, such as Markov Chain, LLPD, and Disperse, on several synthetic and real astronomical datasets stemming from an N-body simulation of a cosmological volume.

dataset, manifold, pheromone, (14 more...)

2009.08326

Country:

Europe > Netherlands > Groningen (0.04)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(4 more...)

Genre: Research Report > Promising Solution (0.54)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)