AITopics

2303.14581

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Consumer Health (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Tran, Le-Anh, Park, Dong-Chul

Feature Embedding Clustering using POCS-based Clustering Algorithm

arXiv.org Artificial Intelligence, Mar-25-2023

An application of the POCS-based clustering algorithm (POCS stands for Projection Onto Convex Set), a novel clustering technique, for feature embedding clustering problems is proposed in this paper. The POCS-based clustering algorithm applies the POCS's convergence property to clustering problems and has shown competitive performance when compared with that of other classical clustering schemes in terms of clustering error and execution speed. Specifically, the POCS-based clustering algorithm treats each data point as a convex set and applies a parallel projection operation from every cluster prototype to corresponding data members in order to minimize the objective function and update the prototypes. The experimental results on the synthetic embedding datasets extracted from the 5 Celebrity Faces and MNIST datasets show that the POCS-based clustering algorithm can perform with favorable results when compared with those of other classical clustering schemes such as the K-Means and Fuzzy C-Means algorithms in feature embedding clustering problems.

algorithm, artificial intelligence, machine learning

2305.00001

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Russia (0.04)
Asia > South Korea (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

arXiv.org Machine Learning, Mar-25-2023

clusterBMA: Bayesian model averaging for clustering

Forbes, Owen, Santos-Fernandez, Edgar, Wu, Paul Pao-Yen, Xie, Hong-Bo, Schwenn, Paul E., Lagopoulos, Jim, Mills, Lia, Sacks, Dashiell D., Hermens, Daniel F., Mengersen, Kerrie

Various methods have been developed to combine inference across multiple sets of results for unsupervised clustering, within the ensemble clustering literature. The approach of reporting results from one 'best' model out of several candidate clustering models generally ignores the uncertainty that arises from model selection, and results in inferences that are sensitive to the particular model and parameters chosen. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers some attractive benefits in this setting, including probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty. In this work we introduce clusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use clustering internal validation criteria to develop an approximation of the posterior model probability, used for weighting the results from each model. From a consensus matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorisation to calculate final probabilistic cluster allocations. In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features including probabilistic allocation to averaged clusters, combining allocation probabilities from 'hard' and 'soft' clustering algorithms, and measuring model-based uncertainty in averaged cluster allocation. This method is implemented in an accompanying R package of the same name.
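The weighted consensus matrix mentioned in the abstract can be illustrated with a minimal sketch. This is not the clusterBMA R package: the helper names `one_hot` and `weighted_consensus` are hypothetical, each model's pairwise similarity is assumed to be the product of its allocation matrix with its own transpose, and the model weights are taken as given inputs rather than approximated from internal validation criteria. The matrix factorisation step is not shown.

```python
import numpy as np

def one_hot(labels):
    """Turn hard cluster labels into an (n_points, n_clusters) allocation matrix."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    return (labels[:, None] == classes[None, :]).astype(float)

def weighted_consensus(allocations, weights):
    """Weighted average of per-model similarity matrices A @ A.T.

    allocations: list of (n_points, n_clusters) matrices, one per model
                 (assumed form; rows are hard or soft allocations).
    weights:     one non-negative weight per model (assumed to be given).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return sum(w * (A @ A.T) for w, A in zip(weights, allocations))

# Two hard clusterings of the same three points, with illustrative weights.
A1 = one_hot([0, 0, 1])
A2 = one_hot([0, 1, 1])
consensus = weighted_consensus([A1, A2], [0.75, 0.25])
```

For hard clusterings, each entry of `consensus` is the total weight of the models that place the two points in the same cluster.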

artificial intelligence, bayesian inference, machine learning


2209.04117

Country:

Europe > Austria > Vienna (0.14)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > New York (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

#artificialintelligence, Mar-24-2023, 15:30:43 GMT

POCS-based Clustering Algorithm Explained

Cluster analysis (or clustering) is a data analysis technique that explores and groups a set of vectors (or data points) in such a way that vectors in the same cluster are more similar to one another than to those in other clusters. Clustering algorithms are widely used in numerous applications, e.g., data analysis, pattern recognition, and image processing. This article reviews a new clustering algorithm based on the method of Projection onto Convex Sets (POCS), called the POCS-based clustering algorithm. The original paper was presented at IWIS2022, and the source code has been released on GitHub. A convex set is a set of data points in which the line segment connecting any two points x1 and x2 in the set lies entirely within the set.
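The convexity condition above can be written compactly as the standard textbook definition. The symbols C (the set) and λ (the interpolation parameter) are introduced here for illustration and do not appear in the excerpt:

```latex
C \text{ is convex} \iff
\lambda x_1 + (1 - \lambda)\, x_2 \in C
\quad \text{for all } x_1, x_2 \in C \text{ and all } \lambda \in [0, 1].
```

For example, a single point and a closed ball are both convex, while a ring (a ball with its centre removed) is not.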

algorithm, convex, projection


Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Scaling Expert Language Models with Unsupervised Domain Discovery

Gururangan, Suchin, Li, Margaret, Lewis, Mike, Shi, Weijia, Althoff, Tim, Smith, Noah A., Zettlemoyer, Luke

Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training large language models.

large language model, machine learning, scaling expert language model

2303.14177

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Education (0.46)
Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Agerskov, Jimmi, Nielsen, Kristian, Lillelund, Christian Marius, Pedersen, Christian Fischer

Computationally Efficient Labeling of Cancer Related Forum Posts by Non-Clinical Text Information Retrieval

An abundance of information about cancer exists online, but categorizing and extracting useful information from it is difficult. Almost all research within healthcare data processing is concerned with formal clinical data, but there is valuable information in non-clinical data too. The present study combines methods within distributed computing, text retrieval, clustering, and classification into a coherent and computationally efficient system that can clarify cancer patient trajectories based on non-clinical and freely available information. We produce a fully functional prototype that can retrieve, cluster, and present information about cancer trajectories from non-clinical forum posts. We evaluate three clustering algorithms (MR-DBSCAN, DBSCAN, and HDBSCAN) and compare them in terms of Adjusted Rand Index and total run time as a function of the number of posts retrieved and the neighborhood radius. Clustering results show that neighborhood radius has the most significant impact on clustering performance. For small values, the data set is split accordingly, but high values produce a large number of possible partitions, and searching for the best partition is therefore time-consuming. With a properly estimated radius, MR-DBSCAN can cluster 50000 forum posts in 46.1 seconds, compared to DBSCAN (143.4 seconds) and HDBSCAN (282.3 seconds). We conduct an interview with the Danish Cancer Society and present our software prototype. The organization sees potential in software that can democratize online information about cancer and foresees that such systems will be required in the future.

data mining, machine learning, springer nature 2021

2303.16766

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.46)
Oceania > Australia (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)

Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

Gaido, Marco

In the big data era, the key feature that each algorithm needs to have is the possibility of efficiently running in parallel in a distributed environment. The popular Silhouette metric to evaluate the quality of a clustering, unfortunately, does not have this property and has a quadratic computational complexity with respect to the size of the input dataset. For this reason, its execution has been hindered in big data scenarios, where clustering had to be evaluated otherwise. To fill this gap, in this paper we introduce the first algorithm that computes the Silhouette metric with linear complexity and can easily execute in parallel in a distributed environment. Its implementation is freely available in the Apache Spark ML library.
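To illustrate where the quadratic cost comes from, here is a minimal sketch of the textbook Silhouette computation. It is not the paper's distributed algorithm and not the Apache Spark implementation. The function name `naive_silhouette` is hypothetical, Euclidean distance is assumed, and the input is assumed to contain at least two clusters and no cluster made up entirely of identical points.

```python
import numpy as np

def naive_silhouette(X, labels):
    """Mean Silhouette score computed from the full pairwise distance matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    # Full n x n Euclidean distance matrix: this is the quadratic step.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # common convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        # (the zero self-distance is in the sum, so divide by n_own - 1).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Four points forming two tight, well-separated pairs.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = naive_silhouette(X, [0, 0, 1, 1])  # pairs grouped correctly
bad = naive_silhouette(X, [0, 1, 0, 1])   # pairs split across clusters
```

Because every point is compared with every other point, both time and memory grow with the square of the dataset size, which is the limitation the abstract describes.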

artificial intelligence, data mining, machine learning

2303.14102

Country: Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)

Local Clustering in Contextual Multi-Armed Bandits

Ban, Yikun, He, Jingrui

We study identifying user clusters in contextual multi-armed bandits (MAB). Contextual MAB is an effective tool for many real applications, such as content recommendation and online advertisement. In practice, user dependency plays an essential role in the user's actions, and thus the rewards. Clustering similar users can improve the quality of reward estimation, which in turn leads to more effective content recommendation and targeted advertising. Different from traditional clustering settings, we cluster users based on the unknown bandit parameters, which will be estimated incrementally. In particular, we define the problem of cluster detection in contextual MAB, and propose a bandit algorithm, LOCB, embedded with a local clustering procedure. We also provide a theoretical analysis of LOCB in terms of the correctness and efficiency of clustering and its regret bound. Finally, we evaluate the proposed algorithm from various aspects and show that it outperforms state-of-the-art baselines.

data mining, locb, machine learning

2103.00063

Country:

Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.05)
North America > United States > Illinois (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry:

Media (0.93)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Curto, Georgina, Kiritchenko, Svetlana, Nejadgholi, Isar, Fraser, Kathleen C.

The crime of being poor

The criminalization of poverty has been widely denounced as a collective bias against the most vulnerable. NGOs and international organizations claim that the poor are blamed for their situation, are more often associated with criminal offenses than the wealthy strata of society and even incur criminal offenses simply as a result of being poor. While no evidence has been found in the literature that correlates poverty and overall criminality rates, this paper offers evidence of a collective belief that associates the two concepts. This brief report measures the societal bias that correlates criminality with the poor, as compared to the rich, by using Natural Language Processing (NLP) techniques on Twitter. The paper quantifies the level of crime-poverty bias in a panel of eight different English-speaking countries. The regional differences in the association between crime and poverty cannot be justified based on different levels of inequality or unemployment, which the literature correlates to property crimes. The variation in the observed rates of crime-poverty bias for different geographic locations could be influenced by cultural factors and the tendency to overestimate the equality of opportunities and social mobility in specific countries. These results have consequences for policy-making and open a new path of research for poverty mitigation with the focus not only on the poor but on society as a whole. Acting on the collective bias against the poor would facilitate the approval of poverty reduction policies, as well as the restoration of the dignity of the persons affected.

artificial intelligence, machine learning, natural language

2303.14128

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
Africa > South Africa (0.05)
Africa > Kenya (0.05)

Genre: Research Report (0.82)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine (0.94)
Banking & Finance > Economy (0.92)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Zellinger, Michael J., Bühlmann, Peter

repliclust: Synthetic Data for Cluster Analysis

Our approach is based on data set archetypes, high-level geometric descriptions from which the user can create many different data sets, each possessing the desired geometric characteristics. The architecture of our software is modular and object-oriented, decomposing data generation into algorithms for placing cluster centers, sampling cluster shapes, selecting the number of data points for each cluster, and assigning probability distributions to clusters.

artificial intelligence, machine learning, overlap

2303.14301

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Alberta (0.14)
North America > United States > New York (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)