AITopics

doi: 10.1145/3491.039

2110.05035

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Italy > Apulia > Bari (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Information Technology > Services (0.93)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Chatalic, Antoine, Carratino, Luigi, De Vito, Ernesto, Rosasco, Lorenzo

Mean Nystr\"om Embeddings for Adaptive Compressive Learning

arXiv.org Machine LearningOct-21-2021

Compressive learning is an approach to efficient large scale learning based on sketching an entire dataset to a single mean embedding (the sketch), i.e. a vector of generalized moments. The learning task is then approximately solved as an inverse problem using an adapted parametric model. Previous works in this context have focused on sketches obtained by averaging random features, that while universal can be poorly adapted to the problem at hand. In this paper, we propose and study the idea of performing sketching based on data-dependent Nystr\"om approximation. From a theoretical perspective we prove that the excess risk can be controlled under a geometric assumption relating the parametric model used to learn from the sketch and the covariance operator associated to the task at hand. Empirically, we show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nystr\"om sketches indeed outperform those built with random features.

dataset, gaussian modeling, sketch, (15 more...)

2110.10996

Country:

Europe > Italy > Liguria > Genoa (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Adesunkanmi, Rahmat, Kumar, Ratnesh

Noise-robust Clustering

arXiv.org Machine LearningOct-19-2021

This paper presents noise-robust clustering techniques in unsupervised machine learning. The uncertainty about the noise, consistency, and other ambiguities can become severe obstacles in data analytics. As a result, data quality, cleansing, management, and governance remain critical disciplines when working with Big Data. With this complexity, it is no longer sufficient to treat data deterministically as in a classical setting, and it becomes meaningful to account for noise distribution and its impact on data sample values. Classical clustering methods group data into "similarity classes" depending on their relative distances or similarities in the underlying space. This paper addressed this problem via the extension of classical $K$-means and $K$-medoids clustering over data distributions (rather than the raw data). This involves measuring distances among distributions using two types of measures: the optimal mass transport (also called Wasserstein distance, denoted $W_2$) and a novel distance measure proposed in this paper, the expected value of random variable distance (denoted ED). The presented distribution-based $K$-means and $K$-medoids algorithms cluster the data distributions first and then assign each raw data to the cluster of data's distribution.

k-means, k-means and k-medoid, k-medoid, (16 more...)

2110.08871

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Colorado (0.04)
North America > United States > Iowa > Story County > Ames (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Candel, Gaëlle, Naccache, David

Noise-Resilient Ensemble Learning using Evidence Accumulation Clustering

arXiv.org Artificial IntelligenceOct-18-2021

Ensemble Learning methods combine multiple algorithms performing the same task to build a group with superior quality. These systems are well adapted to the distributed setup, where each peer or machine of the network hosts one algorithm and communicate its results to its peers. Ensemble learning methods are naturally resilient to the absence of several peers thanks to the ensemble redundancy. However, the network can be corrupted, altering the prediction accuracy of a peer, which has a deleterious effect on the ensemble quality. In this paper, we propose a noise-resilient ensemble classification method, which helps to improve accuracy and correct random errors. The approach is inspired by Evidence Accumulation Clustering , adapted to classification ensembles. We compared it to the naive voter model over four multi-class datasets. Our model showed a greater resilience, allowing us to recover prediction under a very high noise level. In addition as the method is based on the evidence accumulation clustering, our method is highly flexible as it can combines classifiers with different label definitions.

accuracy, boundary, classifier, (13 more...)

2110.09212

Genre: Research Report (0.64)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Vankadara, Leena Chennuru, Bordt, Sebastian, von Luxburg, Ulrike, Ghoshdastidar, Debarghya

Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models

arXiv.org Machine LearningOct-18-2021

Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that consider strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Along with theoretical implications, this connection could have practical implications, including in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.

algorithm, kernel, partition, (16 more...)

2110.09476

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Gan, Luqin, Allen, Genevera I.

Fast and Interpretable Consensus Clustering via Minipatch Learning

arXiv.org Machine LearningOct-18-2021

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as to discover cell types from single-cell sequencing data, for example, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which leads to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches.

accuracy, consensus, prognostic marker, (15 more...)

2110.02388

Country:

North America > United States > Texas > Harris County > Houston (0.05)
Asia (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.69)
Health & Medicine > Therapeutic Area > Endocrinology (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Nweye, Kingsley, Nagy, Zoltan

MARTINI: Smart Meter Driven Estimation of HVAC Schedules and Energy Savings Based on WiFi Sensing and Clustering

arXiv.org Artificial IntelligenceOct-17-2021

HVAC systems account for a significant portion of building energy use. Nighttime setback scheduling is an energy conservation measure where cooling and heating setpoints are increased and decreased respectively during unoccupied periods with the goal of obtaining energy savings. However, knowledge of a building's real occupancy is required to maximize the success of this measure. In addition, there is the need for a scalable way to estimate energy savings potential from energy conservation measures that is not limited by building specific parameters and experimental or simulation modeling investments. Here, we propose MARTINI, a sMARt meTer drIveN estImation of occupant-derived HVAC schedules and energy savings that leverages the ubiquity of energy smart meters and WiFi infrastructure in commercial buildings. We estimate the schedules by clustering WiFi-derived occupancy profiles and, energy savings by shifting ramp-up and setback times observed in typical/measured load profiles obtained by clustering smart meter energy profiles. Our case-study results with five buildings over seven months show an average of 8.1%-10.8% (summer) and 0.2%-5.9% (fall) chilled water energy savings when HVAC system operation is aligned with occupancy. We validate our method with results from building energy performance simulation (BEPS) and find that estimated average savings of MARTINI are within 0.9%-2.4% of the BEPS predictions. In the absence of occupancy information, we can still estimate potential savings from increasing ramp-up time and decreasing setback start time. In 51 academic buildings, we find savings potentials between 1%-5%.

artificial intelligence, internet of things, machine learning, (18 more...)

doi: 10.1016/j.apenergy.2022.118980

2110.08927

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Texas > Brazos County > College Station (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Energy (1.00)
Construction & Engineering > HVAC (1.00)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

arXiv.org Artificial IntelligenceOct-14-2021

Self-supervised Contrastive Attributed Graph Clustering

Xia, Wei, Gao, Quanxue, Yang, Ming, Gao, Xinbo

Attributed graph clustering, which learns node representation from node attribute and topological graph for clustering, is a fundamental but challenging task for graph analysis. Recently, methods based on graph contrastive learning (GCL) have obtained impressive clustering performance on this task. Yet, we observe that existing GCL-based methods 1) fail to benefit from imprecise clustering labels; 2) require a post-processing operation to get clustering labels; 3) cannot solve out-of-sample (OOS) problem. To address these issues, we propose a novel attributed graph clustering network, namely Self-supervised Contrastive Attributed Graph Clustering (SCAGC). In SCAGC, by leveraging inaccurate clustering labels, a self-supervised contrastive loss, which aims to maximize the similarities of intra-cluster nodes while minimizing the similarities of inter-cluster nodes, are designed for node representation learning. Meanwhile, a clustering module is built to directly output clustering labels by contrasting the representation of different clusters. Thus, for the OOS nodes, SCAGC can directly calculate their clustering labels. Extensive experimental results on four benchmark datasets have shown that SCAGC consistently outperforms 11 competitive clustering methods.

node, node representation, representation, (15 more...)

2110.08264

Country: Asia > China > Chongqing Province > Chongqing (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

De Koninck, Pieter, Nelissen, Klaas, Broucke, Seppe vanden, Baesens, Bart, Snoeck, Monique, De Weerdt, Jochen

Expert-driven Trace Clustering with Instance-level Constraints

arXiv.org Artificial IntelligenceOct-13-2021

Within the field of process mining, several different trace clustering approaches exist for partitioning traces or process instances into similar groups. Typically, this partitioning is based on certain patterns or similarity between the traces, or driven by the discovery of a process model for each cluster. The main drawback of these techniques, however, is that their solutions are usually hard to evaluate or justify by domain experts. In this paper, we present two constrained trace clustering techniques that are capable to leverage expert knowledge in the form of instance-level constraints. In an extensive experimental evaluation using two real-life datasets, we show that our novel techniques are indeed capable of producing clustering solutions that are more justifiable without a substantial negative impact on their quality.

cannot-link constraint, constraint, process model, (13 more...)

doi: 10.1007/s10115-021-01548-6

2110.06703

Country:

Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)
Europe > United Kingdom > England > Hampshire > Southampton (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(5 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Silva, Guilherme D. F., Akoglu, Leman, Cordeiro, Robson L. F.

C-AllOut: Catching & Calling Outliers by Type

arXiv.org Artificial IntelligenceOct-13-2021

Given an unlabeled dataset, wherein we have access only to pairwise similarities (or distances), how can we effectively (1) detect outliers, and (2) annotate/tag the outliers by type? Outlier detection has a large literature, yet we find a key gap in the field: to our knowledge, no existing work addresses the outlier annotation problem. Outliers are broadly classified into 3 types, representing distinct patterns that could be valuable to analysts: (a) global outliers are severe yet isolate cases that do not repeat, e.g., a data collection error; (b) local outliers diverge from their peers within a context, e.g., a particularly short basketball player; and (c) collective outliers are isolated micro-clusters that may indicate coalition or repetitions, e.g., frauds that exploit the same loophole. This paper presents C-AllOut: a novel and effective outlier detector that annotates outliers by type. It is parameter-free and scalable, besides working only with pairwise similarities (or distances) when it is needed. We show that C-AllOut achieves on par or significantly better performance than state-of-the-art detectors when spotting outliers regardless of their type. It is also highly effective in annotating outliers of particular types, a task that none of the baselines can perform.

c-allout, collective outlier, outlier, (14 more...)

2110.08257

Country:

South America > Brazil > São Paulo (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)