Clustering
Appliance Operation Modes Identification Using Cycles Clustering
Jaradat, Abdelkareem, Lutfiyya, Hanan, Haque, Anwar
The increasing cost, energy demand, and environmental issues has led many researchers to find approaches for energy monitoring, and hence energy conservation. The emerging technologies of Internet of Things (IoT) and Machine Learning (ML) deliver techniques that have the potential to efficiently conserve energy and improve the utilization of energy consumption. Smart Home Energy Management Systems (SHEMSs) have the potential to contribute in energy conservation through the application of Demand Response (DR) in the residential sector. In this paper, we propose appliances Operation Modes Identification using Cycles Clustering (OMICC) which is SHEMS fundamental approach that utilizes the sensed residential disaggregated power consumption in supporting DR by providing consumers the opportunity to select lighter appliance operation modes. The cycles of the Single Usage Profile (SUP) of an appliance are extracted and reformed into features in terms of clusters of cycles. These features are then used to identify the operation mode used in every occurrence using K-Nearest Neighbors (KNN). Operation modes identification is considered a basis for many potential smart DR applications within SHEMS towards the consumers or the suppliers
Medical Information Retrieval and Interpretation: A Question-Answer based Interaction Model
Sinhababu, Nilanjan, Saxena, Rahul, Sarma, Monalisa, Samanta, Debasis
The Internet has become a very powerful platform where diverse medical information are expressed daily. Recently, a huge growth is seen in searches like symptoms, diseases, medicines, and many other health related queries around the globe. The search engines typically populate the result by using the single query provided by the user and hence reaching to the final result may require a lot of manual filtering from the user's end. Current search engines and recommendation systems still lack real time interactions that may provide more precise result generation. This paper proposes an intelligent and interactive system tied up with the vast medical big data repository on the web and illustrates its potential in finding medical information.
Unsupervised clustering of series using dynamic programming
Sinnathamby, Karthigan, Hou, Chang-Yu, Venkataramanan, Lalitha, Gkortsas, Vasileios-Marios, Fleuret, Franรงois
Unsupervised clustering is a branch of machine learning that aims to categorize the data based on the self-similarity. In other word, data-points in the same group (called a cluster) are more similar to each other than to those in other groups. This task can be achieved by various algorithms (the well-known K-means or spectral clustering but also hierarchical clustering [1] or density-based clustering [2]) that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. In many cases, there exist models/functions, governed by a finite set of parameters, providing either physics or phenomenology correlations between input data. The presence of these models can in principle be used to characterize clusters (cluster characterization) because one can define a loss function to measure how well a point belongs to this cluster (cluster affiliation).
ARTH: Algorithm For Reading Text Handily -- An AI Aid for People having Word Processing Issues
Malhotra, Akanksha, Kamle, Sudhir
The objective of this project is to solve one of the major problems faced by the people having word processing issues like trauma, or mild mental disability. "ARTH" is the short form of Algorithm for Reading Handily. ARTH is a self-learning set of algorithms that is an intelligent way of fulfilling the need for "reading and understanding the text effortlessly" which adjusts according to the needs of every user. The research project propagates in two steps. In the first step, the algorithm tries to identify the difficult words present in the text based on two features -- the number of syllables and usage frequency -- using a clustering algorithm. After the analysis of the clusters, the algorithm labels these clusters, according to their difficulty level. In the second step, the algorithm interacts with the user. It aims to test the user's comprehensibility of the text and his/her vocabulary level by taking an automatically generated quiz. The algorithm identifies the clusters which are difficult for the user, based on the result of the analysis. The meaning of perceived difficult words is displayed next to them. The technology "ARTH" focuses on the revival of the joy of reading among those people, who have a poor vocabulary or any word processing issues.
GPU-Accelerated Optimizer-Aware Evaluation of Submodular Exemplar Clustering
Honysz, Philipp-Jan, Buschjรคger, Sebastian, Morik, Katharina
The optimization of submodular functions constitutes a viable way to perform clustering. Strong approximation guarantees and feasible optimization w.r.t. streaming data make this clustering approach favorable. Technically, submodular functions map subsets of data to real values, which indicate how "representative" a specific subset is. Optimal sets might then be used to partition the data space and to infer clusters. Exemplar-based clustering is one of the possible submodular functions, but suffers from high computational complexity. However, for practical applications, the particular real-time or wall-clock run-time is decisive. In this work, we present a novel way to evaluate this particular function on GPUs, which keeps the necessities of optimizers in mind and reduces wall-clock run-time. To discuss our GPU algorithm, we investigated both the impact of different run-time critical problem properties, like data dimensionality and the number of data points in a subset, and the influence of required floating-point precision. In reproducible experiments, our GPU algorithm was able to achieve competitive speedups of up to 72x depending on whether multi-threaded computation on CPUs was used for comparison and the type of floating-point precision required. Half-precision GPU computation led to large speedups of up to 452x compared to single-precision, single-thread CPU computations.
Leveraging AI to optimize website structure discovery during Penetration Testing
Antonelli, Diego, Cascella, Roberta, Perrone, Gaetano, Romano, Simon Pietro, Schiano, Antonio
Dirbusting is a technique used to brute force directories and file names on web servers while monitoring HTTP responses, in order to enumerate server contents. Such a technique uses lists of common words to discover the hidden structure of the target website. Dirbusting typically relies on response codes as discovery conditions to find new pages. It is widely used in web application penetration testing, an activity that allows companies to detect websites vulnerabilities. Dirbusting techniques are both time and resource consuming and innovative approaches have never been explored in this field. We hence propose an advanced technique to optimize the dirbusting process by leveraging Artificial Intelligence. More specifically, we use semantic clustering techniques in order to organize wordlist items in different groups according to their semantic meaning. The created clusters are used in an ad-hoc implemented next-word intelligent strategy. This paper demonstrates that the usage of clustering techniques outperforms the commonly used brute force methods. Performance is evaluated by testing eight different web applications. Results show a performance increase that is up to 50% for each of the conducted experiments.
CaEGCN: Cross-Attention Fusion based Enhanced Graph Convolutional Network for Clustering
Huo, Guangyu, Zhang, Yong, Gao, Junbin, Wang, Boyue, Hu, Yongli, Yin, Baocai
With the powerful learning ability of deep convolutional networks, deep clustering methods can extract the most discriminative information from individual data and produce more satisfactory clustering results. However, existing deep clustering methods usually ignore the relationship between the data. Fortunately, the graph convolutional network can handle such relationship, opening up a new research direction for deep clustering. In this paper, we propose a cross-attention based deep clustering framework, named Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN), which contains four main modules: the cross-attention fusion module which innovatively concatenates the Content Auto-encoder module (CAE) relating to the individual data and Graph Convolutional Auto-encoder module (GAE) relating to the relationship between the data in a layer-by-layer manner, and the self-supervised model that highlights the discriminative information for clustering tasks. While the cross-attention fusion module fuses two kinds of heterogeneous representation, the CAE module supplements the content information for the GAE module, which avoids the over-smoothing problem of GCN. In the GAE module, two novel loss functions are proposed that reconstruct the content and relationship between the data, respectively. Finally, the self-supervised module constrains the distributions of the middle layer representations of CAE and GAE to be consistent. Experimental results on different types of datasets prove the superiority and robustness of the proposed CaEGCN.
Interactive slice visualization for exploring machine learning models
Hurley, Catherine B., O'Connell, Mark, Domijan, Katarina
Machine learning models fit complex algorithms to arbitrarily large datasets. These algorithms are well-known to be high on performance and low on interpretability. We use interactive visualization of slices of predictor space to address the interpretability deficit; in effect opening up the black-box of machine learning algorithms, for the purpose of interrogating, explaining, validating and comparing model fits. Slices are specified directly through interaction, or using various touring algorithms designed to visit high-occupancy sections or regions where the model fits have interesting properties. The methods presented here are implemented in the R package \pkg{condvis2}.
Multi-view Data Visualisation via Manifold Learning
Rodosthenous, Theodoulos, Shahrezaei, Vahid, Evangelou, Marina
Manifold learning approaches, such as Stochastic Neighbour Embedding (SNE), Locally Linear Embedding (LLE) and Isometric Feature Mapping (ISOMAP) have been proposed for performing non-linear dimensionality reduction. These methods aim to produce two or three latent embeddings, in order to visualise the data in intelligible representations. This manuscript proposes extensions of Student's t-distributed SNE (t-SNE), LLE and ISOMAP, to allow for dimensionality reduction and subsequent visualisation of multi-view data. Nowadays, it is very common to have multiple data-views on the same samples. Each data-view contains a set of features describing different aspects of the samples. For example, in biomedical studies it is possible to generate multiple OMICS data sets for the same individuals, such as transcriptomics, genomics, epigenomics, enabling better understanding of the relationships between the different biological processes. Through the analysis of real and simulated datasets, the visualisation performance of the proposed methods is illustrated. Data visualisations have been often utilised for identifying any potential clusters in the data sets. We show that by incorporating the low-dimensional embeddings obtained via the multi-view manifold learning approaches into the K-means algorithm, clusters of the samples are accurately identified. Our proposed multi-SNE method outperforms the corresponding multi-ISOMAP and multi-LLE proposed methods. Interestingly, multi-SNE is found to have comparable performance with methods proposed in the literature for performing multi-view clustering.
Optimal Clustering in Anisotropic Gaussian Mixture Models
We study the clustering task under anisotropic Gaussian Mixture Models where the covariance matrices from different clusters are unknown and are not necessarily the identical matrix. We characterize the dependence of signal-to-noise ratios on the cluster centers and covariance matrices and obtain the minimax lower bound for the clustering problem. In addition, we propose a computationally feasible procedure and prove it achieves the optimal rate within a few iterations. The proposed procedure is a hard EM type algorithm, and it can also be seen as a variant of the Lloyd's algorithm that is adjusted to the anisotropic covariance matrices.