 Clustering


ClusterNet: Semi-Supervised Clustering using Neural Networks

arXiv.org Machine Learning

Clustering using neural networks has recently demonstrated promising performance in machine learning and computer vision applications. However, the performance of current approaches is limited either by unsupervised learning or by their dependence on a large set of labeled data samples. In this paper, we propose ClusterNet, which uses pairwise semantic constraints from very few labeled data samples (< 5% of total data) and exploits the abundant unlabeled data to drive the clustering approach. We define a new loss function that uses pairwise semantic similarity between objects combined with constrained k-means clustering to efficiently utilize both labeled and unlabeled data in the same framework. The proposed network uses a convolutional autoencoder to learn a latent representation that groups data into k specified clusters, while also learning the cluster centers simultaneously. We evaluate ClusterNet on several datasets and compare its performance with state-of-the-art deep clustering approaches.
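
As a rough illustration of how a k-means term and pairwise constraints can share one objective, here is a minimal sketch. It is not the ClusterNet loss: the contrastive-style penalty, the `margin` and `weight` parameters, and all function names are assumptions made for illustration, and the points stand in for latent vectors that the paper obtains from an autoencoder.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_term(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def pairwise_term(points, must_link, cannot_link, margin=1.0):
    """Hypothetical constraint penalty (an assumption, not the paper's loss).

    Must-link pairs are pulled together; cannot-link pairs are penalized
    only while their squared distance is below `margin`.
    """
    loss = sum(sq_dist(points[i], points[j]) for i, j in must_link)
    loss += sum(max(0.0, margin - sq_dist(points[i], points[j]))
                for i, j in cannot_link)
    return loss


def combined_loss(points, centers, must_link, cannot_link,
                  weight=1.0, margin=1.0):
    """Unlabeled data enters via the k-means term, labeled pairs via the penalty."""
    return (kmeans_term(points, centers)
            + weight * pairwise_term(points, must_link, cannot_link, margin))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
centers = [(0.0, 0.5), (5.0, 5.0)]
loss = combined_loss(points, centers, must_link=[(0, 1)], cannot_link=[(0, 2)])
```

In a real model both the latent points and the centers would be updated by gradient descent; the sketch only evaluates the objective for fixed values.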


Variable Selection Methods for Model-based Clustering

arXiv.org Machine Learning

Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.


Automatic Clustering of a Network Protocol with Weakly-Supervised Clustering

arXiv.org Machine Learning

Abstraction is a fundamental part of learning behavioral models of systems. Usually, the process of abstraction is manually defined by domain experts. This paper presents a method to perform automatic abstraction for network protocols. In particular, a weakly supervised clustering algorithm is used to build an abstraction with a small vocabulary size for the widely used TLS protocol. To show the effectiveness of the proposed method, we compare the resultant abstract messages to a manually constructed (reference) abstraction. With a small amount of side-information in the form of a few labeled examples, this method finds an abstraction that matches the reference abstraction perfectly.


Similarity encoding for learning with dirty categorical variables

arXiv.org Machine Learning

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately as feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
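
The idea of similarity encoding can be sketched in a few lines. The version below assumes that 3-gram similarity means the Jaccard overlap of character 3-gram sets and that strings are lowercased and padded with spaces; those choices, the function names, and the example categories are illustrative assumptions rather than the authors' implementation.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased, space-padded string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (in [0, 1])."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, prototypes, n=3):
    """One row per value; column j holds the similarity to prototypes[j]."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


def one_hot_encode(values, prototypes):
    """Classic encoding: 1.0 only on an exact match."""
    return [[1.0 if v == p else 0.0 for p in prototypes] for v in values]


prototypes = ["police officer", "police officr", "teacher"]
sim_row = similarity_encode(["police officer"], prototypes)[0]
hot_row = one_hot_encode(["police officer"], prototypes)[0]
```

With one-hot encoding the misspelled category "police officr" is as far from "police officer" as "teacher" is; with similarity encoding it receives a high but not perfect score, which is how the redundancy becomes visible to the learner. Passing a smaller `prototypes` list corresponds to the prototype-subset strategy mentioned in the abstract.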


Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

arXiv.org Artificial Intelligence

A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers. GVQA is built off an existing VQA model -- Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.


How Much Are You Willing to Share? A "Poker-Styled" Selective Privacy Preserving Framework for Recommender Systems

arXiv.org Machine Learning

Most industrial recommender systems rely on the popular collaborative filtering (CF) technique for providing personalized recommendations to their users. However, the very nature of CF is adversarial to the idea of user privacy, because users need to share their preferences with others in order to be grouped with like-minded people and receive accurate recommendations. Prior related work has proposed to preserve user privacy in a CF framework through different means, such as (i) random data obfuscation using differential privacy techniques, (ii) relying on decentralized trusted peer networks, or (iii) adopting secure cryptographic strategies. While these approaches have been successful inasmuch as they concealed user preference information to some extent from a centralized recommender system, they have nevertheless incurred significant tradeoffs in terms of privacy, scalability, and accuracy. They are also vulnerable to privacy breaches by malicious actors. In light of these observations, we propose a novel selective privacy preserving (SP2) paradigm that allows users to custom-define the scope and extent of their individual privacy by marking their personal ratings as either public (which can be shared) or private (which are never shared and stored only on the user device). Our SP2 framework works in two steps: (i) first, it builds an initial recommendation model based on the sum of all public ratings that have been shared by users, and (ii) then, this public model is fine-tuned on each user's device based on the user's private ratings, thus eventually learning a more accurate model. Furthermore, in this work, we introduce three different algorithms for implementing an end-to-end SP2 framework that can scale effectively from thousands to hundreds of millions of items.
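
The two-step flow described above can be illustrated with a deliberately simple stand-in. In this sketch the "public model" is just a per-item average and the on-device "fine-tuning" is a single per-user offset; both are assumptions chosen for brevity and are not any of the three SP2 algorithms from the paper. All names are hypothetical.

```python
from collections import defaultdict


def build_public_model(public_ratings):
    """Server side: average rating per item from shared (public) ratings.

    public_ratings is a list of (user, item, rating) tuples.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for _user, item, rating in public_ratings:
        totals[item] += rating
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


def fine_tune_on_device(public_model, private_ratings):
    """Device side: shift the public scores by the user's mean residual.

    private_ratings maps item -> rating and is only ever read locally.
    """
    residuals = [rating - public_model[item]
                 for item, rating in private_ratings.items()
                 if item in public_model]
    offset = sum(residuals) / len(residuals) if residuals else 0.0
    return {item: score + offset for item, score in public_model.items()}


public = [("u1", "a", 4.0), ("u2", "a", 2.0), ("u1", "b", 5.0)]
public_model = build_public_model(public)
personal_model = fine_tune_on_device(public_model, {"a": 4.0})
```

The point of the split is that `fine_tune_on_device` needs only the downloaded public model and the local private ratings, so nothing marked private is sent to the server.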


Effect of antipsychotics on community structure in functional brain networks

arXiv.org Machine Learning

Schizophrenia, a mental disorder that is characterized by abnormal social behavior and failure to distinguish one's own thoughts and ideas from reality, has been associated with structural abnormalities in the architecture of functional brain networks. Using various methods from network analysis, we examine the effect of two classical therapeutic antipsychotics (Aripiprazole and Sulpiride) on the structure of functional brain networks of healthy controls and patients who have been diagnosed with schizophrenia. We compare the community structures of functional brain networks of different individuals using mesoscopic response functions, which measure how community structure changes across different scales of a network. We are able to do a reasonably good job of distinguishing patients from controls, and we are most successful at this task on people who have been treated with Aripiprazole. We demonstrate that this increased separation between patients and controls is related only to a change in the control group, as the functional brain networks of the patient group appear to be predominantly unaffected by this drug. This suggests that Aripiprazole has a significant and measurable effect on community structure in healthy individuals but not in individuals who are diagnosed with schizophrenia. In contrast, we find that, for individuals who are given the drug Sulpiride, it is more difficult to separate the networks of patients from those of controls. Overall, we observe differences in the effects of the drugs (and a placebo) on community structure in patients and controls and also that this effect differs across groups. We thereby demonstrate that different types of antipsychotic drugs selectively affect mesoscale structures of brain networks, providing support that mesoscale structures such as communities are meaningful functional units in the brain.


Learning and Generalizing Motion Primitives from Driving Data for Path-Tracking Applications

arXiv.org Machine Learning

Incorporating driving habits learned from naturalistic driving data into the path-tracking system can significantly improve the acceptance of intelligent vehicles. Therefore, the goal of this paper is to predict lateral commands, with confidence regions, according to the reference based on the learned motion primitives. We present a two-level structure for learning and generalizing motion primitives through demonstrations. The lower-level motion primitives are generated under the path segmentation and clustering layer in the upper level. A Gaussian Mixture Model (GMM) is utilized to represent the primitives, and Gaussian Mixture Regression (GMR) is selected to generalize the motion primitives. We show how the upper level can help to improve the prediction accuracy and evaluate the influence of different time scales and the number of Gaussian components. The model is trained and validated using driving data collected from the Beijing Institute of Technology (BIT) intelligent vehicle platform. Experimental results show that the proposed method can extract the motion primitives from the driving data and predict the future lateral control commands with high accuracy.
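
Gaussian Mixture Regression conditions a fitted joint mixture on the input to obtain a predicted output together with a variance, which is what makes confidence regions possible. The sketch below shows the standard GMR formulas for a scalar input and scalar output; the component dictionaries, their key names, and the example numbers are assumptions for illustration, and fitting the mixture itself is out of scope.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmr_predict(x, components):
    """Conditional mean and variance of y given x under a joint 2-D GMM.

    Each component is a dict with keys: weight, mu_x, mu_y, s_xx, s_xy, s_yy
    (mixing weight, means, and entries of the 2x2 covariance).
    """
    # Responsibility of each component for this input value.
    resp = [c["weight"] * gaussian_pdf(x, c["mu_x"], c["s_xx"])
            for c in components]
    total = sum(resp)
    resp = [r / total for r in resp]
    # Per-component conditional mean and variance of y given x.
    means = [c["mu_y"] + c["s_xy"] / c["s_xx"] * (x - c["mu_x"])
             for c in components]
    variances = [c["s_yy"] - c["s_xy"] ** 2 / c["s_xx"] for c in components]
    # Moments of the resulting mixture over y.
    mean = sum(h * m for h, m in zip(resp, means))
    second = sum(h * (v + m ** 2) for h, v, m in zip(resp, variances, means))
    return mean, second - mean ** 2


single = [{"weight": 1.0, "mu_x": 0.0, "mu_y": 1.0,
           "s_xx": 1.0, "s_xy": 0.5, "s_yy": 1.0}]
mean, var = gmr_predict(2.0, single)
```

With a single component this reduces to ordinary linear-Gaussian conditioning; with several components the responsibilities blend the local linear predictions, and the returned variance can be turned into a confidence band around the predicted command.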


Efficient, Certifiably Optimal High-Dimensional Clustering

arXiv.org Machine Learning

We consider SDP relaxation methods for data and variable clustering problems, which have been shown in the literature to have good statistical properties in a variety of settings, but remain intractable to solve in practice. In particular, we propose FORCE, a new algorithm to solve the Peng-Wei $K$-means SDP. Compared to the naive interior point method, our method reduces the computational complexity of solving the SDP from $\tilde{O}(d^7\log\epsilon^{-1})$ to $\tilde{O}(d^{6}K^{-2}\epsilon^{-1})$. Our method combines a primal first-order method with a dual optimality certificate search, which, when successful, allows for early termination of the primal method. We show under certain data generating distributions that, with high probability, FORCE is guaranteed to find the optimal solution to the SDP relaxation and provide a certificate of exact optimality. As verified by our numerical experiments, this allows FORCE to solve the Peng-Wei SDP with dimensions in the hundreds in only tens of seconds. We also consider a variation of the Peng-Wei SDP for the case when $K$ is not known a priori and show that a slight modification of FORCE reduces the computational complexity of solving this problem as well: from $\tilde{O}(d^7\log\epsilon^{-1})$ using a standard SDP solver to $\tilde{O}(d^{4}\epsilon^{-1})$.
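
For orientation, the Peng-Wei $K$-means SDP is commonly written in the form below. This is the textbook statement and not necessarily the exact parameterization used by FORCE; the symbols $D$ (a $d \times d$ matrix of pairwise dissimilarities between the $d$ objects being clustered) and $X$ (the relaxed partition matrix) are notation assumed here, and reading $d$ as the number of objects is an assumption about the abstract's notation.

```latex
\begin{aligned}
\min_{X \in \mathbb{R}^{d \times d}} \quad & \langle D, X \rangle \\
\text{subject to} \quad & X \mathbf{1} = \mathbf{1}, \quad \operatorname{tr}(X) = K, \\
& X \ge 0 \ \text{(entrywise)}, \quad X \succeq 0 .
\end{aligned}
```

Any partition into groups $G_1, \dots, G_K$ gives a feasible point $X = \sum_{k} |G_k|^{-1} \mathbf{1}_{G_k} \mathbf{1}_{G_k}^{\top}$, so the SDP optimum lower-bounds the best partition's objective. A dual certificate of the kind mentioned in the abstract shows that a candidate of this form is optimal for the relaxation itself.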


Conformation Clustering of Long MD Protein Dynamics with an Adversarial Autoencoder

arXiv.org Machine Learning

Recent developments in specialized computer hardware have greatly accelerated atomic-level Molecular Dynamics (MD) simulations. A single GPU-attached cluster is capable of producing microsecond-length trajectories in reasonable amounts of time. Multiple protein states and a large number of microstates associated with folding and with the function of the protein can be observed as conformations sampled in the trajectories. Clustering those conformations, however, is needed for identifying protein states, evaluating transition rates and understanding protein behavior. In this paper, we propose a novel data-driven generative conformation clustering method based on the adversarial autoencoder (AAE) and provide the associated software implementation Cong. The method was tested using a 208-microsecond MD simulation of the fast-folding peptide Trp-Cage (20 residues) obtained from the D.E. Shaw Research Group. The proposed clustering algorithm identifies many of the salient features of the folding process by grouping a large number of conformations that share common features not easily identifiable in the trajectory.