Data Mining


Doubly Stochastic Normalization for Spectral Clustering

Neural Information Processing Systems

In this paper we focus on the issue of normalization of the affinity matrix in spectral clustering. We show that the difference between N-cuts and Ratio-cuts is in the error measure being used (relative-entropy versus L1 norm) when finding the closest doubly stochastic approximation to the input affinity matrix.
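
For intuition, here is a minimal sketch of one standard way to produce a doubly stochastic approximation of an affinity matrix: Sinkhorn-style alternating row/column normalization, which corresponds to the relative-entropy error measure mentioned above. This is an illustration only, not the paper's algorithm; the function name, tolerance and toy data are our own.

import numpy as np

def sinkhorn_normalize(A, n_iter=100, tol=1e-8):
    # Alternately normalize rows and columns of a nonnegative affinity matrix
    # until it is (approximately) doubly stochastic.
    W = A.astype(float).copy()
    for _ in range(n_iter):
        W = W / W.sum(axis=1, keepdims=True)      # row normalization
        W = W / W.sum(axis=0, keepdims=True)      # column normalization
        if (np.abs(W.sum(axis=1) - 1).max() < tol and
                np.abs(W.sum(axis=0) - 1).max() < tol):
            break
    return W

# Toy affinity matrix from a Gaussian kernel on five 1-D points (two clusters).
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
A = np.exp(-(x[:, None] - x[None, :]) ** 2)
W = sinkhorn_normalize(A)
print(np.round(W.sum(axis=0), 3), np.round(W.sum(axis=1), 3))   # both close to 1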


Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

Neural Information Processing Systems

We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi's entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, obtained by replacing kernel PCA with kernel MaxEnt as an intermediate step. This has a major impact on performance.
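
As a rough illustration of the ingredients named in the abstract, the sketch below estimates quadratic Renyi entropy from a Parzen (Gaussian) kernel matrix and projects onto kernel eigenvectors ranked by an entropy-contribution score lambda_i * (1^T e_i)^2 rather than by eigenvalue alone, as kernel PCA would. The ranking rule, function names and parameters are our assumptions about the entropy-preservation idea, not the paper's exact procedure.

import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Parzen-window (Gaussian) kernel matrix for an (n, d) data array.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def renyi_entropy_parzen(K):
    # Quadratic Renyi entropy estimate: H2 = -log(mean of kernel entries).
    return -np.log(K.mean())

def entropy_ranked_projection(X, n_components=2, sigma=1.0):
    # Project onto kernel eigenvectors ranked by their contribution to the
    # entropy estimate, lambda_i * (1^T e_i)^2, instead of by eigenvalue alone.
    K = gaussian_kernel(X, sigma)
    lam, E = np.linalg.eigh(K)
    contrib = lam * (E.sum(axis=0)) ** 2          # entropy contribution per eigenpair
    idx = np.argsort(contrib)[::-1][:n_components]
    Z = E[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))
    return Z, renyi_entropy_parzen(K)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4.0])
Z, H2 = entropy_ranked_projection(X, n_components=2, sigma=1.0)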


Geometric entropy minimization (GEM) for anomaly detection and localization

Neural Information Processing Systems

We introduce a novel adaptive nonparametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy-minimizing set that supports at least ρ = (K/N)·100% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy-minimizing set, an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this nonparametric anomaly detector is proposed that approximates this minimum-entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest-neighbor graph on resampled subsets of the training data, GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high-dimensional feature spaces.
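
The following toy sketch captures the flavor of the leave-one-out k-nearest-neighbor variant: the anomaly score is the distance to the k-th nearest training point, and the empirical p-value is the fraction of (leave-one-out) nominal scores that are at least as large. It is a simplified surrogate for the entropic-graph influence region; the parameters and data are illustrative assumptions, not the paper's implementation.

import numpy as np

def knn_scores(train, k):
    # Leave-one-out distance to the k-th nearest neighbor for each training point.
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # ignore self-distances
    return np.sort(d, axis=1)[:, k - 1]

def gem_pvalue(test_point, train, k):
    # Empirical p-value: fraction of nominal (leave-one-out) scores that are
    # at least as large as the test point's k-NN distance.
    base = knn_scores(train, k)
    d = np.sort(np.linalg.norm(train - test_point, axis=1))[k - 1]
    return float((base >= d).mean())

rng = np.random.default_rng(0)
nominal = rng.normal(size=(200, 2))
print(gem_pvalue(np.array([0.1, 0.0]), nominal, k=5))   # near the bulk: large p-value
print(gem_pvalue(np.array([6.0, 6.0]), nominal, k=5))   # far away: tiny p-value, anomaly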


Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

Neural Information Processing Systems

We propose a generic algorithm for computation of similarity measures for sequential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subsequences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coefficients for sequences as alternatives to classical kernel functions.
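
To make the embedding concrete, here is a hash-table sketch of a k-gram similarity and the distance it induces. The paper's contribution is computing such measures with generalized suffix trees in worst-case linear time over arbitrary embedding languages; the dictionary version below only illustrates the feature space, with hypothetical function names and toy strings.

from collections import Counter
import math

def kgram_counts(seq, k=3):
    # Embed a sequence into k-gram feature space as a dictionary of counts.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kgram_kernel(x, y, k=3):
    # Inner product between two sequences in the k-gram feature space.
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    return sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())

def kgram_distance(x, y, k=3):
    # Euclidean distance induced by the same embedding.
    return math.sqrt(kgram_kernel(x, x, k) - 2 * kgram_kernel(x, y, k) + kgram_kernel(y, y, k))

print(kgram_kernel("GATTACA", "GATTAGA"))     # number of weighted shared 3-grams
print(kgram_distance("GATTACA", "GATTAGA"))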


Graph-Based Visual Saliency

Neural Information Processing Systems

A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: first forming activation maps on certain feature channels, and then normalizing them in a way which highlights conspicuity and admits combination with other maps. The model is simple, and biologically plausible insofar as it is naturally parallelized. This model powerfully predicts human fixations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch ([2], [3], [4]) achieve only 84%.
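
A minimal sketch of the graph-based activation step on a single small feature map, under our reading of the abstract: edge weights combine feature dissimilarity with spatial proximity, and the activation map is the stationary distribution of the resulting Markov chain. The exact weight formula, sigma_frac, and iteration count here are illustrative assumptions rather than the paper's specification.

import numpy as np

def gbvs_activation(feature_map, sigma_frac=0.15, n_iter=200):
    # Activation as the stationary distribution of a Markov chain whose
    # transitions favor dissimilar feature values at nearby locations.
    # Builds a dense (h*w) x (h*w) weight matrix, so keep the map small.
    h, w = feature_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    vals = feature_map.ravel().astype(float) + 1e-9
    diss = np.abs(np.log(vals[:, None]) - np.log(vals[None, :]))
    dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    sigma = sigma_frac * max(h, w)
    W = diss * np.exp(-dist2 / (2 * sigma ** 2)) + 1e-12
    P = W / W.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(n_iter):                       # power iteration: pi <- pi P
        pi = pi @ P
    return pi.reshape(h, w)

fmap = np.ones((16, 16)); fmap[4:7, 9:12] = 5.0   # a single bright blob
saliency = gbvs_activation(fmap)                  # activation map for the blob stimulus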


In-Network PCA and Anomaly Detection

Neural Information Processing Systems

We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales.
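
For context, the residual-subspace idea can be sketched in a few lines in the centralized setting: fit a PCA "normal" subspace and flag time steps whose squared prediction error in the residual subspace is large. This is only the baseline the abstract refers to; the paper's subject is tracking this quantity in-network, which the sketch (with its synthetic data and parameters, all our own) does not attempt.

import numpy as np

def residual_anomaly_scores(X, n_components):
    # Fit the "normal" PCA subspace, then score each time step by its squared
    # prediction error (energy left in the residual subspace).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                       # basis of the normal subspace
    residual = Xc - Xc @ P @ P.T                  # part unexplained by the top components
    return (residual ** 2).sum(axis=1)

rng = np.random.default_rng(1)
B = rng.normal(size=(20, 3))                      # nominal traffic spans a 3-D subspace
X = rng.normal(size=(500, 3)) @ B.T + 0.1 * rng.normal(size=(500, 20))
X[250] += 5.0 * rng.normal(size=20)               # anomaly leaves the normal subspace
scores = residual_anomaly_scores(X, n_components=3)
print(scores.argmax())                            # expected: 250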


Efficient Methods for Privacy Preserving Face Detection

Neural Information Processing Systems

Bob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure multi-party computations use cryptographic tools to solve this problem without leaking any information, but these methods are slow to compute. We therefore introduce two machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are sufficient for classifying an image patch, but not sufficient to actually reconstruct it. The second is active learning, which allows Alice to construct an online classifier based on a small number of calls to Bob's face detector. She can then use her online classifier as a fast rejector before applying a cryptographically secure classifier to the remaining image patches.
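
A toy sketch of the two-stage design described above, with hypothetical fast_reject and secure_classify callables: Alice's cheap local rejector discards most patches, and only the survivors go through the expensive secure classifier. In the paper, the rejector would be the online classifier Alice learns via active learning; everything below is illustrative.

def two_stage_detect(patches, fast_reject, secure_classify):
    # Cheap local rejector filters most patches; only the survivors are sent
    # through the expensive (e.g., cryptographically secure) classifier.
    survivors = [p for p in patches if not fast_reject(p)]
    return [p for p in survivors if secure_classify(p)]

# Toy usage: patches are brightness values; the rejector discards dark patches.
patches = [0.1, 0.4, 0.8, 0.95, 0.2]
faces = two_stage_detect(patches,
                         fast_reject=lambda p: p < 0.5,      # Alice's local rejector
                         secure_classify=lambda p: p > 0.9)  # costly joint computation
print(faces)   # [0.95]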


Knowware: the third star after Hardware and Software

arXiv.org Artificial Intelligence

This book proposes to separate knowledge from software and to make it a commodity in its own right, called knowware. The architecture, representation and function of knowware are discussed. The principles of knowware engineering and its three life cycle models (the furnace model, the crystallization model and the spiral model) are proposed and analyzed. Techniques of software/knowware co-engineering are introduced. A software component whose knowledge is replaced by knowware is called mixware. An object- and component-oriented development schema for mixware is introduced. In particular, the tower model and ladder model for mixware development are proposed and discussed. Finally, knowledge service and knowware-based Web service are introduced and compared with Web service. In summary, knowware, software and hardware should be considered three equally important underpinnings of the IT industry. Ruqian Lu is a professor of computer science at the Institute of Mathematics, Academy of Mathematics and System Sciences, and a fellow of the Chinese Academy of Sciences. His research interests include artificial intelligence, knowledge engineering and knowledge-based software engineering. He has published more than 100 papers and 10 books. He has won two first-class awards from the Academia Sinica and a national second-class prize from the Ministry of Science and Technology. He has also won the sixth Hua Loo-keng Mathematics Prize.


Analyzing covert social network foundation behind terrorism disaster

arXiv.org Artificial Intelligence

This paper presents a method for analyzing the covert social network foundation hidden behind a terrorism disaster. It addresses the node discovery problem: discovering a node that plays a relevant role in a social network but has escaped monitoring of the presence and mutual relationships of nodes. The method integrates the expert investigator's prior understanding, insight into the nature of terrorists' social networks derived from complex graph theory, and computational data processing. The social network responsible for the 9/11 attacks in 2001 is used in a simulation experiment to evaluate the performance of the method.


Semantic distillation: a method for clustering objects by their contextual specificity

arXiv.org Machine Learning

Techniques for data mining, latent semantic analysis, contextual search of databases, and related tasks were developed long ago by computer scientists working on information retrieval (IR). Experimental scientists from all disciplines, who must analyse large collections of raw experimental data (astronomical, physical, biological, etc.), have developed powerful methods for statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first, to show that when formulated at an abstract level, problems from IR, from statistical data analysis, and from physical measurement theories are very similar and can therefore profitably be cross-fertilised; and second, to propose a novel method of fuzzy hierarchical clustering, termed semantic distillation, strongly inspired by the theory of quantum measurement, which we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA array experiments and clustering the genes of the array according to their contextual specificity.