Statistical Learning
On-line regression competitive with reproducing kernel Hilbert spaces
We consider the problem of on-line prediction of real-valued labels, assumed bounded in absolute value by a known constant, of new objects from known labeled objects. The prediction algorithm's performance is measured by the squared deviation of the predictions from the actual labels. No stochastic assumptions are made about the way the labels and objects are generated. Instead, we are given a benchmark class of prediction rules some of which are hoped to produce good predictions. We show that for a wide range of infinite-dimensional benchmark classes one can construct a prediction algorithm whose cumulative loss over the first N examples does not exceed the cumulative loss of any prediction rule in the class plus O(sqrt(N)); the main differences from the known results are that we do not impose any upper bound on the norm of the considered prediction rules and that we achieve an optimal leading term in the excess loss of our algorithm. If the benchmark class is "universal" (dense in the class of continuous functions on each compact set), this provides an on-line non-stochastic analogue of universally consistent prediction in non-parametric statistics. We use two proof techniques: one is based on the Aggregating Algorithm and the other on the recently developed method of defensive forecasting.
K-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data
He, Zengyou, Xu, Xiaofei, Deng, Shengchun
Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way that is similar to the popular k-means algorithm, and the goodness of clustering in each step is evaluated using a mutual information based criterion (namely, Average Normalized Mutual Information-ANMI) borrowed from cluster ensemble. Experimental results on real datasets show that k-ANMI algorithm is competitive with those state-of-art categorical data clustering algorithms with respect to clustering accuracy.
Multiresolution Kernels
Cuturi, Marco, Fukumizu, Kenji
We present in this work a new methodology to design kernels on data which is structured with smaller components, such as text, images or sequences. This methodology is a template procedure which can be applied on most kernels on measures and takes advantage of a more detailed "bag of components" representation of the objects. To obtain such a detailed description, we consider possible decompositions of the original bag into a collection of nested bags, following a prior knowledge on the objects' structure. We then consider these smaller bags to compare two objects both in a detailed perspective, stressing local matches between the smaller bags, and in a global or coarse perspective, by considering the entire bag. This multiresolution approach is likely to be best suited for tasks where the coarse approach is not precise enough, and where a more subtle mixture of both local and global similarities is necessary to compare objects. The approach presented here would not be computationally tractable without a factorization trick that we introduce before presenting promising results on an image retrieval task.
Non-asymptotic calibration and resolution
We consider the problem of forecasting a new observation from the available data, which may include, e.g., all or some of the previous observation s and the values of some explanatory variables. To make the process of fore casting more vivid, we imagine that the data and observations are chosen by a play er called Reality and the forecasts are made by a player called Forecaster. T o establish properties of forecasting algorithms, the traditional theory of m achine learning makes some assumptions about the way Reality generates the ob servations; e.g., statistical learning theory [28] assumes that the data and obs ervations are generated independently from the same probability distribution. A m ore recent approach, prediction with expert advice (see, e.g., [5]), replaces th e assumptions about Reality by a comparison class of prediction strategies; a typical result of this theory asserts that Forecaster can perform almos t as well as the best strategies in the comparison class. This paper further explor es a third possibility, suggested in [11], which requires neither assumptions abo ut Reality nor a comparison class of Forecaster's strategies.
Attribute Value Weighting in K-Modes Clustering
He, Zengyou, Xu, Xaiofei, Deng, Shengchun
Categorical data clustering is an important research problem in pattern recognition and data mining. The k -modes algorithm [1] extends the k -means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k -means fashion to minimize the cost function of clustering. The k -modes algorithm is widely used in real world applications due to its efficiency in dealing with large categorical database. In standard k -modes algorithm, a simple matching similarity measure is used, in which the distance is either 0 or 1. Such simple matching dissimilarity measure doesn't consider the implicit similarity relationship embedded in categorical values, which will result in a weaker intra-cluster similarity by allocating less similar objects to the cluster. To illustrate this fact, let's consider the following example shown in Fig.1. Example 1: In this artificial example, the dataset is described with 3 categorical attributes A1, A2,and A3, and there are two clusters with their modes. Assuming that we have to allocate a data object Y = [a, p, w] to either cluster 1 or cluster 2. According to the k -modes algorithm, we can assign Y to either cluster 1 or cluster 2 since these two clusters have the same mode. However, from the viewpoint of intra-cluster simila rity, it is more desirable to allocate Y to cluster 1.
A Note on Local Ultrametricity in Text
Structures that are inherent to data of any type can be of import ance, and hierarchical structure is a prime example. In this work we take text corpora and assess the extent of hierarchical structure among words co nstituting the texts. By comprehensively taking context into account we seek to study hierarchical structures in the domain semantics. The data studied in Rammal et al. (1986) and Murtagh (2004) is point pattern data: observational features with their measurements on many coordinate dimensions. Data may be instead presented as time-varyin g signals and in a similar way, related to the findings of Rammal et al. (1986) and 1 Murtagh (2004), we have investigated ultrametric-related prope rties of time series or 1D signals in Murtagh (2005a).
A kernel method for canonical correlation analysis
Canonical correlation analysis is a technique to extract common features from a pair of multivariate data. In complex situations, however, it does not extract useful features because of its linearity. On the other hand, kernel method used in support vector machine is an efficient approach to improve such a linear method. In this paper, we investigate the effectiveness of applying kernel method to canonical correlation analysis.
Penalized Likelihood Methods for Estimation of Sparse High Dimensional Directed Acyclic Graphs
Shojaie, Ali, Michailidis, George
Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical, as well as biological systems, where directed edges between nodes represent the influence of components of the system on each other. The general problem of estimating DAGs from observed data is computationally NP-hard, Moreover two directed graphs may be observationally equivalent. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose a penalized likelihood approach that directly estimates the adjacency matrix of DAGs. Both lasso and adaptive lasso penalties are considered and an efficient algorithm is proposed for estimation of high dimensional DAGs. We study variable selection consistency of the two penalties when the number of variables grows to infinity with the sample size. We show that although lasso can only consistently estimate the true network under stringent assumptions, adaptive lasso achieves this task under mild regularity conditions. The performance of the proposed methods is compared to alternative methods in simulated, as well as real, data examples.
Sparse Convolved Multiple Output Gaussian Processes
Álvarez, Mauricio A., Lawrence, Neil D.
Recently there has been an increasing interest in methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different sparse approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in pollution prediction, school exams score prediction and gene expression data.
Dimension reduction and variable selection in case control studies via regularized likelihood optimization
Bunea, Florentina, Barbu, Adrian
Dimension reduction and variable selection are performed routinely in case-control studies, but the literature on the theoretical aspects of the resulting estimates is scarce. We bring our contribution to this literature by studying estimators obtained via L1 penalized likelihood optimization. We show that the optimizers of the L1 penalized retrospective likelihood coincide with the optimizers of the L1 penalized prospective likelihood. This extends the results of Prentice and Pyke (1979), obtained for non-regularized likelihoods. We establish both the sup-norm consistency of the odds ratio, after model selection, and the consistency of subset selection of our estimators. The novelty of our theoretical results consists in the study of these properties under the case-control sampling scheme. Our results hold for selection performed over a large collection of candidate variables, with cardinality allowed to depend and be greater than the sample size. We complement our theoretical results with a novel approach of determining data driven tuning parameters, based on the bisection method. The resulting procedure offers significant computational savings when compared with grid search based methods. All our numerical experiments support strongly our theoretical findings.