Dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection (find a subset of the original variables) and feature extraction (transform the data in the high-dimensional space to a space of fewer dimensions). (Wikipedia)
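The two families named in this definition can be contrasted in a short sketch. It assumes NumPy, synthetic Gaussian data, a variance-based selection rule, and PCA via SVD as the extraction step; the helper names `select_features` and `extract_features` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def select_features(X, k):
    """Feature selection: keep the k original columns with the largest variance."""
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[-k:])  # indices of the retained columns
    return X[:, keep], keep

def extract_features(X, k):
    """Feature extraction: project centered data onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Synthetic data (an assumption): five columns with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 0.1, 3.0, 0.2, 0.3])

X_sel, kept = select_features(X, 2)   # a subset of the original variables
X_ext = extract_features(X, 2)        # new variables built from all columns
```

Selection returns untouched original columns, so the result stays interpretable in the original units; extraction returns linear combinations of every column, which usually preserves more variance but loses that direct interpretation.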
This course is designed to give you the Visualization/Dimensionality Reduction skills you need to become an expert data scientist. By the end of the course, you will understand Visualization/Dimensionality Reduction extremely well and be able to use the techniques on your own projects and be productive as a computer scientist and data analyst. Here's just some of what you'll learn
To make sense of the world, our brains must analyze high-dimensional datasets streamed by our sensory organs. Because such analysis begins with dimensionality reduction, modeling early sensory processing requires biologically plausible online dimensionality reduction algorithms. Recently, we derived such an algorithm, termed similarity matching, from a Multidimensional Scaling (MDS) objective function. However, in the existing algorithm, the number of output dimensions is set a priori by the number of output neurons and cannot be changed. Because the number of informative dimensions in sensory inputs is variable, there is a need for adaptive dimensionality reduction.
This paper makes some pretty critical mistakes regarding previous work. For one, they cite , but they should in fact be citing Cohen et al., "Dimensionality Reduction for k-Means Clustering and Low Rank Approximation". This is not just a typo - the authors go on to state a result of  about operator norm rather than the result of the Cohen et al. paper - namely, the Cohen et al. paper achieves O(k/eps^2) rescaled columns deterministically for exactly the same problem considered in this submission - see part 5 of Lemma 11 and section 7.3 based on BSS. This is much stronger than the O(k^2/eps^2) rescaled columns achieved in the submission. This directly contradicts their sentence "Our main result is the first algorithm for computing an (k,eps)-coreset C of size independent of both n and d". The authors also say later [8,7] minimize the 2-norm -  is the wrong reference again!
In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any n × d matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the n rows that approximates their sum of squared distances to every k-dimensional affine subspace. An open theoretical problem has been to compute such a coreset whose size is independent of both n and d. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.
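To make the coreset definition in this abstract concrete, the sketch below computes the quantity a coreset must preserve: the (weighted) sum of squared distances from a set of rows to a k-dimensional affine subspace. It is not the paper's deterministic construction. As a loudly labeled assumption, a uniform random sample with weights n/m stands in for the coreset, and the function name `subspace_cost`, the Gaussian data, and all sizes are hypothetical.

```python
import numpy as np

def subspace_cost(A, basis, offset, weights=None):
    """Weighted sum of squared distances from the rows of A to the affine
    subspace {offset + c @ basis}; `basis` holds k orthonormal rows of length d."""
    centered = A - offset
    # Remove the component lying inside the subspace; what remains is the residual.
    residual = centered - (centered @ basis.T) @ basis
    sq_dist = np.sum(residual ** 2, axis=1)
    if weights is None:
        weights = np.ones(len(A))
    return float(weights @ sq_dist)

rng = np.random.default_rng(0)
n, d, k, m = 2000, 20, 3, 200
A = rng.normal(size=(n, d))

# Naive stand-in for a coreset (an assumption, NOT the paper's method):
# m uniformly sampled rows, each scaled by the weight n / m.
idx = rng.choice(n, size=m, replace=False)
C, w = A[idx], np.full(m, n / m)

# One arbitrary query subspace: an orthonormal basis from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
basis, offset = Q.T, rng.normal(size=d)

full_cost = subspace_cost(A, basis, offset)
approx_cost = subspace_cost(C, basis, offset, w)
relative_error = abs(full_cost - approx_cost) / full_cost
```

A true coreset in the abstract's sense must keep this relative error small for every k-dimensional affine subspace simultaneously, which a uniform sample does not guarantee in general; the sketch only checks a single query.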
The authors modify the MCBoost criterion in order to allow for multi-class boosting that is based on an arbitrary number of dimensions (compared to a previous formulation that limits the number of dimensions to the number of classes). Lifting this limit on dimensionality allows for a boosting-like framework that comprises a controllable number of boosting functions, and thus can be used as. The connection between MC-Boost and MC-SVM is interesting, and the discussion is good. Is the fact that both MC-SVM and MC-Boost try to maximise the margin well known? The authors present improved results in terms of error rate, and in terms of mAP.
In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through a combination of weak learners. We argue that the intermediate mapping, i.e. the boosting predictor, preserves the discriminant aspects of the data and that by controlling the dimension of this mapping it is possible to obtain discriminant low-dimensional representations for the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER), that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification.
Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. A concrete hash function may work fine on sufficiently random input. The question is whether such functions can be trusted in the real world, where they may be faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM.
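Feature hashing, as mentioned in this abstract, can be sketched in a few lines: each token is hashed to one of `dim` buckets and to a sign, and the signed counts form the reduced vector. The sketch below is a minimal illustration under stated assumptions: salted MD5 stands in for the hash function purely for reproducibility, and the helper names `_hash` and `feature_hash` are hypothetical. The abstract's point is precisely that this choice of concrete hash function matters, so this is not a recommendation of any particular one.

```python
import hashlib

def _hash(token, salt):
    """Deterministic integer hash of a token (MD5 is an assumption, used
    here only so that results are reproducible across runs)."""
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

def feature_hash(tokens, dim):
    """Map a bag of tokens to a dim-dimensional vector using signed hashing."""
    vec = [0.0] * dim
    for tok in tokens:
        index = _hash(tok, "index:") % dim                    # bucket for this token
        sign = 1.0 if _hash(tok, "sign:") % 2 == 0 else -1.0  # random-looking sign
        vec[index] += sign
    return vec

doc = "the quick brown fox jumps over the lazy dog".split()
hashed = feature_hash(doc, dim=16)
```

The sign hash makes colliding tokens cancel in expectation rather than always accumulate, which is what keeps inner products between hashed vectors approximately unbiased when the hash functions behave randomly.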
Summary: The paper considers the setting of streaming PCA for time series data, which contains two challenging ingredients: data stream dependence and a non-convex optimization manifold. The authors address this setting via a downsampled version of Oja's algorithm. By closely inspecting the optimization manifold and using tools from the theory of stochastic differential equations, the authors provide a rather detailed analysis of the convergence behavior, along with confirming experiments on synthetic and real data. Evaluation: Streaming PCA is a fundamental setting in a topic which becomes increasingly important for the ML community, namely, time series analysis. Both data dependence and non-convex optimization are still at their anecdotal preliminary stage, and the algorithm and the analysis provided in the paper form an interesting contribution in this respect.
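The idea of a downsampled Oja's algorithm can be sketched as follows: apply Oja's update for the top principal direction, but only on every `gap`-th sample, so that consecutive updates see less correlated data. This is a minimal sketch and not the reviewed paper's exact procedure; the function name `downsampled_oja`, the AR(1) synthetic series, the step size, and the gap are all assumptions chosen for illustration.

```python
import numpy as np

def downsampled_oja(stream, step_size, gap):
    """Estimate the top principal direction with Oja's rule, using only
    every `gap`-th sample to weaken temporal dependence between updates."""
    w = None
    for t, x in enumerate(stream):
        if t % gap != 0:
            continue                        # skip samples between updates
        if w is None:
            w = np.ones_like(x) / np.sqrt(len(x))
        w = w + step_size * x * (x @ w)     # stochastic gradient step
        w = w / np.linalg.norm(w)           # project back onto the unit sphere
    return w

# Synthetic dependent stream (an assumption): an AR(1) process whose
# stationary covariance is diagonal with the largest variance on axis 0.
rng = np.random.default_rng(0)
d, T = 5, 20000
scales = np.array([3.0, 1.0, 0.5, 0.5, 0.5])
x = np.zeros(d)
samples = []
for _ in range(T):
    x = 0.5 * x + rng.normal(size=d) * scales
    samples.append(x.copy())

w_hat = downsampled_oja(samples, step_size=0.01, gap=4)
alignment = abs(w_hat[0])   # |cosine| between the estimate and the true top direction
```

The normalization step is what makes the problem non-convex: the iterate is constrained to the unit sphere, which is the optimization manifold the review refers to.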