Data Mining


On clustering network-valued data

Neural Information Processing Systems

Community detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. While being able to cluster within a network is important, there is an emerging need to \emph{cluster multiple networks}. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing each network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix.
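
As a rough illustration of the feature vector mentioned above, the following Python sketch computes the traces of successive powers of an adjacency matrix. The function names, the choice of powers 1 through max_power, and the division by n^k are assumptions made for this example; the abstract does not state the exact normalization the authors use.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace_features(adj, max_power=4, normalize=True):
    """Return [tr(A^k) for k = 1..max_power] for adjacency matrix `adj`.

    With normalize=True each trace is divided by n**k (an assumed scaling
    that keeps features comparable across networks of different sizes).
    """
    n = len(adj)
    power = [row[:] for row in adj]  # holds A^k, starting at k = 1
    features = []
    for k in range(1, max_power + 1):
        trace = sum(power[i][i] for i in range(n))
        features.append(trace / n ** k if normalize else trace)
        power = mat_mul(power, adj)
    return features
```

Because a trace is unchanged when nodes are relabeled, these features do not need node correspondence between networks. The resulting vectors could then be passed to any standard clustering routine.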


Alibaba Cloud releases AI algorithms to GitHub

#artificialintelligence

Alibaba Cloud has released the source code of its Alink machine learning platform, which powered its parent company's 11.11 Global Shopping Festival, to GitHub. According to the cloud-computing and machine-intelligence unit of Alibaba Group, the platform contains a collection of algorithms for processing data in machine-learning tasks such as artificial intelligence (AI) driven customer service and product recommendation. Jia Yangqing, president and senior fellow of the data platform division at Alibaba Cloud Intelligence, said the release of Alink on GitHub provides developers with "robust big data and advanced machine learning skills". Alibaba used the platform on e-commerce sites such as Tmall during this year's 11.11 sales. Now that the code is published, developers can use it in their own solutions, with examples including real-time prediction, personalised recommendation, statistical analysis and abnormality detection, according to a post on Alizila, a news hub for Alibaba Group.


Practical Hash Functions for Similarity Estimation and Dimensionality Reduction

Neural Information Processing Systems

Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM.
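
To make the feature hashing idea concrete, here is a minimal Python sketch of the signed hashing trick. The use of MD5 with two string salts is purely an assumption for illustration and is not one of the hash functions the paper evaluates; the function names are hypothetical.

```python
import hashlib


def _hash(token, salt):
    """Deterministic 64-bit hash of `token` (MD5 is a stand-in choice)."""
    digest = hashlib.md5(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def feature_hash(features, dim):
    """Map a sparse {token: value} dict to a dense vector of length `dim`.

    One hash picks the bucket and a second picks a +1/-1 sign, so that
    colliding features cancel in expectation instead of always adding up.
    """
    out = [0.0] * dim
    for token, value in features.items():
        index = _hash(token, "index") % dim
        sign = 1.0 if _hash(token, "sign") % 2 == 0 else -1.0
        out[index] += sign * value
    return out
```

The map is linear in the input values, so inner products between hashed vectors approximate inner products between the original sparse vectors. How good that approximation is on structured input depends on the concrete hash function, which is the question the abstract raises.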


Prior-free and prior-dependent regret bounds for Thompson Sampling

Neural Information Processing Systems

We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by $14 \sqrt{n K}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\frac{1}{20} \sqrt{n K}$. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
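
For readers unfamiliar with Thompson Sampling, the following Python sketch shows the basic loop for Bernoulli rewards. The uniform Beta(1, 1) prior, the simulated reward draws, and the function name are assumptions for this example; the paper's bounds cover general priors.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per arm.

    `true_means` are the (normally unknown) success probabilities, used
    here only to simulate rewards.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

In this notation n is the horizon and K is the number of arms, so the prior-free bound quoted above scales as the square root of horizon times len(true_means).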


Bringing Media Analytics into View - IT Peer Network

#artificialintelligence

Video content will become richer and more data-intensive as it evolves from HD to 4K to 360 and even 8K. Companies are moving these visual workloads to the cloud and edge in order to keep up with capacity, growth and service demands. With the emergence of edge computing and cloudified, 5G networks, organizations have an opportunity to deliver insights through artificial intelligence (AI) that complement new user experiences and are adaptable to the complexities of delivering video content to a global audience. Companies need a visual cloud and media analytics platform that is flexible enough to support changing business models and deployment options, software that enables rapid innovation, and hardware that can scale to provide a range of performance, all while being able to lower total cost of ownership and grow profitability. Intel launched the Intel Select Solutions for Visual Cloud to give companies an easier path towards successful content creation and delivery starting with the Intel Select Solution for Simulation and Visualization and Intel Select Solution for Visual Cloud Delivery Network.


Alibaba Cloud Releases Machine Learning Algorithm Platform on Github

#artificialintelligence

Alibaba Cloud, the data intelligence backbone of Alibaba Group, announced that the core code of Alink, its self-developed algorithm platform, has been made available via open source on GitHub, the world's largest developer community. The platform offers a broad range of algorithm libraries that support both batch and stream processing, which is critical for machine learning tasks such as online product recommendation and intelligent customer services. Data analysts and software developers can access the code on GitHub to build their own software, facilitating tasks such as statistical analysis, machine learning, real-time prediction, personalized recommendation and abnormality detection. "As a platform that consists of various algorithms combining learning in various data processing patterns, Alink can be a valuable option for developers looking for robust big data and advanced machine learning tools," said Yangqing Jia, President and Senior Fellow of Data Platform at Alibaba Cloud Intelligence. "As one of the top ten contributors to GitHub, we are committed to connecting with the open source community as early as possible in our software development cycles.


Rapid Distance-Based Outlier Detection via Sampling

Neural Information Processing Systems

Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. Papers published at the Neural Information Processing Systems Conference.
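
One plausible reading of a "simple, sampling-based scheme" is sketched below in Python: draw a small random sample once and score every point by its distance to the nearest sampled point. The abstract does not spell out the procedure, so the details here (Euclidean distance, excluding a point from its own comparison, the function name) are assumptions.

```python
import math
import random


def sampling_outlier_scores(points, sample_size, seed=0):
    """Score each point by its distance to the nearest sampled point.

    Larger scores suggest outliers. A point that is itself in the sample
    is compared against the other sampled points only.
    """
    if not 2 <= sample_size <= len(points):
        raise ValueError("sample_size must be between 2 and len(points)")
    rng = random.Random(seed)
    sample = rng.sample(range(len(points)), sample_size)
    scores = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, points[j]) for j in sample if j != i)
        scores.append(nearest)
    return scores
```

Scoring costs one distance computation per point per sampled point, whereas a naive k-nearest-neighbor search compares every pair of points.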


A Drifting-Games Analysis for Online Learning and Applications to Boosting

Neural Information Processing Systems

We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting them into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates.
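
For context on the Hedge setting named above, here is a minimal Python sketch of the classical exponential-weights Hedge algorithm. This is the standard textbook version with a learning rate eta, not one of the parameter-free algorithms the paper proposes; the function name and return values are assumptions for this example.

```python
import math


def hedge(loss_rounds, eta):
    """Run classical Hedge on a sequence of loss vectors with entries in [0, 1].

    Returns (expected loss of the learner, cumulative loss of each expert).
    """
    n = len(loss_rounds[0])
    cum_loss = [0.0] * n
    learner_loss = 0.0
    for losses in loss_rounds:
        # Weights are exponential in negative cumulative loss; subtracting
        # the minimum keeps the exponentials numerically stable.
        base = min(cum_loss)
        weights = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum_loss = [c + l for c, l in zip(cum_loss, losses)]
    return learner_loss, cum_loss
```

Regret against the best expert is learner_loss minus min(cum_loss). With eta set to sqrt(8 ln N / T) for N experts and T rounds, the standard analysis bounds this regret by sqrt(T ln N / 2), but that tuning requires knowing T in advance.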


The Large Margin Mechanism for Differentially Private Maximization

Neural Information Processing Systems

A basic problem in the design of privacy-preserving algorithms is the \emph{private maximization problem}: the goal is to pick an item from a universe that (approximately) maximizes a data-dependent function, all under the constraint of differential privacy. This problem has been used as a sub-routine in many privacy-preserving algorithms for statistics and machine learning. Previous algorithms for this problem are either range-dependent---i.e., their utility diminishes with the size of the universe---or only apply to very restricted function classes. This work provides the first general purpose, range-independent algorithm for private maximization that guarantees approximate differential privacy. Its applicability is demonstrated on two fundamental tasks in data mining and machine learning.
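
As a point of reference for the range-dependent algorithms mentioned above, the following Python sketch implements the classical exponential mechanism, whose utility guarantee degrades with the logarithm of the universe size. It is a baseline illustration only and is not the large margin mechanism of the paper; the function name and parameters are assumptions.

```python
import math
import random


def exponential_mechanism(scores, epsilon, sensitivity=1.0, rng=None):
    """Pick an index with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    `scores[i]` is the data-dependent value of item i, and `sensitivity`
    bounds how much any score can change when one record changes.
    """
    rng = rng or random.Random()
    top = max(scores)
    # Shifting by the maximum leaves the distribution unchanged and
    # avoids overflow in exp().
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]
```

When many items have scores close to the maximum, the probability mass spreads across all of them, which is one way the dependence on universe size shows up.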


Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization

Neural Information Processing Systems

Distributed learning allows a group of independent data owners to collaboratively learn a model over their data sets without exposing their private data. We present a distributed learning approach that combines differential privacy with secure multi-party computation. We explore two popular methods of differential privacy, output perturbation and gradient perturbation, and advance the state-of-the-art for both methods in the distributed learning setting. In our output perturbation method, the parties combine local models within a secure computation and then add the required differential privacy noise before revealing the model. In our gradient perturbation method, the data owners collaboratively train a global model via an iterative learning algorithm.
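
The output perturbation step described above can be sketched in Python as follows. This is a plain, non-secure simulation: the secure multi-party computation is omitted entirely, and the choice of averaging, Laplace noise, and a caller-supplied noise_scale are assumptions for illustration rather than the paper's calibrated mechanism.

```python
import random


def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def aggregate_with_output_perturbation(local_models, noise_scale, rng=None):
    """Average local parameter vectors, then add noise to each coordinate.

    In the setting described above, both the averaging and the noise
    addition would happen inside a secure computation, so that no party
    sees another party's local model or the un-noised average.
    """
    rng = rng or random.Random()
    m = len(local_models)
    dim = len(local_models[0])
    average = [sum(model[j] for model in local_models) / m
               for j in range(dim)]
    if noise_scale == 0:
        return average  # no privacy protection; useful only for testing
    return [a + laplace_noise(rng, noise_scale) for a in average]
```

The appropriate noise_scale depends on the privacy budget, the regularization, and the size of the smallest local data set, none of which this sketch derives.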