# Data Science

### Graphical Models for Inference with Missing Data

We address the problem of deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called 'Missingness Graphs' to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we define the notion of *recoverability*, which ensures that, for a given missingness graph $G$ and a given query $Q$, an algorithm exists such that, in the limit of large samples, it produces an estimate of $Q$ *as if* no data were missing. We further present conditions that the graph should satisfy in order for recoverability to hold and devise algorithms to detect the presence of these conditions. Papers published at the Neural Information Processing Systems Conference.
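As a small illustration of what recoverability means (a textbook-style example, not taken from the abstract itself), assume $X$ is fully observed, $Y$ is partially observed, and $R_y$ is the missingness indicator of $Y$, with the assumed convention that $R_y = 0$ means $Y$ is observed. If the missingness graph implies $Y \perp R_y \mid X$, then $P(y)$ can be written entirely in terms of quantities that are estimable from the incomplete data:

$$
P(y) \;=\; \sum_x P(y \mid x)\, P(x) \;=\; \sum_x P(y \mid x, R_y = 0)\, P(x)
$$

The second equality uses the assumed conditional independence. Both factors on the right-hand side can be estimated from the available samples, so under these assumptions $P(y)$ would be recoverable.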

### On clustering network-valued data

Community detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. While being able to cluster within a network is important, there are emerging needs to be able to *cluster multiple networks*. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing each network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector with each network, based on traces of powers of the adjacency matrix.
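The trace-based feature vector mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default `max_power`, and the optional scaling of the adjacency matrix by the number of nodes are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace_power_features(adj: Matrix, max_power: int = 3,
                         normalize: bool = True) -> List[float]:
    """Return [tr(B^1), ..., tr(B^max_power)], where B = adj / n if
    normalize is True and B = adj otherwise (hypothetical scaling choice)."""
    n = len(adj)
    scale = float(n) if normalize else 1.0
    base = [[adj[i][j] / scale for j in range(n)] for i in range(n)]
    power = [row[:] for row in base]  # current power of B, starting at B^1
    features = []
    for _ in range(max_power):
        features.append(sum(power[i][i] for i in range(n)))
        power = matmul(power, base)
    return features


# Two small graphs without any node correspondence between them.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(trace_power_features(triangle, normalize=False))
print(trace_power_features(path, normalize=False))
```

Because the trace of a matrix power does not change when nodes are relabeled, such features can be compared across networks even when no node correspondence is available. The resulting vectors could then be passed to any standard clustering routine.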

### Alibaba Cloud releases AI algorithms to GitHub

Alibaba Cloud has released the source code of its Alink machine learning platform, which powered its parent company's 11.11 Global Shopping Festival, to GitHub. According to the cloud-computing and machine-intelligence unit of Alibaba Group, the platform contains a collection of algorithms for processing data in machine-learning tasks, such as artificial intelligence (AI) driven customer services and product recommendations. Jia Yangqing, president and senior fellow of the data platform division at Alibaba Cloud Intelligence, said the release of Alink on GitHub provides developers with "robust big data and advanced machine learning skills". Alibaba's use of the platform included e-commerce sites like Tmall during 11.11 sales this year. With the code now listed on GitHub, developers can use it in their own solutions, with examples including real-time predictions, personalised recommendations, statistical analysis and abnormality detection, according to a post on Alizila, a news hub for Alibaba Group.

### Practical Hash Functions for Similarity Estimation and Dimensionality Reduction

Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit-cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM.
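To make the feature hashing (FH) setting concrete, here is a minimal sketch of the signed hashing trick. The use of MD5 from `hashlib` as the concrete hash function, the seed strings, and the function names are assumptions chosen for illustration only. The abstract's point is that the choice of concrete hash function matters, so this should not be read as a recommended choice.

```python
import hashlib
from typing import Dict, List


def _hash(token: str, seed: str) -> int:
    """Map a token to a 64-bit integer (illustrative hash, not from the paper)."""
    digest = hashlib.md5((seed + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def feature_hash(features: Dict[str, float], dim: int) -> List[float]:
    """Project a sparse feature map into `dim` buckets.

    Each feature is added to the bucket chosen by one hash, multiplied by
    a +1/-1 sign chosen by a second hash.
    """
    out = [0.0] * dim
    for name, value in features.items():
        index = _hash(name, "index:") % dim
        sign = 1.0 if _hash(name, "sign:") & 1 else -1.0
        out[index] += sign * value
    return out


doc = {"hash": 2.0, "function": 1.0, "random": 1.0}
print(feature_hash(doc, dim=8))
```

The random signs are what keep inner products between hashed vectors unbiased in expectation when the hash functions behave like truly random ones. Whether a practical hash function preserves this behaviour on structured input is the question the paper studies.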

### Prior-free and prior-dependent regret bounds for Thompson Sampling

We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by $14 \sqrt{n K}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\frac{1}{20} \sqrt{n K}$. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
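For readers unfamiliar with Thompson Sampling, the following is a minimal sketch for one common special case: Bernoulli rewards with independent uniform Beta(1, 1) priors on each arm. These modelling choices, the function name, and the example arm means are assumptions for illustration. The paper's results concern general priors and are not reproduced by this code.

```python
import random
from typing import List


def thompson_sampling(means: List[float], horizon: int, seed: int = 0) -> float:
    """Run Beta-Bernoulli Thompson Sampling and return the pseudo-regret.

    `means` holds the true (hidden) success probability of each arm and is
    used only to simulate rewards and to measure regret.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    failures = [0] * k
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior; play the largest.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - means[arm]
    return regret


print(thompson_sampling([0.2, 0.8], horizon=1000))
```

Here $n$ corresponds to `horizon` and $K$ to `len(means)`. A single simulated run on fixed arm means does not verify a Bayesian regret bound, which is an expectation over both the prior and the algorithm's randomness.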

### Bringing Media Analytics into View - IT Peer Network

Video content will become richer and more data-intensive as it evolves from HD to 4K to 360 and even 8K. Companies are moving these visual workloads to the cloud and edge in order to keep up with capacity, growth and service demands. With the emergence of edge computing and cloudified, 5G networks, organizations have an opportunity to deliver insights through artificial intelligence (AI) that complement new user experiences and are adaptable to the complexities of delivering video content to a global audience. Companies need a visual cloud and media analytics platform that is flexible enough to support changing business models and deployment options, software that enables rapid innovation, and hardware that can scale to provide a range of performance, all while being able to lower total cost of ownership and grow profitability. Intel launched the Intel Select Solutions for Visual Cloud to give companies an easier path towards successful content creation and delivery starting with the Intel Select Solution for Simulation and Visualization and Intel Select Solution for Visual Cloud Delivery Network.

### Alibaba Cloud Releases Machine Learning Algorithm Platform on GitHub

Alibaba Cloud, the data intelligence backbone of Alibaba Group, announced that the core code of Alink, its self-developed algorithm platform, has been made available as open source on GitHub, the world's largest developer community. The platform offers a broad range of algorithm libraries that support both batch and stream processing, which is critical for machine learning tasks such as online product recommendation and intelligent customer services. Data analysts and software developers can access the code on GitHub to build their own software, facilitating tasks such as statistical analysis, machine learning, real-time prediction, personalized recommendation and abnormality detection. "As a platform that consists of various algorithms combining learning in various data processing patterns, Alink can be a valuable option for developers looking for robust big data and advanced machine learning tools," said Yangqing Jia, President and Senior Fellow of Data Platform at Alibaba Cloud Intelligence. "As one of the top ten contributors to GitHub, we are committed to connecting with the open source community as early as possible in our software development cycles."

### Rapid Distance-Based Outlier Detection via Sampling

Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. Papers published at the Neural Information Processing Systems Conference.
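One simple sampling-based scheme of the kind described above scores each point by its distance to the nearest member of a single small random sample. The sketch below illustrates that idea under stated assumptions: the function name, the default sample size, the Euclidean metric, and the choice to skip a point's own entry in the sample are all illustrative and not confirmed by the abstract.

```python
import math
import random
from typing import List, Sequence


def sampling_outlier_scores(points: Sequence[Sequence[float]],
                            sample_size: int = 20,
                            seed: int = 0) -> List[float]:
    """Score each point by its distance to the nearest sampled point.

    A single random sample is drawn once and shared by all points.
    Higher scores suggest outliers.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    # Use at least two sampled points so that every point has a neighbor
    # in the sample other than itself.
    size = min(max(sample_size, 2), len(points))
    sample_idx = rng.sample(range(len(points)), size)
    scores = []
    for i, p in enumerate(points):
        dists = [math.dist(p, points[j]) for j in sample_idx if j != i]
        scores.append(min(dists))
    return scores


data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = sampling_outlier_scores(data, sample_size=3)
print(max(range(len(data)), key=lambda i: scores[i]))
```

Each score requires only `sample_size` distance computations, so the total cost grows linearly with the number of points, in contrast to a full k-nearest-neighbor search over the whole dataset.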

### Graphical Models for Recovering Probabilistic and Causal Queries from Missing Data

We address the problem of deciding whether a causal or probabilistic query is estimable from data corrupted by missing entries, given a model of the missingness process. We extend the results of Mohan et al. (2013) by presenting more general conditions for recovering probabilistic queries of the form $P(y \mid x)$ and $P(y, x)$ as well as causal queries of the form $P(y \mid do(x))$. We show that causal queries may be recoverable even when the factors in their identifying estimands are not recoverable. Specifically, we derive graphical conditions for recovering causal effects of the form $P(y \mid do(x))$ when $Y$ and its missingness mechanism are not d-separable. Finally, we apply our results to problems of attrition and characterize the recovery of causal effects from data corrupted by attrition.

### A Drifting-Games Analysis for Online Learning and Applications to Boosting

We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting them into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high-probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates.