# Communications

### Crowdsourcing via Pairwise Co-occurrences: Identifiability and Algorithms

The data deluge comes with high demands for data labeling. Crowdsourcing (or, more generally, ensemble learning) techniques aim to produce accurate labels by integrating noisy, non-expert labels from annotators. The classic Dawid-Skene estimator and its accompanying expectation maximization (EM) algorithm have been widely used, but their theoretical properties are not fully understood. Tensor methods were proposed to guarantee identification of the Dawid-Skene model, but sample complexity is a hurdle for applying such approaches, since tensor methods hinge on the availability of third-order statistics that are hard to estimate reliably from limited data. In this paper, we propose a framework using pairwise co-occurrences of the annotator responses, which naturally admits lower sample complexity.
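To make the central statistic concrete, the sketch below estimates the empirical co-occurrence matrix for one pair of annotators from the items they both labeled. It is only an illustration of the second-order statistic the abstract refers to, not the paper's identification algorithm; the function name `pairwise_cooccurrence` and the use of `None` for a missing response are assumptions made here. Under the Dawid-Skene assumption that annotators respond independently given the true label, the expectation of this matrix factors through the two annotators' confusion matrices and the class prior, which is what makes pairwise statistics informative.

```python
def pairwise_cooccurrence(resp_m, resp_l, num_classes):
    """Empirical joint frequency of two annotators' responses.

    resp_m[i] and resp_l[i] are the labels (0..num_classes-1) the two
    annotators gave to item i, or None if the annotator skipped it.
    Entry [a][b] is the fraction of co-labeled items on which the first
    annotator answered a and the second answered b.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    total = 0
    for a, b in zip(resp_m, resp_l):
        if a is None or b is None:
            continue  # only items labeled by both annotators contribute
        counts[a][b] += 1
        total += 1
    if total == 0:
        raise ValueError("annotators share no labeled items")
    return [[c / total for c in row] for row in counts]


# Hypothetical toy responses for two annotators on five items.
annotator_m = [0, 1, 1, None, 0]
annotator_l = [0, 1, 0, 1, None]
R = pairwise_cooccurrence(annotator_m, annotator_l, num_classes=2)
```

Each entry only needs pairs of responses on the same item, so it can be estimated from far fewer jointly labeled items than a third-order statistic that requires three annotators to overlap.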

### Webinar: "How Conversational Intelligence Drives Better Business Outcomes"

In spite of the hype around omnichannel customer journeys, many businesses regard the phone as their primary channel for converting leads, making appointments and bringing in new customers. As communication paths between brands and customers change, call intelligence has morphed into "conversational intelligence." Marchex, a conversational intelligence solution provider focused on solving marketing and sales challenges for businesses, has developed an adaptive approach to applying both analytics and elements of AI to help identify deep consumer intent and recover "lost sales" in order to boost business outcomes. In this free webinar (Tuesday, December 10th), Opus Research and Marchex discuss how conversational intelligence can drive better business outcomes that can be measured and validated: converting leads, increasing close rates and raising the associated revenue. Don't miss this opportunity to boost sales outcomes with conversational intelligence.

Kaspersky researchers recently found malware in an app called CamScanner, a phone-based PDF creator that includes OCR (optical character recognition) and has more than 100 million downloads in Google Play. Various resources call the app by slightly different names such as CamScanner -- Phone PDF Creator and CamScanner-Scanner to scan PDFs. Official app stores such as Google Play are usually considered a safe haven for downloading software. Unfortunately, nothing is 100% safe, and from time to time malware distributors manage to sneak their apps into Google Play. The problem is that even such a powerful company as Google can't thoroughly check millions of apps.

### Efficient Online Inference for Bayesian Nonparametric Relational Models

Stochastic block models characterize observed network relationships via latent community memberships. In large social networks, we expect entities to participate in multiple communities, and the number of communities to grow with the network size. We introduce a new model for these phenomena, the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. To allow scalable learning, we derive an online stochastic variational inference algorithm. Focusing on assortative models of undirected networks, we also propose an efficient structured mean field variational bound, and online methods for automatically pruning unused communities.
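As a point of reference for the assortative setting mentioned above, here is a minimal sketch of sampling an undirected network from a plain stochastic block model. It is deliberately simpler than the model in the abstract: each node has a single, fixed community rather than mixed membership in an unbounded set, and the names `sample_assortative_sbm`, `p_in`, and `p_out` are hypothetical.

```python
import random


def sample_assortative_sbm(memberships, p_in, p_out, rng):
    """Sample undirected edges from a single-membership block model.

    memberships[i] is node i's community id. Two nodes are linked with
    probability p_in if they share a community and p_out otherwise;
    "assortative" means p_in > p_out.
    """
    n = len(memberships)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is visited once
            p = p_in if memberships[i] == memberships[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return edges


# Hypothetical example: two communities of three nodes each.
demo_edges = sample_assortative_sbm([0, 0, 0, 1, 1, 1], p_in=0.8, p_out=0.05,
                                    rng=random.Random(0))
```

Inference runs in the opposite direction: given only the edges, recover the latent memberships, which in the abstract's model are mixed and drawn from an unbounded set of communities.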

### Learning Mixture of Gaussians with Streaming Data

In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers are $C\sigma$ distant with $C = \Omega((k\log k)^{1/4}\sigma)$ and where $\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to certain constants); our center separation requirement matches the best known result for spherical Gaussians \citep{vempalawang}. For finite samples, we show that a bias term based on the initial estimate decreases at $O(1/{\rm poly}(N))$ rate while variance decreases at nearly optimal rate of $\sigma^2 d/N$. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers for which we provide an online PCA based clustering algorithm.
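The single-pass idea can be sketched as follows: each arriving point is assigned to its nearest current center, and only that center is nudged toward the point. This is an illustrative version under stated assumptions, not the analyzed algorithm itself. In particular, the `1/count` step size (a running mean) is a choice made here, the initial centers are simply handed in rather than produced by the online PCA seeding step, and all names are hypothetical.

```python
import random


def streaming_lloyd(stream, init_centers):
    """One pass over the stream with nearest-center updates.

    Each point moves its closest center by a step of 1/count, so every
    center is the running mean of its initial value and the points
    assigned to it so far.
    """
    centers = [list(c) for c in init_centers]
    counts = [1] * len(centers)  # the initial estimate counts as one point
    for x in stream:
        # Index of the nearest center in squared Euclidean distance.
        j = min(range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], x)))
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [c + eta * (a - c) for c, a in zip(centers[j], x)]
    return centers


def sample_mixture(n, true_centers, sigma, rng):
    """Yield n points from an equal-weight mixture of spherical Gaussians."""
    for _ in range(n):
        mu = rng.choice(true_centers)
        yield [rng.gauss(m, sigma) for m in mu]


rng = random.Random(0)
true_centers = [[0.0, 0.0], [10.0, 10.0]]  # well separated relative to sigma = 1
init = [[1.0, -1.0], [9.0, 11.0]]          # rough initial estimates
est = streaming_lloyd(sample_mixture(5000, true_centers, 1.0, rng), init)
```

Because the points are consumed from a generator, the sketch never stores the stream, and memory use is proportional to the number of centers times the dimension.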

### Matrix Norm Estimation from a Few Entries

Singular values of data in matrix form provide insights into the structure of the data, its effective dimensionality, and the choice of hyper-parameters for higher-level data analysis tools. However, in many practical applications such as collaborative filtering and network analysis, we only get a partial observation. Under such scenarios, we consider the fundamental problem of recovering various spectral properties of the underlying matrix from a sampling of its entries. We propose a framework of first estimating the Schatten $k$-norms of a matrix for several values of $k$, and using these as surrogates for estimating spectral properties of interest, such as the spectrum itself or the rank. This paper focuses on the technical challenges in accurately estimating the Schatten norms from a sampling of a matrix.
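For orientation, the Schatten $k$-norm of a matrix $A$ is $(\sum_i \sigma_i^k)^{1/k}$, where the $\sigma_i$ are the singular values. The sketch below computes it for even $k$ on a fully observed matrix using the identity $\sum_i \sigma_i^k = \mathrm{tr}((A^\top A)^{k/2})$. It shows the target quantity only; it is not the paper's estimator, which works from a sample of the entries. The function names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def schatten_norm_even(A, k):
    """Schatten k-norm of a fully observed matrix A for even k >= 2.

    Uses sum_i sigma_i^k = trace((A^T A)^(k/2)), which avoids computing
    a singular value decomposition.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("this sketch only handles even k >= 2")
    gram = matmul([list(col) for col in zip(*A)], A)  # A^T A
    power = gram
    for _ in range(k // 2 - 1):
        power = matmul(power, gram)
    trace = sum(power[i][i] for i in range(len(power)))
    return trace ** (1.0 / k)


A = [[3.0, 0.0], [0.0, 4.0]]  # singular values 3 and 4
frobenius = schatten_norm_even(A, 2)  # k = 2 is the Frobenius norm
```

As $k$ grows, the norm approaches the largest singular value, which is one reason norms for several values of $k$ together carry information about the whole spectrum.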

### A primal-dual method for conic constrained distributed optimization problems

We consider cooperative multi-agent consensus optimization problems over an undirected network of agents, where only those agents connected by an edge can directly communicate. The objective is to minimize the sum of agent-specific composite convex functions over agent-specific private conic constraint sets; hence, the optimal consensus decision should lie in the intersection of these private sets. We provide convergence rates in sub-optimality, infeasibility and consensus violation; examine the effect of underlying network topology on the convergence rates of the proposed decentralized algorithms; and show how to extend these methods to handle time-varying communication networks. Papers published at the Neural Information Processing Systems Conference.

### Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing

The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor.
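The second stage can be illustrated with a compact EM loop for the Dawid-Skene model. This is a sketch under stated assumptions rather than the paper's algorithm: the initial posteriors come from per-item vote fractions as a stand-in for the spectral first stage, a small smoothing constant is added to avoid zero probabilities, and the names `dawid_skene_em` and `smooth` are hypothetical. The simulated workers at the end are toy data.

```python
import random


def dawid_skene_em(labels, K, iters=20, smooth=1e-2):
    """EM for the Dawid-Skene model.

    labels[i][w] is worker w's label for item i (0..K-1), or None.
    Returns the predicted labels and confusion matrices conf[w][true][seen].
    """
    n, W = len(labels), len(labels[0])
    # Initialization from vote fractions (stand-in for a spectral estimate).
    post = []
    for row in labels:
        votes = [smooth] * K
        for r in row:
            if r is not None:
                votes[r] += 1.0
        s = sum(votes)
        post.append([v / s for v in votes])
    for _ in range(iters):
        # M-step: class prior and per-worker confusion matrices.
        prior = [sum(p[k] for p in post) / n for k in range(K)]
        conf = [[[smooth] * K for _ in range(K)] for _ in range(W)]
        for p, row in zip(post, labels):
            for w, r in enumerate(row):
                if r is not None:
                    for k in range(K):
                        conf[w][k][r] += p[k]
        for w in range(W):
            for k in range(K):
                s = sum(conf[w][k])
                conf[w][k] = [c / s for c in conf[w][k]]
        # E-step: posterior over each item's true label.
        new_post = []
        for row in labels:
            like = list(prior)
            for w, r in enumerate(row):
                if r is not None:
                    for k in range(K):
                        like[k] *= conf[w][k][r]
            s = sum(like)
            new_post.append([v / s for v in like])
        post = new_post
    return [max(range(K), key=lambda k: p[k]) for p in post], conf


# Toy simulation: 300 binary items, three workers who are each right 90% of the time.
rng = random.Random(1)
truth = [rng.randrange(2) for _ in range(300)]
labels = [[t if rng.random() < 0.9 else 1 - t for _ in range(3)] for t in truth]
pred, conf = dawid_skene_em(labels, K=2)
accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
```

Since the log-likelihood is non-convex, the result of a loop like this depends on where it starts, which is the motivation for replacing the vote-based initialization with a spectral estimate.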

### Facebook Gives Workers a Chatbot to Appease That Prying Uncle

The answers were put together by Facebook's public relations department, parroting what company executives have publicly said. And the chatbot has a name: the "Liam Bot." (The provenance of the name is unclear.) "Our employees regularly ask for information to use with friends and family on topics that have been in the news, especially around the holidays," a Facebook spokeswoman said. "We put this into a chatbot, which we began testing this spring." Facebook's reputation has been shredded by a string of scandals -- including how the site spreads disinformation and can be used to meddle in elections -- in recent years.

### Deep Tech with Avrohom Gottheil

We are living in an era of information overload: there is an overabundance of information, yet it is hard to ascertain what is credible and what is not. Technology changes rapidly, and innovation is on the rise. How can we educate ourselves to know: (1) What products are available on the market?

INTERVIEW HIGHLIGHTS: This episode of #AskTheCEO features a presentation Avrohom Gottheil gave in New Delhi, India for India's First Annual Deep Tech Summit, titled Deep Tech for All. "Time is the new currency, and that is what's driving the mass adoption of voice-based technology in the marketplace," said Avrohom. [13:30] J. Dianne Dotson, Science Fiction Writer and Research Scientist, shares how in the future we will be able to leverage AI to search global DNA databases, such as 23andMe, and analyze people's genomes for disease-causing proteins so that we can disable them and stop diseases from spreading, right from the source.