Information Retrieval
Fast Amortized Inference and Learning in Log-linear Models with Randomly Perturbed Nearest Neighbor Search
Mussmann, Stephen, Levy, Daniel, Ermon, Stefano
Inference in log-linear models scales linearly in the size of the output space in the worst case. This is often a bottleneck in natural language processing and computer vision tasks when the output space is feasibly enumerable but very large. We propose a method to perform inference in log-linear models with sublinear amortized cost. Our idea hinges on using Gumbel random variable perturbations and a pre-computed Maximum Inner Product Search data structure to access the most-likely elements in sublinear amortized time.
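The Gumbel perturbation idea mentioned in this abstract can be illustrated with the standard Gumbel-max trick: adding independent Gumbel(0, 1) noise to each unnormalized log-probability and taking the argmax yields an exact sample from the corresponding softmax distribution. The sketch below is a plain linear-time version for illustration only; it does not use a Maximum Inner Product Search structure and is not the authors' sublinear algorithm. All function names are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform."""
    u = rng.random()
    # random() returns values in [0, 1); redraw on exactly 0 to avoid log(0).
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(scores, rng):
    """Return index i with probability proportional to exp(scores[i])."""
    perturbed = [s + gumbel_noise(rng) for s in scores]
    return max(range(len(scores)), key=perturbed.__getitem__)


def softmax(scores):
    """Reference probabilities, computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [1.0, 2.0, 0.5]  # illustrative unnormalized log-probabilities
    draws = 20000
    counts = [0] * len(scores)
    for _ in range(draws):
        counts[gumbel_max_sample(scores, rng)] += 1
    print("empirical:", [c / draws for c in counts])
    print("softmax:  ", softmax(scores))
```

This version perturbs every score, so each sample costs time linear in the output space. The abstract's contribution is to avoid that full scan by retrieving only the most likely elements from a pre-computed index.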
Deep Character-Level Click-Through Rate Prediction for Sponsored Search
Edizel, Bora, Mantrach, Amin, Bai, Xiao
Predicting the click-through rate of an advertisement is a critical component of online advertising platforms. In sponsored search, the click-through rate estimates the probability that a displayed advertisement is clicked by a user after she submits a query to the search engine. Commercial search engines typically rely on machine learning models trained with a large number of features to make such predictions. This inevitably requires a lot of engineering effort to define, compute, and select the appropriate features. In this paper, we propose two novel approaches (one working at character level and the other working at word level) that use deep convolutional neural networks to predict the click-through rate of a query-advertisement pair. Specifically, the proposed architectures only consider the textual content appearing in a query-advertisement pair as input, and produce as output a click-through rate prediction. By comparing the character-level model with the word-level model, we show that language representation can be learnt from scratch at character level when trained on enough data. Through extensive experiments using billions of query-advertisement pairs of a popular commercial search engine, we demonstrate that both approaches significantly outperform a baseline model built on well-selected text features and a state-of-the-art word2vec-based approach. Finally, by combining the predictions of the deep models introduced in this study with the prediction of the model in production of the same commercial search engine, we significantly improve the accuracy and the calibration of the click-through rate prediction of the production system.
Music, AI and Health: What is Music Information Retrieval?
Sync Project has made products like Sync Music Bot and Unwind using cutting-edge music technology with the aim to develop personalized music as medicine. Let's take a look at Music Information Retrieval, the technological "glue" that allows us to synthesize advances in AI, music production and licensed music distribution to derive insights into music listening behavior and physiology. When we listen to a song, we hear a coherent blend of numerous instruments to create melody, harmony, and rhythm. When a computer listens to that same song, it "hears" a binary code--digital data represented by either "0" or "1"--that corresponds to the complex waveform of the audio file. When we listen to a song, we can easily determine what the genre is, if it's a happy or sad song, what instruments are present.
Artificial Intelligence and the Future of Search Engines
It was not long ago that Artificial Intelligence (AI) was only in the realm of science fiction. Today, it has become a reality and is growing more prominent in many different industries every day. This includes the internet, as AI in search engine technology has been around for a few years. The algorithms used to rank pages have been affected considerably by AI already, and that trend will continue into the foreseeable future. Currently, Google's RankBrain, an AI process used to help set search engine rankings, is having a major impact, which is only expected to expand.
To tackle Google's power, regulators have to go after its ownership of data
The problem with regulating technology companies is that, faced with tough new rules, they can eventually innovate their way out, often by switching to newer, unregulated technologies. The risk of targeted regulation informed by little other than economic doctrines might even be fuelling a corporate quest for eternal disruption: instead of surrendering to the regulators, technology firms prefer to abandon their old business model. It's through this lens that we should interpret the likely fallout from the €2.4bn fine imposed on Alphabet, Google's parent company, by the European commission. It arrives after a lengthy, seven-year investigation into whether the company abused its dominance to promote its own online shopping service above search results. The commission's case seems sound; the sad fate of small online retailers, unable to compete with Alphabet over the past decade, suggests as much.
Bolt: Accelerated Data Mining with Fast Vector Compression
Blalock, Davis W, Guttag, John V
Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization algorithm that can compress vectors over 12x faster than existing techniques while also accelerating approximate vector operations such as distance and dot product computations by up to 10x. Because it can encode over 2GB of vectors per second, it makes vector quantization cheap enough to employ in many more circumstances. For example, using our technique to compute approximate dot products in a nested loop can multiply matrices faster than a state-of-the-art BLAS implementation, even when our algorithm must first compress the matrices. In addition to showing the above speedups, we demonstrate that our approach can accelerate nearest neighbor search and maximum inner product search by over 100x compared to floating point operations and up to 10x compared to other vector quantization methods. Our approximate Euclidean distance and dot product computations are not only faster than those of related algorithms with slower encodings, but also faster than Hamming distance computations, which have direct hardware support on the tested platforms. We also assess the errors of our algorithm's approximate distances and dot products, and find that it is competitive with existing, slower vector quantization algorithms.
In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Marchant, Neil G., Rubinstein, Benjamin I. P.
Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that resulting estimates of F-measure, precision, recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% without loss to estimate accuracy.
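As a rough illustration of estimating F-measure from labels sampled under a biased instrumental distribution, the sketch below uses static importance sampling: items are drawn with probability proportional to a similarity score, and each label is reweighted by the ratio of the uniform target probability to the sampling probability. This is a simplified stand-in, not the OASIS algorithm; it omits the adaptive Bayesian update and the optimal choice of instrumental distribution. The toy pool, the `eps` smoothing constant and all function names are assumptions made for illustration.

```python
import random


def make_toy_pool():
    """Hypothetical imbalanced pool: 100 matches among 2000 record pairs."""
    labels = [1] * 100 + [0] * 1900
    # 80 true positives, 20 false negatives, 40 false positives, rest easy negatives.
    scores = [0.9] * 80 + [0.4] * 20 + [0.9] * 40 + [0.01] * 1860
    preds = [1 if s >= 0.5 else 0 for s in scores]
    return preds, scores, labels


def true_f_measure(preds, labels):
    """Balanced F-measure computed from the full set of labels."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    return 2 * tp / (2 * tp + fp + fn)


def importance_sampled_f(preds, scores, oracle, n_samples, rng, eps=0.01):
    """Estimate F-measure from n_samples oracle labels drawn with bias."""
    n_items = len(preds)
    raw = [s + eps for s in scores]  # eps keeps every item reachable
    total = sum(raw)
    q = [r / total for r in raw]  # instrumental (sampling) distribution
    drawn = rng.choices(range(n_items), weights=q, k=n_samples)
    num = den_pred = den_true = 0.0
    for i in drawn:
        w = (1.0 / n_items) / q[i]  # importance weight: uniform target / q
        label = oracle(i)
        num += w * label * preds[i]  # weighted true positives
        den_pred += w * preds[i]  # weighted predicted positives
        den_true += w * label  # weighted actual positives
    denom = 0.5 * den_pred + 0.5 * den_true
    return num / denom if denom > 0 else 0.0


if __name__ == "__main__":
    preds, scores, labels = make_toy_pool()
    estimate = importance_sampled_f(
        preds, scores, lambda i: labels[i], 1000, random.Random(0)
    )
    print("estimate:", estimate, "truth:", true_f_measure(preds, labels))
```

Concentrating samples on high-scoring pairs is what makes the estimate usable under class imbalance: uniform sampling of the same pool would spend most of the labelling budget on obvious non-matches.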
Digital Marketing Tips For Small Businesses 2015 - Booming
Today, businesses have more ways – and places – than ever to market themselves. Your local digital marketing strategy should specifically target and appeal to potential customers in your geographic area. Many local companies have used some form of digital marketing online even if they are not aware of it. This is an important local digital marketing tip for any business. But for local businesses, it can be even more essential. Customers who are looking for a restaurant, store or other local business are likely to do a search on their phone or mobile device. If you don't have a mobile-optimized site, not only will it be difficult for them to interact with your site, but it will also be difficult for them to find it in the first place. If you want local customers, either on mobile or desktop, to find you, you have to have a comprehensive search strategy.
Query Complexity of Clustering with Side Information
Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, "do two elements $u$ and $v$ belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and provide strong information-theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information, in the form of a similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships, such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\mathcal{H}^2(f_+\|f_-)})$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an $O(\log{n})$ factor. Our algorithms are all efficient and parameter free, i.e., they work without any knowledge of $k$, $f_+$ and $f_-$, and their query complexity depends only logarithmically on $n$. Along the way, our work also reveals an intriguing connection to popular community detection models such as the stochastic block model, significantly generalizes them, and opens up many avenues for interesting future research.
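To make the $\Theta(nk)$ baseline concrete, here is a minimal sketch of the straightforward query strategy when no similarity matrix is available: compare each new element against one representative of every cluster found so far, which uses at most $k$ queries per element and therefore at most $nk$ in total. This is the naive baseline only, not the paper's side-information algorithm. The function name and the `same_cluster` oracle callback are hypothetical.

```python
def cluster_with_oracle(elements, same_cluster):
    """Recover clusters using only pairwise same-cluster queries.

    same_cluster(u, v) is assumed to be a perfect oracle returning True
    exactly when u and v belong to the same hidden cluster.
    Returns (clusters, number_of_queries).
    """
    clusters = []
    queries = 0
    for x in elements:
        for cluster in clusters:
            queries += 1
            # One query against the cluster's first member is enough,
            # because the oracle is assumed to be noiseless.
            if same_cluster(cluster[0], x):
                cluster.append(x)
                break
        else:
            # No existing cluster matched, so x starts a new one.
            clusters.append([x])
    return clusters, queries


if __name__ == "__main__":
    hidden = {i: i % 3 for i in range(30)}  # illustrative ground truth, k = 3
    found, used = cluster_with_oracle(
        list(range(30)), lambda u, v: hidden[u] == hidden[v]
    )
    print(len(found), "clusters recovered with", used, "queries")
```

The abstract's result is that a noisy similarity matrix lets an algorithm skip most of these comparisons, so that the dependence on $n$ becomes logarithmic.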
A Signaling Game Approach to Databases Querying and Interaction
McCamish, Ben, Termehchy, Arash, Touri, Behrouz
As most database users cannot precisely express their information needs, it is challenging for database management systems to understand them. We propose a novel formal framework for representing and understanding information needs in database querying and exploration. Our framework considers querying as a collaboration between the user and the database management system to establish a mutual language for representing information needs. We formalize this collaboration as a signaling game, where each mutual language is an equilibrium for the game. A query interface is more effective if it establishes a less ambiguous mutual language faster. We discuss some equilibria, strategies, and the convergence in this game. In particular, we propose a reinforcement learning mechanism and analyze it within our framework. We prove that this adaptation mechanism for the query interface improves the effectiveness of answering queries stochastically speaking, and converges almost surely. We extend our results to the cases where the user also modifies her strategy during the interaction.