 Information Retrieval


Phoebe: A Learning-based Checkpoint Optimizer

arXiv.org Artificial Intelligence

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans, and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimal performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.


PhD Candidate for Fairness and Non-discrimination in Machine Learning for Information Retrieval and Recommendation

#artificialintelligence

Are you fascinated by the possibilities of machine learning systems and is it important to you that these technologies are used fairly? As a PhD Candidate, your research aims to answer the question of how information retrieval systems based on machine learning can be used in a non-discriminatory and fair way. Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and to set insurance premiums, and companies can use them to select the best people from a list of job applicants. Such systems can lead to more efficiency, and could improve our society in many ways.


USER: A Unified Information Search and Recommendation Model based on Integrated Behavior Sequence

arXiv.org Artificial Intelligence

Search and recommendation are the two most common approaches used by people to obtain information. They share the same goal -- satisfying the user's information need at the right time. There are already a lot of Internet platforms and apps providing both search and recommendation services, showing us the demand and opportunity to simultaneously handle both tasks. However, most platforms consider these two tasks independently -- they tend to train separate search and recommendation models, without exploiting the relatedness and dependency between them. In this paper, we argue that jointly modeling these two tasks will benefit both of them and ultimately improve overall user satisfaction. We investigate the interactions between these two tasks in the specific information content service domain. We propose first integrating the user's behaviors in search and recommendation into a heterogeneous behavior sequence, then utilizing a joint model for handling both tasks based on the unified sequence. More specifically, we design the Unified Information Search and Recommendation model (USER), which mines user interests from the integrated sequence and accomplishes the two tasks in a unified way.
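
The integration step in this abstract can be illustrated with a minimal sketch, assuming each behavior is a timestamped record and that "integrating" means interleaving the two logs in chronological order. The `Behavior` type, its field names, and `merge_behaviors` are hypothetical and are not taken from the paper. The learned joint model that would consume the sequence is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Behavior:
    timestamp: int               # event time, e.g. seconds since epoch
    kind: str                    # "search" or "browse" (hypothetical labels)
    item_id: str                 # item the user clicked or browsed
    query: Optional[str] = None  # only set for search behaviors


def merge_behaviors(search: List[Behavior], browse: List[Behavior]) -> List[Behavior]:
    """Interleave two behavior logs into one chronologically ordered sequence."""
    return sorted(search + browse, key=lambda b: b.timestamp)


# Toy logs (made-up data for illustration only).
search_log = [
    Behavior(10, "search", "doc7", query="acrylic paint"),
    Behavior(40, "search", "doc2", query="brush care"),
]
browse_log = [
    Behavior(5, "browse", "doc1"),
    Behavior(25, "browse", "doc9"),
]

unified = merge_behaviors(search_log, browse_log)
kinds = [b.kind for b in unified]
```

Keeping the `kind` field on every record is what makes the merged sequence heterogeneous: a downstream model can still tell search events from recommendation events after the merge.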


Google announces redesign of Search engine with more pictures and extra context about results

The Independent - Tech

Google has announced a redesign of its search tools, making them more visual and adding extra contextual information about results. At its Search On event, the web giant also announced new features for Google Chrome and its Google Lens artificially-intelligent photo software. The main aesthetic change is visually browsable results "for searches where you need inspiration", such as "pour painting ideas", Google says, which will surface a series of pictures at the top of search results without users having to navigate to the Images tab. It will also bring in more contextual information, rolled out over the coming months, with a new "Things to know" section that includes "different dimensions people typically search for". For those searching how to paint with acrylics, for example, underneath the top result will be a series of drop-down results that include a step-by-step guide, tips, or style options.


The Best Ways to Optimize Your Content for SEO: The Ultimate Guide

#artificialintelligence

Search engine optimization is the process of driving traffic to a website through organic search results. This means that people are finding your content organically in search engines like Google, Yahoo, and Bing. Given that Google owns both YouTube and Gmail, it's no surprise that videos and emails are two big ways to rank for SEO. This comprehensive SEO guide will walk you through all the best tips to optimize your content for SEO. You'll learn how to build links, use keywords effectively, write engaging copy, create video content that attracts viewers, and more!


RAFT: A Real-World Few-Shot Text Classification Benchmark

arXiv.org Artificial Intelligence

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org .


Synthetic Data Does Not Reliably Protect Privacy, Researchers Claim

#artificialintelligence

A new research collaboration between France and the UK casts doubt on growing industry confidence that synthetic data can resolve the privacy, quality and availability issues (among other issues) that threaten progress in the machine learning sector. Among several key points addressed, the authors assert that synthetic data modeled from real data retains enough of the genuine information that it provides no reliable protection from inference and membership attacks, which seek to deanonymize data and re-associate it with actual people. Furthermore, the individuals most at risk from such attacks, including those with critical medical conditions or high hospital bills (in the case of medical record anonymization), are, through the 'outlier' nature of their condition, most likely to be re-identified by these techniques. 'Given access to a synthetic dataset, a strategic adversary can infer, with high confidence, the presence of a target record in the original data.' The paper also notes that differentially private synthetic data, which obscures the signature of individual records, does indeed protect individuals' privacy, but only by significantly crippling the usefulness of the information retrieval systems that use it.


Pull and Push - How Machines Deliver Text Data To Human

#artificialintelligence

In this blog post we'll take a look at how information is delivered to human beings by machines. There are in fact different strategies that identify not only the context of information retrieval, but also user intent and means of delivery. We'll look into what information retrieval is, how user intent defines the objective and how this objective is achieved by specific information delivery systems. Information Retrieval (IR) is the process of gaining knowledge from a source of data in the environment. This environment can be explored in several ways to obtain such information, depending on its state and the state of the user.


Query Evaluation in DatalogMTL -- Taming Infinite Query Results

arXiv.org Artificial Intelligence

In this paper, we investigate finite representations of DatalogMTL. First, we introduce programs that have finite models and propose a toolkit for structuring the execution of DatalogMTL rules into sequential phases. Then, we study infinite models that eventually become constant and introduce sufficient criteria for programs that allow for such a representation. We proceed by considering infinite models that are eventually periodic and show that such a representation encompasses all DatalogMTLFP programs, a widely discussed fragment. Finally, we provide a novel algorithm for reasoning over finitely representable DatalogMTL programs that incorporates all of the previously discussed representations.


DialogueBERT: A Self-Supervised Learning based Dialogue Pre-training Encoder

arXiv.org Artificial Intelligence

With the rapid development of artificial intelligence, conversational bots have become prevalent on mainstream e-commerce platforms, where they can provide convenient and timely customer service. To satisfy the user, conversational bots need to understand the user's intention, detect the user's emotion, and extract the key entities from the conversational utterances. However, understanding dialogues is regarded as a very challenging task. Unlike in common language understanding, utterances in dialogues alternate between different roles and are usually organized as hierarchical structures. To facilitate the understanding of dialogues, in this paper, we propose a novel contextual dialogue encoder (i.e. DialogueBERT) based on the popular pre-trained language model BERT. Five self-supervised learning pre-training tasks are devised for learning the particularity of dialogue utterances. Four different input embeddings are integrated to capture the relationship between utterances, including turn embedding, role embedding, token embedding and position embedding. DialogueBERT was pre-trained with 70 million dialogues from real scenarios, and then fine-tuned on three different downstream dialogue understanding tasks. Experimental results show that DialogueBERT achieves exciting results with 88.63% accuracy for intent recognition, 94.25% accuracy for emotion recognition and 97.04% F1 score for named entity recognition, which outperforms several strong baselines by a large margin.
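
The abstract does not say how the four input embeddings are combined. As a rough sketch, assuming they are summed element-wise in the way BERT sums its token, segment and position embeddings, the input representation for one token could look like the following. The table sizes, the width `DIM`, the random initialisation and the `embed` function are all hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List

DIM = 4  # toy embedding width (hypothetical)
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def make_table(size: int) -> List[List[float]]:
    """Randomly initialised lookup table with `size` rows of width DIM."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(size)]


# One lookup table per embedding type named in the abstract.
tables: Dict[str, List[List[float]]] = {
    "token": make_table(100),    # vocabulary of 100 tokens (assumed)
    "position": make_table(32),  # up to 32 positions (assumed)
    "turn": make_table(8),       # up to 8 dialogue turns (assumed)
    "role": make_table(2),       # e.g. customer vs. agent (assumed)
}


def embed(token_id: int, position: int, turn: int, role: int) -> List[float]:
    """Element-wise sum of the four embeddings for a single input token."""
    rows = [
        tables["token"][token_id],
        tables["position"][position],
        tables["turn"][turn],
        tables["role"][role],
    ]
    return [sum(values) for values in zip(*rows)]


vector = embed(token_id=42, position=3, turn=1, role=0)
```

Because the turn and role tables are indexed per utterance rather than per token, every token in the same utterance shares those two components, which is one plausible way for an encoder to see who is speaking and when.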