 Information Retrieval


Recommendation or Discrimination?: Quantifying Distribution Parity in Information Retrieval Systems

arXiv.org Machine Learning

Information retrieval (IR) systems often leverage query data to suggest relevant items to users. This introduces the possibility of unfairness if the query (i.e., input) and the resulting recommendations unintentionally correlate with latent factors that are protected variables (e.g., race, gender, and age). For instance, a visual search system for fashion recommendations may pick up on features of the human models rather than fashion garments when generating recommendations. In this work, we introduce a statistical test for "distribution parity" in the top-K IR results, which assesses whether a given set of recommendations is fair with respect to a specific protected variable. We evaluate our test using both simulated and empirical results. First, using artificially biased recommendations, we demonstrate the trade-off between statistically detectable bias and the size of the search catalog. Second, we apply our test to a visual search system for fashion garments, specifically testing for recommendation bias based on the skin tone of fashion models. Our distribution parity test can help ensure that IR systems' results are fair and produce a good experience for all users.
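To make the idea of testing top-K results for parity concrete, here is a minimal sketch in Python. It is not the authors' test: it assumes a plain chi-square goodness-of-fit comparison between the protected-attribute mix of the top-K results and the mix of the whole catalog. The function name `parity_chi_square` and the group labels are hypothetical.

```python
from collections import Counter


def parity_chi_square(catalog_groups, topk_groups):
    """Chi-square goodness-of-fit statistic (an assumed stand-in test).

    catalog_groups: protected-attribute label of every catalog item.
    topk_groups:    protected-attribute label of each top-K result,
                    assumed to be drawn from the same catalog.
    Returns (statistic, degrees_of_freedom).
    """
    catalog_counts = Counter(catalog_groups)
    topk_counts = Counter(topk_groups)
    n_catalog = len(catalog_groups)
    k = len(topk_groups)

    statistic = 0.0
    for group, count in catalog_counts.items():
        # Expected count if the top-K mirrored the catalog proportions.
        expected = k * count / n_catalog
        observed = topk_counts.get(group, 0)
        statistic += (observed - expected) ** 2 / expected

    return statistic, len(catalog_counts) - 1


# Hypothetical catalog with two equally sized groups.
catalog = ["group_a"] * 50 + ["group_b"] * 50
balanced = ["group_a"] * 5 + ["group_b"] * 5
skewed = ["group_a"] * 10

balanced_stat, dof = parity_chi_square(catalog, balanced)
skewed_stat, _ = parity_chi_square(catalog, skewed)
```

With one degree of freedom, the 5% critical value of the chi-square distribution is about 3.84, so under these assumptions the skewed list would be flagged and the balanced one would not. The chi-square approximation is rough when expected counts are small, which is common for small K.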


d-blink: Distributed End-to-End Bayesian Entity Resolution

arXiv.org Machine Learning

Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage" or d-blink -- the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that d-blink can achieve significant efficiency gains -- in excess of 300$\times$ -- when compared to existing non-distributed methods.
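The abstract mentions the Vose-Alias method without describing it. The sketch below shows the standard alias method for sampling from a discrete distribution in constant time per draw. It is not d-blink's perturbation sampling algorithm, only the building block it is said to leverage, and the function names are hypothetical.

```python
import random


def build_alias_table(weights):
    """Build alias-method tables for non-negative weights (not all zero)."""
    n = len(weights)
    total = sum(weights)
    # Scale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # keep column s with this probability
        alias[s] = l          # otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        if scaled[l] < 1.0:
            small.append(l)
        else:
            large.append(l)

    # Leftover columns are full (up to floating-point error).
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one index: pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias_table([1, 1, 2])
draw = alias_sample(prob, alias)
```

Table construction takes linear time in the number of outcomes, after which each draw needs only one uniform index and one coin flip. That is why the method suits repeated sampling from a fixed distribution.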


Google verticals, machine learning and no-click searches expected to have the biggest impacts on SEO - Search Engine Land

#artificialintelligence

Google entering verticals and competing directly against publishers, advancements in machine learning and AI, and zero-click searches are the trends most likely to affect SEO in the next three years, according to a SparkToro survey of over 1,500 SEOs. Trends that are here to stay? Respondents were presented with a list of choices and asked, "How much of an impact do you believe the following trends will have on SEO in the next 3 years?" Options were rated on a zero-to-four scale, with zero meaning "no impact" and four meaning "huge impact." The trends that respondents considered least likely to affect SEO included outcomes from US Congressional and Department of Justice investigations, visual search advances and "content-nudging" products such as Google Discover.


Accelerating Column Generation via Flexible Dual Optimal Inequalities with Application to Entity Resolution

arXiv.org Artificial Intelligence

In this paper, we introduce a new optimization approach to Entity Resolution. Traditional approaches tackle entity resolution with hierarchical clustering, which does not benefit from a formal optimization formulation. In contrast, we model entity resolution as correlation clustering, which we treat as a weighted set-packing problem and write as an integer linear program (ILP). In this case, sources in the input data correspond to elements, and entities in the output data correspond to sets/clusters. We tackle optimization of weighted set packing by relaxing integrality in our ILP formulation. The set of potential sets/clusters cannot be explicitly enumerated, which motivates optimization via column generation. In addition to the novel formulation, we also introduce new dual optimal inequalities (DOIs), which we call flexible dual optimal inequalities, that tightly lower-bound dual variables during optimization and accelerate column generation. We apply our formulation to entity resolution (also called de-duplication of records), and achieve state-of-the-art accuracy on two popular benchmark datasets.
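For orientation, the textbook LP relaxation of weighted set packing can be written as follows. The notation is assumed rather than taken from the paper: $\mathcal{D}$ is the set of elements (records), $\mathcal{G}$ the set of candidate clusters, $w_g$ the weight of cluster $g$, and $\gamma_g$ its selection variable. The paper's own sign convention and symbols may differ.

```latex
\begin{align}
\max_{\gamma} \quad & \sum_{g \in \mathcal{G}} w_g \gamma_g \\
\text{s.t.} \quad & \sum_{g \in \mathcal{G} \,:\, d \in g} \gamma_g \le 1
    \qquad \forall d \in \mathcal{D}, \\
& \gamma_g \ge 0 \qquad \forall g \in \mathcal{G}.
\end{align}
% Pricing step of column generation, with dual variables
% \lambda_d \ge 0 on the packing constraints: a cluster g that is
% not yet in the restricted problem can improve the objective only if
%   w_g - \sum_{d \in g} \lambda_d > 0.
```

The ILP replaces $\gamma_g \ge 0$ with $\gamma_g \in \{0, 1\}$. Column generation solves the relaxation over a small subset of $\mathcal{G}$ and uses the pricing condition to find clusters worth adding, which avoids enumerating every candidate cluster.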


PhD in Computing Science: Emerging information retrieval challenges when processing real-time data streams at University of Glasgow on FindAPhD.com

#artificialintelligence

Eligibility: Full funding is provided for EU/UK students (standard home/EU fees and stipend rates included). Non-EU/UK students can apply, but they would be required to pay the difference between the home/EU and international fee. Funding is available to cover tuition fees for UK/EU applicants for 3 years, as well as paying a stipend at the Research Council rate (estimated £15,009 for Session 2019-20). FTE Category A staff submitted: 41.60


CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases

arXiv.org Artificial Intelligence

CoSQL consists of 30k turns plus 10k annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing the user of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot-value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research. The dataset, baselines, and leaderboard will be released at https://yale-lily.github.io/cosql.
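To illustrate what it means for a dialogue state to be grounded in SQL rather than in slot-value pairs, here is a small sketch using Python's built-in `sqlite3` module. The table, rows, utterances, and queries are all hypothetical and are not taken from the CoSQL dataset.

```python
import sqlite3

# Hypothetical toy database; not a CoSQL database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, country TEXT, age INTEGER);
    INSERT INTO singer VALUES
        ('Ana', 'France', 31),
        ('Bo', 'France', 45),
        ('Cy', 'Japan', 28);
    """
)

# Each turn pairs a user utterance with the SQL query that represents
# the full dialogue state after that turn (hypothetical examples).
dialogue = [
    ("Which singers are from France?",
     "SELECT name FROM singer WHERE country = 'France' ORDER BY name"),
    ("Only those older than 40.",
     "SELECT name FROM singer WHERE country = 'France' AND age > 40 "
     "ORDER BY name"),
]


def run_turns(connection, turns):
    """Execute each turn's SQL state and collect the returned names."""
    results = []
    for utterance, sql in turns:
        rows = [row[0] for row in connection.execute(sql)]
        results.append((utterance, rows))
    return results


answers = run_turns(conn, dialogue)
```

The second query carries the constraint from the first turn forward, so the state after each turn is a complete, executable query. A slot-value tracker would instead need a schema-specific slot such as `country=France`, which does not transfer to an unseen database.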


Four ways you can use AI to optimize your AdWords campaigns - Search Engine Watch

#artificialintelligence

Artificial intelligence (AI) and machine learning algorithms are mainstreaming in a way that was never before possible, and these changes are having a significant influence on the way in which marketers need to approach search advertising. In addition to AdWords itself incorporating AI into its framework, new opportunities are arising that can give marketers an edge over their competitors, or automate lower-level tasks, freeing up more time for strategy. Here are four ways you can start taking advantage of AI to make the most of your AdWords campaigns. Automated machine learning as a solution to the decision of what price to bid on paid advertising is becoming an increasingly popular option as the necessary technologies become available to more firms. Bidding too low means missing out on opportunities to reach leads, while bidding too high means sacrificing ROI.


How A New Design And Content Helps Your Website To Rank Higher On Search Engines

#artificialintelligence

Are you looking for ways to increase your website ranking? Trying to figure out how to reach the top of Google's search results pages? Well, the information below is going to be your best friend. The website is the heart of any business in this digital world. The most important share of traffic comes from organic searches.


General Fragment Model for Information Artifacts

arXiv.org Artificial Intelligence

The use of semantic descriptions in data-intensive domains requires a systematic model for linking semantic descriptions with their manifestations in fragments of heterogeneous information and data objects. Such information heterogeneity requires a fragment model that is general enough to support the specification of anchors from conceptual models to multiple types of information artifacts. While diverse proposals of anchoring models exist in the literature, they are usually focused on audiovisual information. We propose a generalized fragment model that can be instantiated to different kinds of information artifacts. Our objective is to systematize the way in which fragments and anchors can be described in conceptual models, without committing to a specific vocabulary.


Endless AI-generated spam risks clogging up Google's search results

#artificialintelligence

Over the past year, AI systems have made huge strides in their ability to generate convincing text, churning out everything from song lyrics to short stories. Experts have warned that these tools could be used to spread political disinformation, but there's another target that's equally plausible and potentially more lucrative: gaming Google. Instead of being used to create fake news, AI could churn out infinite blogs, websites, and marketing spam. The content would be cheap to produce and stuffed full of relevant keywords. But like most AI-generated text, it would only have surface meaning, with little correspondence to the real world.