 Information Retrieval


A Comparison of Approaches for Imbalanced Classification Problems in the Context of Retrieving Relevant Documents for an Analysis

arXiv.org Machine Learning

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for the analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists risks drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g.
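The conventional keyword rule described in this abstract (a document counts as relevant if it contains at least one keyword) can be sketched in a few lines. This is a minimal illustration, not the study's code: the example documents, the keyword list, the relevance labels, and the helper names `keyword_retrieve` and `precision_recall` are all hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re

def keyword_retrieve(documents, keywords):
    """Return indices of documents that contain at least one keyword.

    Matching is case-insensitive and on whole words (an assumption of this sketch).
    """
    patterns = [re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in keywords]
    return [i for i, doc in enumerate(documents) if any(p.search(doc) for p in patterns)]

def precision_recall(retrieved, relevant):
    """Compute precision and recall of a retrieved set against gold labels."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy corpus; documents 0 and 2 are the truly relevant ones.
docs = [
    "New refugee policy debated in parliament",
    "Football results from the weekend",
    "Asylum seekers arrive at the border",
    "Weather: rain expected on the border of the region",
]
relevant = [0, 2]

# A broader list catches both relevant documents but also an irrelevant one.
retrieved_broad = keyword_retrieve(docs, ["refugee", "border"])
precision_broad, recall_broad = precision_recall(retrieved_broad, relevant)

# An incomplete list misses a relevant document entirely.
retrieved_narrow = keyword_retrieve(docs, ["refugee"])
precision_narrow, recall_narrow = precision_recall(retrieved_narrow, relevant)
```

On this toy corpus the broad list trades precision for recall, while the narrow list silently drops a relevant document, which is the kind of selection effect the abstract warns can bias inferences.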


How Forte Transforms the Building of NLP Solution with PyTorch into Assembly Lines

#artificialintelligence

Forte introduces "DataPack", a standardized data structure for unstructured data, distilling good software engineering practices such as reusability, extensibility, and flexibility into PyTorch-based ML solutions. Machine Learning (ML) technologies are now widely used in many day-to-day applications. For example, the systems behind personal assistants like Siri or Alexa are grounded in complex ML technologies, such as Natural Language Processing, Computer Vision, and many more. While the consumer interface of Machine Learning systems may appear simple, the systems behind the scenes can be far more complex. For example, building an intelligent medical information retrieval system requires one to stitch together a diverse set of techniques.


Counterfactual Learning To Rank for Utility-Maximizing Query Autocompletion

arXiv.org Machine Learning

Conventional methods for query autocompletion aim to predict which completed query a user will select from a list. A shortcoming of this approach is that users often do not know which query will provide the best retrieval performance on the current information retrieval system, meaning that any query autocompletion methods trained to mimic user behavior can lead to suboptimal query suggestions. To overcome this limitation, we propose a new approach that explicitly optimizes the query suggestions for downstream retrieval performance. We formulate this as a problem of ranking a set of rankings, where each query suggestion is represented by the downstream item ranking it produces. We then present a learning method that ranks query suggestions by the quality of their item rankings. The algorithm is based on a counterfactual learning approach that is able to leverage feedback on the items (e.g., clicks, purchases) to evaluate query suggestions through an unbiased estimator, thus avoiding the assumption that users write or select optimal queries. We establish theoretical support for the proposed approach and provide learning-theoretic guarantees. We also present empirical results on publicly available datasets, and demonstrate real-world applicability using data from an online shopping store.
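To make the idea of scoring a query suggestion by the item ranking it produces more concrete, here is a minimal sketch of a generic inverse-propensity-scoring (IPS) estimate from click logs. It is not the estimator from the paper: the `1/rank` examination model, the DCG-style position weight, the toy click log, and all function and variable names are assumptions made for illustration.

```python
import math

def position_propensity(rank):
    # Assumed examination model: an item at 1-based position `rank`
    # is seen by the user with probability 1/rank.
    return 1.0 / rank

def dcg_weight(rank):
    # DCG-style weight for the position an item would get in the candidate ranking.
    return 1.0 / math.log2(rank + 1)

def ips_utility(candidate_ranking, logged_sessions):
    """IPS estimate of the utility of `candidate_ranking` from logged clicks.

    Each session is a list of (item, logged_rank, clicked) tuples. Clicks are
    reweighted by the inverse of the propensity of the position at which the
    item was originally shown.
    """
    if not logged_sessions:
        return 0.0
    new_rank = {item: r for r, item in enumerate(candidate_ranking, start=1)}
    total = 0.0
    for session in logged_sessions:
        for item, logged_rank, clicked in session:
            if clicked and item in new_rank:
                total += dcg_weight(new_rank[item]) / position_propensity(logged_rank)
    return total / len(logged_sessions)

# Hypothetical log: items "a" and "c" were clicked at logged positions 1 and 3.
logged_sessions = [[("a", 1, True), ("b", 2, False), ("c", 3, True)]]

# Each query suggestion is represented by the item ranking it would produce.
suggestions = {
    "suggestion A": ["c", "a", "b"],
    "suggestion B": ["b", "a", "c"],
}
utilities = {q: ips_utility(r, logged_sessions) for q, r in suggestions.items()}
best = max(utilities, key=utilities.get)
```

In this sketch the suggestions are ranked by estimated downstream utility rather than by how often users picked them, which mirrors the motivation given in the abstract.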


WriterZen Review - Keyword Research & AI Copywriting Tool

#artificialintelligence

Are you overwhelmed by all the things you need to accomplish to rank in search engines? WriterZen allows you to plan a strategy from topic discovery to keyword research, all the way to writing the content and checking for plagiarism. In this WriterZen review, you'll see what WriterZen is, how it works, and what features it offers, and by the end of this article, you should know if WriterZen is right for you. WriterZen is a complete SEO package that can help you map out a strategy for your SEO. Its set of tools was designed to help you write articles that rank on any search engine, be it Google, Yahoo, Bing, or YouTube.


Distributed Reconstruction of Noisy Pooled Data

arXiv.org Machine Learning

In the pooled data problem we are given a set of $n$ agents, each of which holds a hidden state bit, either $0$ or $1$. A querying procedure returns for a query set the sum of the states of the queried agents. The goal is to reconstruct the states using as few queries as possible. In this paper we consider two noise models for the pooled data problem. In the noisy channel model, the result for each agent flips with a certain probability. In the noisy query model, each query result is subject to random Gaussian noise. Our results are twofold. First, we present and analyze for both error models a simple and efficient distributed algorithm that reconstructs the initial states in a greedy fashion. Our novel analysis pins down the range of error probabilities and distributions for which our algorithm reconstructs the exact initial states with high probability. Secondly, we present simulation results of our algorithm and compare its performance with approximate message passing (AMP) algorithms that are conjectured to be optimal in a number of related problems.
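The noisy query model can be simulated directly, together with a simple score-based greedy reconstruction. This is only a sketch under stated assumptions and not the paper's distributed algorithm: it assumes the number of ones `k` is known, draws each pool uniformly without replacement, scores each agent by the average result of the queries it took part in, and runs centrally. All names and parameter values are hypothetical.

```python
import random

def simulate_queries(states, num_queries, pool_size, noise_sd, rng):
    """Noisy query model: each result is the pool's state sum plus Gaussian noise."""
    n = len(states)
    queries = []
    for _ in range(num_queries):
        pool = rng.sample(range(n), pool_size)  # pool drawn without replacement (assumption)
        result = sum(states[i] for i in pool) + rng.gauss(0.0, noise_sd)
        queries.append((pool, result))
    return queries

def greedy_reconstruct(n, queries, k):
    """Label as 1 the k agents whose queries returned the largest average result."""
    score_sum = [0.0] * n
    count = [0] * n
    for pool, result in queries:
        for i in pool:
            score_sum[i] += result
            count[i] += 1
    avg = [score_sum[i] / count[i] if count[i] else 0.0 for i in range(n)]
    top = sorted(range(n), key=lambda i: avg[i], reverse=True)[:k]
    estimate = [0] * n
    for i in top:
        estimate[i] = 1
    return estimate

# Hypothetical experiment: 50 agents, 10 of them in state 1.
rng = random.Random(0)
states = [1] * 10 + [0] * 40
rng.shuffle(states)
queries = simulate_queries(states, num_queries=400, pool_size=25, noise_sd=1.0, rng=rng)
estimate = greedy_reconstruct(len(states), queries, k=sum(states))
mismatches = sum(s != e for s, e in zip(states, estimate))
print("agents reconstructed incorrectly:", mismatches)
```

Varying `num_queries` and `noise_sd` in this toy setup gives a rough feel for the trade-off the abstract analyzes: how much noise a greedy reconstruction tolerates for a given number of queries.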


How to Increase Your Google Page Speed Score

#artificialintelligence

How many times has your website taken a while to load? How many times have you said, "Meh. Your Google page speed score and your core web vitals are more important than ever. Even if you're making sales right now, it's only a matter of time before your competition decides it's better to be the hare and not the tortoise. All of the great content, social media promotion, and keyword research in the world won't matter if your website is a slug on a rainy day.


Search Engines are Missing Infected Sites, Putting Businesses At Risk

#artificialintelligence

We've all come across warnings when visiting suspicious websites. Your browser or search engine might even block you from entering, displaying a message that this site may harm your device. But what if the site you're trying to visit is not flagged as malicious? According to SiteLock's 2022 Security Report, 92% of infected websites are not blacklisted by search engines. This means that businesses and individuals are vulnerable to attack when they visit these sites.


Breaking Down and Interpreting Human Language -- NLP

#artificialintelligence

From translation software, chatbots, spam filters, and search engines to grammar correction software, voice assistants, and social media monitoring tools, NLP is at the core of the tools we use every day. NLP (Natural Language Processing) tries to make machines that can think and act like humans (don't worry, they won't be as human as humans are). It is used to understand human behavior by feeding the machine syntax, language, accents, and many other forms of sensory data that humans capture. Algorithms then convert, or rather transform, this data into a language that the machine understands, so that the machine learns rules for performing actions and solving problems. So how does NLP work?


8 Best SQL Courses on Coursera

#artificialintelligence

If you want to gain the skills necessary to query big data with modern distributed SQL engines, then this specialization is for you. The best part of this course is that it teaches you a newer breed of SQL engine: the distributed query engines Hive and Impala. Hive and Impala are open-source SQL engines capable of querying enormous datasets. Another advantage is that the specialization provides excellent preparation for the Cloudera Certified Associate (CCA) Data Analyst certification exam. The specialization consists of 3 courses.


The Download: Chatbots could one day replace search engines. Here's why that's a terrible idea.

MIT Technology Review

The world's oceans are amazing carbon sponges, capturing a quarter of human-produced carbon dioxide when surface waters react with the greenhouse gas in the air or marine organisms gobble it up as they grow. Some research groups and start-ups want to help accelerate this natural process by adding certain minerals to the oceans that could help them lock up even more carbon and slow climate change. The idea has attracted a lot of excitement and investment. However, a number of recent studies suggest that some of these approaches may not be as effective as scientists had hoped. That's disappointing news, because the world may need to suck up an additional 10 billion tons of carbon annually by midcentury to limit warming to 2 °C, according to a recent report.