Information Retrieval
User Model-Based Intent-Aware Metrics for Multilingual Search Evaluation
Drutsa, Alexey, Shutovich, Andrey, Pushnyakov, Philipp, Krokhalyov, Evgeniy, Gusev, Gleb, Serdyukov, Pavel
Despite the growing importance of the multilingual aspect of web search, no appropriate offline metrics for evaluating its quality have been proposed so far. At the same time, personal language preferences can be regarded as intents of a query. This approach translates the multilingual search problem into a particular task of search diversification. Furthermore, the standard intent-aware approach could be adopted to build a diversified metric for multilingual search on the basis of a classical IR metric such as ERR. The intent-aware approach estimates user satisfaction under a user behavior model. We show, however, that the underlying user behavior models are not realistic in the multilingual case, and that the resulting intent-aware metric does not appropriately estimate user satisfaction. We develop a novel approach to building intent-aware user behavior models, which overcome these limitations and convert to quality metrics that correlate better with standard online metrics of user satisfaction.
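For orientation, the standard intent-aware construction that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the textbook ERR and ERR-IA definitions, not the novel metric the authors propose. The function names, the 0 to 4 grade scale, and the example language intents and probabilities are assumptions made for the example.

```python
from typing import Dict, List


def err(grades: List[int], max_grade: int = 4) -> float:
    """Expected Reciprocal Rank of one ranked list of relevance grades."""
    p_reach = 1.0  # probability that the user reaches the current rank
    score = 0.0
    for rank, grade in enumerate(grades, start=1):
        # Probability that the document at this rank satisfies the user.
        p_stop = (2 ** grade - 1) / (2 ** max_grade)
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score


def err_ia(
    grades_by_intent: Dict[str, List[int]],
    intent_probs: Dict[str, float],
    max_grade: int = 4,
) -> float:
    """Intent-aware ERR: per-intent ERR weighted by the intent probability."""
    return sum(
        prob * err(grades_by_intent[intent], max_grade)
        for intent, prob in intent_probs.items()
    )


# Hypothetical query whose users prefer Russian (70%) or English (30%) results.
# Each list holds the grades of the same two ranked documents under one intent.
example_grades = {"ru": [4, 0], "en": [0, 4]}
example_probs = {"ru": 0.7, "en": 0.3}
example_score = err_ia(example_grades, example_probs)
```

Here each language preference is treated as a separate intent with its own relevance grades for the same ranking, which is the reduction to search diversification described in the abstract.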
Sources of data for Search Engine
We will mainly focus on the various sources of data that you might have to fetch, or be given, to build a search engine in the first place. Whether you are an enthusiast or have to build a professional search engine from scratch, you have come to the right place! Search engines differ according to their objective, but the core functionality remains the same: information retrieval. Here are some of the sources of data that you might be given or might want to build a search engine for. At heart they are all much the same, but they take quite different approaches to solving the same problem.
Tons of machine learning and data science resources that cost nothing
Tutorials, books, articles, data sets, certifications, you name it. All about data science, machine learning and related topics. You can find them with a simple keyword search: enter the keyword "free" in the DSC's search box, and here are the results. Below is a screenshot of the DSC search results page, for the keyword "free". It shows the top 6 results, out of dozens of highly relevant search results.
Spikes in search engine data predict when drugs will be recalled
Could internet searches identify dodgy drugs? A Microsoft researcher has trained an algorithm to predict whether a drug will be recalled, using queries made through Microsoft's Bing search engine. "We know that every once in a while there will be a batch of a pharmaceutical drug that will have something wrong about it," says Elad Yom-Tov at Microsoft Research in Israel. "People will start asking about that drug more often or more than they usually do." Pharmaceutical companies and regulators such as the US Food and Drug Administration (FDA) monitor drugs on the market to keep tabs on adverse effects and potential faulty batches.
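The idea that people "start asking about that drug more often" than usual can be illustrated with a simple spike detector over daily query counts. This is only a sketch of the general idea under stated assumptions. The article does not describe the researcher's algorithm, and the function name, the 7-day baseline window, the z-score threshold, and the counts below are all hypothetical.

```python
from statistics import mean, stdev
from typing import List


def spike_days(
    daily_counts: List[int], window: int = 7, threshold: float = 3.0
) -> List[int]:
    """Return indices of days whose count is far above the preceding baseline."""
    spikes = []
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            sigma = 1.0  # flat baseline: fall back to a unit scale
        # z-score of today's count relative to the preceding window
        if (daily_counts[day] - mu) / sigma > threshold:
            spikes.append(day)
    return spikes


# Hypothetical daily query counts for one drug name.
example_counts = [10, 12, 11, 9, 10, 11, 10, 40, 11]
example_spikes = spike_days(example_counts)
```

A real system would also have to account for seasonality, news coverage, and overall search volume, none of which this sketch handles.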
ABOUT WEBSAYS - Websays
Websays is the result of 15 years of scientific investigations in Web Crawling, Automatic Learning and Text Analytics. Dr. Hugo Zaragoza, Websays' founder, is a worldwide expert in those technologies. He has worked more than 10 years as a lead researcher in Microsoft and Yahoo! in the United States, England and Spain. In 2010 Dr. Zaragoza founded Websays with the objective of applying the most cutting edge technology in information retrieval and data analytics, including various new patent pending technologies developed by Websays. Websays services focus on online reputation monitoring and social media marketing.
Microsoft researchers detect lung-cancer risks in web search logs - Next at Microsoft
Smoking cigarettes is the leading cause of lung cancer, the most common cause of cancer death in the world. But nearly 20 percent of lung-cancer diagnoses are made in people who are non-smokers. That means in addition to smoking, geographic, demographic and genetic factors play a role in the devastating disease. A project from Microsoft's research labs is exploring the feasibility of using anonymized web search data to learn more about lung-cancer risk factors and provide early warning to people who are candidates for disease screening. The findings, published Thursday in JAMA Oncology, extend research that team members published last June on the feasibility of using the text of questions people ask search engines to predict diagnoses of pancreatic cancer.
Google's Featured Snippets on Desktop Now Written By Artificial Intelligence - Search Engine Journal
According to an article published Tuesday on Wired, Google's featured snippets on desktop will now be rendered completely through artificial intelligence. What are being called "sentence compression algorithms" went live today in desktop search only. These sentence compression algorithms, with help from deep neural networks, are capable of combing through large amounts of data and text to extract just the information you're looking for. So when you ask a question like "what is the best Christmas movie?", Favorite films aside, what you can notice here is that the featured snippet cut through the entire article to render a quick list of exactly what was being searched for.
Google's Hand-Fed AI Now Gives Answers, Not Just Search Results
Ask the Google search app "What is the fastest bird on Earth?," and it will tell you. "Peregrine falcon," the phone says. "According to YouTube, the peregrine falcon has a maximum recorded airspeed of 389 kilometers per hour." That's the right answer, but it doesn't come from some master database inside Google. When you ask the question, Google's search engine pinpoints a YouTube video describing the five fastest birds on the planet and then extracts just the information you're looking for.
Prediction of Video Popularity in the Absence of Reliable Data from Video Hosting Services: Utility of Traces Left by Users on the Web
Drutsa, Alexey, Gusev, Gleb, Serdyukov, Pavel
With the growth of user-generated content, we observe a constant rise in the number of companies, such as search engines and content aggregators, that handle tremendous amounts of web content without being the services that host it. Thus, aiming to locate the most important content and promote it to users, they face the need to estimate the current popularity of content and predict its future popularity. In this paper, we approach the problem of video popularity prediction not from the side of a video hosting service, as done in all previous studies, but from the side of an operating company, which provides a popular video search service that aggregates content from different video hosting websites. We investigate video popularity prediction based on features from three primary sources available to a typical operating company: first, the content hosting provider may deliver its data via its API; second, the operating company makes use of its own search and browsing logs; third, the company crawls information about embeds of a video and links to a video page from publicly available resources on the Web. We show that prediction based on the embed and link data, coupled with the internal search and browsing data, significantly improves on prediction based only on the data provided by the video hosting service, and can even adequately replace the API data when it is partly or completely unavailable.
Automatic Arguments Construction — From Search Engine to Research Engine
Gutfreund, Dan (IBM Research) | Katz, Yoav (IBM Research) | Slonim, Noam (IBM Research)
While discussing a concrete controversial topic, most humans will find it challenging to swiftly raise a diverse set of convincing and relevant arguments. In this paper we present a system that, given a point of view about a controversial topic, automatically generates arguments supporting and contesting it. This is achieved by breaking the task of automatic argument construction into a pipeline of successive modules, each responsible for a specific tangible task such as document retrieval, identifying building blocks of arguments within a document, and analyzing whether these building blocks support or contest the point of view. By providing an interface for humans to interact and intervene at different points in the pipeline, we present an interactive research tool which, for a given topic and a corpus of documents such as Wikipedia or a newspaper archive, provides a more comprehensive view and deeper insights than can be obtained using standard search engines.