The World Wide Web (WWW) abounds with ever-increasing information on many topics. However, since every user has specific information needs and interests, only a tiny part of the WWW is useful to them. For example, in a family, the mother may wish to "find recipes with salmon as the main ingredient", the father may be interested in "what movie to watch tonight?", and the teenage daughter may be wondering "what is artificial intelligence?". To quickly retrieve the relevant information they need, users typically search the Web using a search engine such as Google.
Although it sounds simple, information retrieval is a complex field involving many sub-tasks and applications. According to Gerard Salton, often called "the father of information retrieval", the field is concerned with the structure, analysis, organization, storage, searching, and retrieval of information. Applications include, but are not limited to: web search (i.e., searching the WWW), the most common type; vertical search, where the search is restricted to a specific topic (e.g., searching for shoes within the football topic implies someone looking for football shoes); enterprise search, which involves searching for documents on a corporate intranet; image search, which finds images similar to a given image; product search, which finds products similar to a given product; desktop search, which looks for relevant files on our personal computer; and mobile search, which typically takes location and time into account. Users can be searching for different kinds of items, such as webpages, emails, scholarly papers, books, news stories, or even social profiles. Furthermore, with the advent of new technologies and modalities like virtual reality, the scope of information retrieval is likely to keep growing.
Regardless of the type of search and the kind of item returned, the goal of every information retrieval algorithm is the same: take a search query as input, then quickly find and return a ranked list of relevant items, i.e., items that contain the information the user was looking for. In our family example, the mother may submit the query "find recipes with salmon" and expect an ordered (ranked) list of recipes containing salmon, sorted by how relevant each recipe is to the query. A straightforward approach would be for the retrieval algorithm to simply compare the query text with the recipe text, but this will not always work because of language ambiguity. For example, when someone submits the single-word query "jaguar", it is very difficult for any algorithm to determine whether the user is looking for documents about jaguar the animal or jaguar the vehicle brand. To be effective, an information retrieval system needs to pay special attention to the meaning of queries rather than the actual words used in them.
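To make the simple "compare query text with document text" approach concrete, here is a minimal, stdlib-only sketch that ranks toy recipes against the mother's query using bag-of-words cosine similarity. The recipes and scores are illustrative, and note that this purely lexical matcher is exactly the kind of system the "jaguar" ambiguity defeats.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return cleaned.split()

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    # Score each document against the query; return matches best-first.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in docs]
    return [d for s, d in sorted(scored, key=lambda p: -p[0]) if s > 0]

recipes = [
    "Grilled salmon with lemon and dill",
    "Chocolate cake, vanilla frosting",
    "Salmon pasta in a creamy salmon sauce",
]
print(rank("recipes with salmon", recipes))
```

The chocolate cake shares no terms with the query, so it scores zero and is filtered out; the two salmon recipes are returned, best match first.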
Along with ambiguity, information retrieval faces a number of other important challenges: dealing with unstructured information; taking each user's context and expectations into account when returning results; and scaling (search engines typically index billions of items and must search them almost instantly to answer each user's query, while handling more than a trillion queries per year). Researchers are continuing to address these challenges.
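The data structure at the heart of that indexing step is the inverted index, which maps each term to the set of documents containing it, so a query only touches the documents that share its terms rather than the whole collection. A minimal sketch (toy documents, AND-only queries; real engines add compression, ranking, and distribution across many machines):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of ids of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: intersect the posting lists of all query terms.
    postings = [index.get(t, set()) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = [
    "salmon recipes for dinner",
    "movie reviews for tonight",
    "easy salmon dinner ideas",
]
index = build_index(docs)
print(search(index, "salmon dinner"))  # → [0, 2]
```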
- Pigi Kouki
In healthcare, much of the information relevant to making accurate predictions and recommendations is available only in free-text clinical notes, trapped in unstructured documents. Because this data is needed to make healthcare decisions, it is important to extract it as reliably as possible so that the resulting information can be analyzed and used. State-of-the-art NLP systems extract clinical data from text using deep learning techniques such as healthcare-specific word embeddings, named entity recognition models, and entity resolution models.
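As an illustration only, a toy rule-based extractor with a small made-up lexicon shows the shape of the extraction task; production clinical NLP replaces the dictionary lookup with trained named entity recognition and entity resolution models.

```python
# Hypothetical gazetteer: phrase -> entity type. A real system would use a
# trained NER model rather than a hand-built dictionary.
LEXICON = {
    "hypertension": "CONDITION",
    "type 2 diabetes": "CONDITION",
    "metformin": "DRUG",
    "500 mg": "DOSAGE",
}

def extract_entities(note):
    # Scan the note for known phrases, longest phrase first so that
    # multi-word entities win over any substrings.
    found = []
    text = note.lower()
    for phrase in sorted(LEXICON, key=len, reverse=True):
        start = text.find(phrase)
        while start != -1:
            found.append((phrase, LEXICON[phrase], start))
            start = text.find(phrase, start + 1)
    return sorted(found, key=lambda e: e[2])  # order by position in the note

note = "Patient with type 2 diabetes and hypertension, started on metformin 500 mg."
for phrase, label, pos in extract_entities(note):
    print(f"{pos:3d} {label:10s} {phrase}")
```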
China's state-owned Xinhua News Agency introduced so-called "composite anchors" on Wednesday, combining the images and voices of human anchors with artificial intelligence (AI) technology. The new AI anchors, launched by Xinhua and Beijing-based search engine operator Sogou during the World Internet Conference in Wuzhen, can deliver the news with "the same effect" as human anchors because the machine learning programme is able to synthesise realistic-looking speech, lip movements and facial expressions, according to a Xinhua news report on Wednesday. "AI anchors have officially become members of the Xinhua News Agency reporting team. They will work with other anchors to bring you authoritative, timely and accurate news information in both Chinese and English," Xinhua said. The AI anchors are now available throughout Xinhua's internet and mobile platforms such as its official Chinese and English apps, WeChat public account, and online TV webpage.
Google announced official support for the Schema.org speakable specification. The speakable property will help Google Assistant and Google Home choose which content to read aloud. This new structured data markup is important because it may point to what you'll need to know to get more traffic if and when Google expands this structured data to all websites. Support for the new markup is currently limited to news content, but it will likely expand as Google gains experience with it.
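For reference, a NewsArticle marked up with the speakable property looks roughly like the JSON-LD below; the headline, URL, and CSS selectors are placeholders, and the selectors would point at the parts of the page suitable for text-to-speech.

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Example headline",
  "url": "https://example.com/article",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".article-summary"]
  }
}
```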
For many of our potential guests, planning a trip starts at the search engine. At Airbnb, we want our product to be painless to find for past guests, and easy to discover for new ones. Search engine optimization (SEO) is the process of improving our site (and more specifically our landing pages) to ensure that when a traveller looks for accommodations for their next trip, Airbnb is one of the top results on their favorite search engine. Search engines such as Google, Yahoo, Naver, and Baidu deploy their own fleets of "bots" across the internet to build a map of the web and scrape, or "index", information from the pages that they hit. When indexing pages and ranking them for specific search queries, search engines take into account a variety of factors, including relevance, site performance, and authority.
Automatic document organization, topic extraction, information retrieval, and filtering all have one thing in common: they require text clustering (sometimes also known as document clustering) to be done quickly and accurately. If you've never heard of text clustering, this post will explain what it is, what it does, and how it's currently being used to aid businesses. We'll also briefly discuss how your business could employ text clustering too! First, let's define text clustering.
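To make the idea concrete, here is a toy sketch of one naive clustering strategy: greedy single-pass clustering over bag-of-words cosine similarity, where each document joins the most similar existing cluster or starts a new one. The documents and threshold are made up, and real systems use stronger representations (e.g., TF-IDF or embeddings) and algorithms such as k-means.

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.2):
    # Greedy single pass: compare each document to the first member of each
    # cluster; join the best match above the threshold, else start a cluster.
    clusters = []
    for doc in docs:
        v = vectorize(doc)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, vectorize(c[0]))
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters

docs = [
    "stock markets fall on interest rate fears",
    "interest rate hike rattles stock markets",
    "new salmon recipe with lemon butter",
    "easy salmon recipe for weeknight dinners",
]
for i, c in enumerate(cluster(docs)):
    print(i, c)
```

On this toy data the two finance headlines end up in one cluster and the two recipes in another.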
Google has started rolling out its new Discover feed to US users visiting google.com. Google announced the new feed as part of its 20th-anniversary revamp of search on mobile, which replaces today's clean page containing just a search box with one offering many more suggestions, in line with the Google app for iOS and Android. Google wants the site to surface relevant information for users by predicting what they're interested in rather than waiting for them to type in a search term. As 9to5Google reports, Google's mobile search site now has a feed of cards with suggested content under a topic category with the Discover star icon. Clicking the topic displays more related articles and allows users to follow the topic.
Finding someone in a surveillance video could soon be as easy as Googling them. Descriptions of people of interest, such as a suspect or a missing person, are normally given in terms of their height, gender, or clothing. But using this information to find, say, a short woman wearing a red jacket in a video often requires manually scanning hours of footage, which is no easy task. A new search tool, however, can do it automatically.
If you want to build your blog audience, you're going to have to get smarter with your content. According to Copyblogger, SEO is the most misunderstood topic online. But SEO content isn't complicated, once you understand that people come first, before search algorithms. SEO firms make their money by understanding these simple concepts. Thriving in your online business means that you must go beyond simply "writing content." Your content needs to accomplish two goals: first, appeal to the end-user (customers, clients, prospects, readers, etc.), and second, solve a particular problem. But how do you create content that meets those goals? How do you create content that ranks well with Google and also persuades people? Don't worry if you can't afford an expensive SEO copywriter; you can do this by following a few simple rules. And that's what you're going to learn in this article. We all know what happens when you type a search query into a search engine and hit "enter": you get a list of search results that are relevant to your search term. Those results pages appear as a result of search engine optimization (SEO). In a nutshell, SEO is a method of optimizing (enhancing the effectiveness of) your content for the search engines, in order to help it rank higher than content from other sites that target the same search terms.
With Google's use of AI and machine learning helping it pinpoint, more clearly than ever, the specific factors that satisfy search queries in different niches and contexts, brands and retailers are being encouraged to tightly tailor their search strategies. A new study, "Searchmetrics Google Ranking Factors 2018", reveals, for example, that high-ranking Google results for searches related to the 'weight loss' niche are 4x more likely to have a video on the page than results for the 'financial planning' or 'credit' niches. And eCommerce sites in the 'furniture' niche can get away with displaying nearly 28 images on a page (more than most other niches) and still rank highly, despite the fact that more images can sometimes make pages load slower. According to Jordan Koene, Chief Evangelist and VP Professional Services at Searchmetrics, Google's use of sophisticated AI and machine-learning techniques, such as its RankBrain system, helps it to better understand the real intention behind the words that searchers enter in the search box – and learn what types of web pages will satisfy individual searches. "Google now recognizes much more clearly if someone's searching online to buy a table, for instance, or needs personal finance advice or wants to learn weight loss exercises. And by tracking user signals such as how often certain results are clicked and how long people spend there, the search engine learns what factors – such as more or less images or text, or whether a site uses encryption to protect personal information entered by visitors – are appropriate for satisfying searchers in individual niches."
Neyshabur and Srebro proposed Simple-LSH, the state-of-the-art hashing method for maximum inner product search (MIPS) with a performance guarantee. We found that the performance of Simple-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose Norm-ranging LSH, which addresses the excessive-normalization problem caused by these long tails by partitioning the dataset into multiple sub-datasets and building a hash index for each sub-dataset independently. We prove that Norm-ranging LSH has lower query time complexity than Simple-LSH, and we show that the idea of partitioning the dataset can also improve other hashing-based methods for MIPS. To support efficient query processing on the hash indexes of the sub-datasets, a novel similarity metric is formulated. Experiments show that Norm-ranging LSH achieves an order-of-magnitude speedup over Simple-LSH at the same recall, significantly benefiting applications that involve MIPS.
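A rough sketch of the Simple-LSH reduction the paper builds on, with toy random data: every item is scaled into the unit ball by the maximum norm M and padded with one extra coordinate so it lies on the unit sphere, turning MIPS into angular search. The long-tail problem is visible in M itself: a few large-norm items inflate M, shrinking everyone else toward zero so the padded coordinates dominate, which is the excessive normalization that Norm-ranging LSH counters by partitioning into sub-datasets.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simple_lsh_transform(items, query):
    # Simple-LSH (Neyshabur & Srebro): scale every item into the unit ball,
    # then append one coordinate so each item lies exactly on the unit sphere.
    M = max(norm(x) for x in items)  # a single long-tail norm inflates M
    P = [[xi / M for xi in x] + [math.sqrt(max(0.0, 1 - (norm(x) / M) ** 2))]
         for x in items]
    nq = norm(query)
    Q = [qi / nq for qi in query] + [0.0]  # query padded with 0, unit norm
    return P, Q

random.seed(0)
items = [[random.gauss(0, 1) for _ in range(8)] for _ in range(100)]
query = [random.gauss(0, 1) for _ in range(8)]

P, Q = simple_lsh_transform(items, query)
# dot(P[i], Q) = dot(items[i], query) / (M * ||query||), a monotone map,
# so the best item by inner product is preserved by the transform.
best_original = max(range(100), key=lambda i: dot(items[i], query))
best_transformed = max(range(100), key=lambda i: dot(P[i], Q))
print(best_original == best_transformed)  # → True
```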