The World Wide Web (WWW) abounds with ever-increasing information on many topics. However, since every user has specific information needs and interests, only a tiny part of the WWW is useful to them. For example, in a family, a mother may wish to "find recipes with salmon as the main ingredient", the father may be interested in "what movie to watch tonight?", and the teenage daughter may be wondering "what is artificial intelligence?". In order for humans to quickly ‘retrieve’ relevant information of interest, they usually search the Web using a search engine such as Google.
Although it sounds simple, information retrieval is a complex field involving many sub-tasks and applications. According to "the father of information retrieval", Gerard Salton, information retrieval is the field concerned with the structure, analysis, organization, storage, searching, and retrieval of information. Web search (i.e., searching the WWW) is the most common application, but there are many others: vertical search, where the search is restricted to a specific topic (e.g., searching for shoes within the football topic implies someone looking for football shoes); enterprise search, which involves searching for documents in a corporate intranet; image search, which finds images similar to a given image; product search, which finds products similar to a given product; desktop search, which locates relevant files on our personal computer; and mobile search, which typically takes location and time into account. Users can be searching for different kinds of items, such as webpages, emails, scholarly papers, books, news stories, or even social profiles. Furthermore, with the advent of new technologies and modalities like virtual reality, the scope of information retrieval will likely only increase with time.
Regardless of the type of search and the type of item returned, the goal of every information retrieval algorithm is to take a search query as input and to quickly find and output a ranked list of relevant items, i.e., items that contain the information the user was looking for. For example, in our family example, the mother may submit a query of the form "find recipes with salmon", and the expected result is an ordered (ranked) list of recipes containing salmon, ordered by how relevant each recipe is to the query. Although a straightforward approach would be for a retrieval algorithm to simply compare the query text with the recipe text, this approach will not always work due to language ambiguity. For example, when someone submits a query containing the single word "jaguar", it is very difficult for any algorithm to determine whether the user is looking for documents about jaguar the animal or Jaguar the vehicle brand. To be effective, an information retrieval system needs to pay special attention to the meaning of queries rather than just the actual words used in them.
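The ranked retrieval described above is not specified in the article, but one classic way to score documents against a query is TF-IDF weighting with cosine similarity. The sketch below is a minimal, illustrative implementation; the toy recipe corpus and function names are my own.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute a TF-IDF weight vector for each document in a small corpus."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Weight = term frequency * inverse document frequency.
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors, df, N

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Return document indices ordered by similarity to the query (most relevant first)."""
    vectors, df, N = tf_idf_vectors(docs)
    q_tf = Counter(query.lower().split())
    q_vec = {t: q_tf[t] * math.log(N / df[t]) for t in q_tf if t in df}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]

recipes = [
    "grilled salmon with lemon and dill",
    "rich chocolate cake and vanilla frosting",
    "creamy salmon pasta bake",
]
print(rank("recipes with salmon", recipes))  # → [0, 2]
```

Both salmon recipes are returned ahead of the cake, with the recipe matching more query terms ranked first; note this purely lexical scheme still cannot resolve ambiguities like "jaguar", which is why meaning-aware methods matter.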
Along with ambiguity, information retrieval faces a number of other important challenges: dealing with unstructured information, taking each user's context and expectations into account when returning results, and scaling (search engines typically index billions of items and search them almost instantly in order to answer each user's query, while handling more than a trillion queries per year). Researchers are continuing to address these challenges.
- Pigi Kouki
Anyone who's spent even a little time on the internet lately may feel like there's too much "debate", much of which descends into ad hominem insults. But this has not deterred IBM from pursuing "Project Debater," which was first tested on the world stage in San Francisco in June. Naturally, when you think of IBM and AI, you think of Watson, the program capable of answering natural-language questions that famously beat human contestants at Jeopardy! in 2011. IBM sees Project Debater as a natural successor to Watson. Both are, in turn, a natural progression from the search engine.
In this week's Search In Pictures, here are the latest images culled from the web, showing what people eat at the search engine companies, how they play, who they meet, where they speak, what toys they have and more.
French startup Doctrine is raising an $11.6 million funding round (€10 million) from existing investors Otium Venture and Xavier Niel. Doctrine is building a search engine for court decisions and other legal texts. This is a key tool if you're a lawyer or working in the legal industry in general. There are now a thousand companies using the service. It currently costs around €129 per user per month.
Senzing, a new artificial intelligence (AI) software company, announced its Senzing software product to address the $14.37 billion financial fraud market. Senzing is an IBM spinout that has reinvented entity resolution, which senses who is who in real time across multiple big data sources. Senzing is disrupting the fraud solutions market by offering the first real-time, plug-and-play AI entity resolution software product for fraud detection, insider threats and more. Now, any company can deploy Senzing to quickly and effectively detect bad actors in their big data. Senzing uses entity-centric learning and other unique techniques to pierce through falsified identities and networks to find criminals.
In the world of SEO, it's important to understand the system you're optimizing for. Another crucial area to understand is machine learning. Now, the term "machine learning" gets thrown around a lot these days. But how does machine learning actually impact search and SEO? This article will explore everything you need to know about how search engines use machine learning.
As trading professionals face growing quantities of data sources to perform predictive analytics, finding new ways to efficiently make use of that information and identify critical data points that can drive trading strategies is a challenge. The recipient of this year's Best Big Data Analytics & Technology Provider award is software technology firm AlphaSense, which is on a mission to empower users with greater insights. AlphaSense provides investment and knowledge professionals with a search engine, enabling them to wade through large volumes of data and find information in seconds--instead of hours. Julie Craig, vice president of marketing at AlphaSense, says it aggregates millions of relevant documents and applies intelligent search technology to help customers find critical information almost instantaneously. For the judges, its services are what "make AlphaSense a winner," and they add that "it has its finger on the pulse of (hidden) disclosures."
The power of modern search engines is undeniable: you can summon knowledge from the internet at a moment's notice. However, search is often relegated to strict keyword matching, and when the objects aren't text, search may not be available at all. Furthermore, strict keyword search doesn't allow the user to search semantically, which makes information less discoverable. Today, we share a reproducible, minimally viable product that illustrates how you can enable semantic search for arbitrary objects! Concretely, we will show you how to create a system that searches python code semantically -- but this approach can be generalized to other entities (such as pictures or sound clips).
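The snippet above refers to a full system built on a trained neural encoder; the sketch below only illustrates the shape of the pipeline. It embeds queries and items into a shared vector space and ranks by cosine similarity, using a tiny hand-made lookup table as a stand-in for the learned embedding model (the vectors, vocabulary, and snippets are entirely illustrative, not the article's).

```python
import math

# Toy stand-in for a learned embedding model. In a real semantic search
# system, a trained encoder maps queries and code into a shared vector
# space; here a hand-made table gives related words nearby vectors.
TOY_VECTORS = {
    "sort":  [1.0, 0.0, 0.0],
    "order": [0.9, 0.1, 0.0],  # deliberately close to "sort"
    "read":  [0.0, 1.0, 0.0],
    "load":  [0.0, 0.9, 0.1],  # deliberately close to "read"
    "file":  [0.0, 0.5, 0.5],
    "list":  [0.5, 0.0, 0.5],
}

def embed(text):
    """Average the vectors of known tokens; a real system uses a trained encoder."""
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_search(query, snippets):
    """Rank snippets by embedding similarity to the query, best match first."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(s)), s) for s in snippets), reverse=True)
    return [s for score, s in scored if score > 0]

snippets = ["sort list", "read file"]
print(semantic_search("order items", snippets))  # → ['sort list', 'read file']
```

Note that the query "order items" shares no keyword with "sort list", so strict keyword search would miss it entirely; the embedding space is what makes the match possible, which is the point of the approach described above.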
Google has released new details about its spam fighting efforts, revealing that more than 80% of hacked sites have been detected and removed from search results. The search giant plans to continue its efforts by working directly with popular content management systems to fight back against those who compromise forums and comment sections with spam. "Last year, we focused a great deal of effort on reducing the impact on users from hacked websites, and were able to detect and remove more than 80 percent of compromised sites from search results. We're also working closely with many providers of popular content management systems like WordPress and Joomla to help them fight spammers that abuse forums and comment sections." Here are some other notable stats from Google's recent announcement.