
What is a Web Crawler? (In 50 Words or Less)


If you've spent time in online marketing (or are simply curious about it), then the topic of search engine optimization (SEO) usually comes up. And no discussion of anything search-engine-related is complete these days without some mention of Google.
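At its core, a web crawler fetches a page, extracts the links on it, and feeds those links back into a queue of URLs to visit. A minimal sketch of the link-extraction step using only the Python standard library (the sample HTML and URLs below are made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links like "/about" against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
print(extract_links(page, "https://example.com/"))
# → ['https://example.com/about', 'https://example.org/']
```

A real crawler would wrap this in a fetch loop with a visited set, politeness delays, and robots.txt checks; the extraction step stays the same.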



This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). The current version can crawl and download images from Google Search and Flickr Search through the official APIs. More search engines will be added later (e.g., Bing, Yahoo). This package can be used in different ways depending on what you want to do (a complete example can be found in ...). Note that in this case the program will consume 6 queries from your Google Search Engine. This program will read all .jpg ...
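The excerpt doesn't show the package's actual API, but the download step it describes can be sketched with the standard library alone. This is an illustrative sketch, not the package's code; the User-Agent string and output directory are arbitrary choices:

```python
import os
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def jpg_urls(urls):
    """Keep only URLs whose path ends in .jpg (case-insensitive)."""
    return [u for u in urls if urlparse(u).path.lower().endswith(".jpg")]


def local_name(url, out_dir):
    """Derive a local file path from an image URL."""
    name = os.path.basename(urlparse(url).path) or "image.jpg"
    return os.path.join(out_dir, name)


def download(url, out_dir="images"):
    """Fetch one image and write it to out_dir (requires network access)."""
    os.makedirs(out_dir, exist_ok=True)
    req = Request(url, headers={"User-Agent": "dataset-builder/0.1"})
    with urlopen(req) as resp, open(local_name(url, out_dir), "wb") as f:
        f.write(resp.read())
```

In practice the URL list would come from the Google or Flickr search APIs, each of which has its own authentication and per-day query quota (the "6 queries" the excerpt mentions).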

Databook: Turning Big Data into Knowledge with Metadata at Uber


We needed to collect metadata frequently and at scale without blocking other crawler tasks; to do this, we deployed our crawlers to different machines, which required effective coordination between crawlers in a distributed manner. We considered configuring Quartz in clustering mode for distributed scheduling (backed by MySQL). However, we faced two blockers that prevented us from implementing this solution: first, running Quartz in clustering mode on multiple machines requires the Quartz clocks to be synced periodically, adding an external dependency, and second, we experienced constant MySQL connection instability after the schedulers started. As a result, we ruled out running Quartz in clustering mode.
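For reference, the setup being ruled out here — Quartz clustering over a shared MySQL-backed JDBC job store — is typically enabled through a `quartz.properties` file along these lines (the data-source URL and credentials below are placeholders, not Uber's configuration):

```properties
# Each node needs a unique, auto-assigned instance id to join the cluster.
org.quartz.scheduler.instanceId = AUTO

# Shared JDBC job store; all nodes point at the same database.
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000

# Placeholder MySQL data source.
org.quartz.dataSource.myDS.driver = com.mysql.cj.jdbc.Driver
org.quartz.dataSource.myDS.URL = jdbc:mysql://localhost:3306/quartz
org.quartz.dataSource.myDS.user = quartz
org.quartz.dataSource.myDS.password = changeme
```

The clock-sync requirement the excerpt mentions follows from the check-in mechanism: nodes detect failed peers by comparing check-in timestamps in the shared database, so skewed clocks can trigger spurious failovers.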

Quantati: AI-powered data extraction APIs — don't code another crawler (Product Hunt)


Hello hunters! Five months ago we were working full time maintaining a price tracking platform scraping billions of pages yearly. At scale, websites change and crawlers break daily. Maintaining them is time-consuming and painful. We set out to create smart crawlers that would work across any web page with no manual intervention. A couple of months later, we are super excited to show you Quantati: a collection of APIs to extract data from web pages automatically.

Reimagine knowledge discovery using Amazon Kendra's Web Crawler


When you deploy intelligent search in your organization, two important factors to consider are access to the latest and most comprehensive information, and a contextual discovery mechanism. Many companies are still struggling to make their internal documents searchable in a way that allows employees to get relevant knowledge in a scalable, cost-effective manner. A 2018 International Data Corporation (IDC) study found that data professionals are losing 50% of their time every week: 30% searching for, governing, and preparing data, plus 20% duplicating work. Amazon Kendra is purpose-built for addressing these challenges. Amazon Kendra is an intelligent search service that uses deep learning and reading comprehension to deliver more accurate search results.