C.2.6 Discuss the use of parallel web-crawling


  • The size of the web keeps growing, increasing the time it would take a single process to download the pages
  • To keep crawling feasible, "it becomes imperative to parallelize the crawling process" (Stanford)

Advantages

  • Scalability: as the web grows, a single crawling process cannot handle everything; multithreaded processing can solve the problem (see the thread-pool sketch after this list)
  • Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
  • Network load reduction: a crawler running close to the sites it downloads keeps traffic on nearby networks, improving scalability, efficiency and throughput
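
As a rough illustration of the multithreading point above, the sketch below downloads a handful of pages concurrently from one crawl process using a thread pool. The seed URLs, worker count and timeout are illustrative assumptions, not values taken from any real crawler.

```python
# Minimal sketch: one crawl process using a thread pool so that several
# pages download concurrently instead of one after another.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

SEED_URLS = [                      # hypothetical seed set
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

def fetch(url: str) -> tuple[str, int]:
    """Download one page and return its URL and size in bytes."""
    with urlopen(url, timeout=10) as response:
        return url, len(response.read())

# Eight worker threads issue downloads in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, SEED_URLS):
        print(f"{url}: {size} bytes")
```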

Issues of parallel web crawling

  • Overlapping: parallel web crawlers might download and index the same page multiple times (see the partitioning sketch after this list)
  • Quality: if a crawler wants to download 'important' pages first, a parallel process that only sees its own portion of the web may make poorer decisions about which pages matter
  • Communication bandwidth: to address the issues above, parallel crawlers need to communicate with each other, and with many processes this coordination can consume significant bandwidth
  • Server overload: if parallel crawlers request the same page, or pages from the same server, too frequently over a short period, they can overload that server
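
One common way to limit both overlap and coordination traffic is to partition the URL space statically, for example by hashing the host name so that each crawler process "owns" a disjoint set of hosts. The sketch below assumes four processes and a hypothetical assigned_crawler() helper; it illustrates the idea rather than any particular system.

```python
# Minimal sketch: assign every URL to exactly one crawler process by hashing
# its host name, so no two processes ever download the same page.
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # hypothetical number of parallel crawl processes

def assigned_crawler(url: str) -> int:
    """Return the index of the crawler process responsible for this URL's host."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

# A process keeps only the URLs it owns; URLs it discovers but does not own
# are handed to their owner, which is where communication bandwidth is spent.
print(assigned_crawler("https://example.com/page1"))  # same host ...
print(assigned_crawler("https://example.com/page2"))  # ... same crawler
print(assigned_crawler("https://example.org/"))       # possibly a different one
```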

Discuss the use of parallel web crawling

A crawler is a program that downloads and stores Web pages, often for a Web search engine. Roughly, a crawler starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. Collected pages are later used for other applications, such as a Web search engine or a Web cache. As the size of the Web grows, it becomes more difficult to retrieve the whole or a significant portion of the Web using a single process. Therefore, many search engines often run multiple processes in parallel to perform the above task, so that download rate is maximized (reference: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.8408&rep=rep1&type=pdf)
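
A minimal sketch of the loop just described, assuming a FIFO queue as the frontier, a simple regular-expression link extractor, and an arbitrary page limit as the stopping condition (the crawl() name and its defaults are illustrative):

```python
# Minimal sketch of the crawl loop: seed the queue, download, extract links,
# enqueue new URLs, repeat until a stopping condition is met.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed: str, max_pages: int = 50) -> dict[str, str]:
    frontier = deque([seed])       # S0: the initial set of URLs
    seen = {seed}
    stored = {}                    # url -> page text, for later use (search, cache)
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()   # take the next URL in FIFO order
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue               # skip pages that fail to download
        stored[url] = html
        # Extract href targets and queue any URL not seen before.
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return stored
```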

Why search engines take the quality approach (dated)

According to a study released in October 2000, the directly accessible "surface web" consists of about 2.5 billion pages, while the "deep web" (dynamically generated web pages) consists of about 550 billion pages, 95% of which are publicly accessible [LVDSS00].

By comparison, the Google index released in June 2000 contained 560 million full-text-indexed pages [Goo00]. In other words, Google — which, according to a recent measurement [HHMN00], has the greatest coverage of all search engines — covers only about 0.1% of the publicly accessible web, and the other major search engines do even worse.

Increasing the coverage of existing search engines by three orders of magnitude would pose a number of technical challenges, both with respect to their ability to discover, download, and index web pages, and with respect to their ability to serve queries against an index of that size. (For query engines based on inverted lists, the cost of serving a query is linear in the size of the index.) Therefore, search engines should attempt to download the best pages and include (only) them in their index.
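
To see why serving a query against an inverted-list index costs time roughly proportional to the index size, consider a toy inverted index: each query term is answered by scanning its posting list, and posting lists grow as more pages are indexed. The sketch below is a deliberately simplified illustration, not how a production engine stores its index.

```python
# Toy inverted index: term -> list of document ids containing that term.
from collections import defaultdict

index: dict[str, list[int]] = defaultdict(list)

def add_document(doc_id: int, text: str) -> None:
    """Index a document by recording its id in each term's posting list."""
    for term in set(text.lower().split()):
        index[term].append(doc_id)

def query(term: str) -> list[int]:
    # Answering a query means walking the posting list; its length grows
    # with the number of indexed documents, hence the linear cost.
    return index.get(term, [])

add_document(1, "parallel web crawling")
add_document(2, "web search engines")
print(query("web"))  # [1, 2]
```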

Search Methods

Breadth-first, depth-first, backlink count, PageRank, and random
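
The chosen metric decides the order in which URLs leave the frontier. A minimal sketch with made-up URLs and scores: breadth-first pops first-in first-out, depth-first pops last-in first-out, and backlink-count or PageRank ordering pops the highest-scored URL first.

```python
# How the frontier data structure encodes the crawl order.
import heapq
from collections import deque

# Breadth-first: first-in, first-out.
bfs_frontier = deque(["/a", "/b", "/c"])
print(bfs_frontier.popleft())              # "/a" is crawled first

# Depth-first: last-in, first-out.
dfs_frontier = ["/a", "/b", "/c"]
print(dfs_frontier.pop())                  # "/c" is crawled first

# Backlink count / PageRank ordering: highest score first.
# heapq is a min-heap, so scores are negated to pop the largest.
scored_frontier = [(-12, "/popular"), (-3, "/obscure"), (-7, "/average")]
heapq.heapify(scored_frontier)
print(heapq.heappop(scored_frontier)[1])   # "/popular" is crawled first

# Random ordering would simply pick an arbitrary element from the frontier.
```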

Mercator is an extensible, multithreaded, high-performance web crawler [HN99, Mer00]. It is written in Java and is highly configurable. Its default download strategy is to perform a breadth-first search of the web, with the following three modifications:

  1. It downloads multiple pages (typically 500) in parallel. This modification allows us to download about 10 million pages a day; without it, we would download well under 100,000 pages per day.
  2. Only a single HTTP connection is opened to any given web server at any given time. This modification is necessary due to the prevalence of relative URLs on the web (about 80% of the links on an average web page refer to the same host), which leads to a high degree of host locality in the crawler's download queue. If we were to download many pages from the same host in parallel, we would overload or even crash that web server.
  3. If it took t seconds to download a document from a given web server, then Mercator will wait for 10t seconds before contacting that web server again. This modification is not strictly necessary, but it further eases the load our crawler places on individual servers on the web. We found that this policy reduces the rate of complaints we receive while crawling. (A simplified sketch of this politeness policy follows the list.)
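
A simplified sketch of the politeness rules in points 2 and 3 above, assuming a single-threaded caller and a hypothetical polite_fetch() helper; Mercator's real implementation is multithreaded and considerably more involved.

```python
# Minimal sketch: never contact a host again until 10 * (last download time)
# seconds have passed, which also serializes requests to that host here.
import time
from urllib.parse import urlparse
from urllib.request import urlopen

next_allowed: dict[str, float] = {}   # host -> earliest time it may be contacted again

def polite_fetch(url: str) -> bytes:
    host = urlparse(url).netloc
    wait = next_allowed.get(host, 0.0) - time.monotonic()
    if wait > 0:
        time.sleep(wait)              # respect the host's back-off window
    start = time.monotonic()
    data = urlopen(url, timeout=10).read()
    elapsed = time.monotonic() - start
    next_allowed[host] = time.monotonic() + 10 * elapsed   # wait 10t before the next request
    return data
```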
