Wednesday, May 7, 2008

Crawling is harder than it looks

The best paper award at WWW 2008 went to a paper on large-scale crawling titled "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov.

I have never been that interested in crawling, so I almost missed this talk. But, I am glad I didn't.

The paper is a fascinating look at the difficulty of doing very large scale web crawling, focusing on avoiding web spam and pathological link structures that could trap a crawler, making sure to not overwhelm crawled webservers, and using high performance techniques for duplicate detection.

Some extended excerpts:
The web has changed significantly since the days of early crawlers, mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions ... [prior] algorithms ... no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second) ... [A] new technique called Disk Repository with Update Management (DRUM) ... can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort ... DRUM can be thousands of times faster than prior disk-based methods.

In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains.

We ran IRLbot on a [single] quad-CPU AMD Opteron 2.6 GHz server (16 GB RAM, 24-disk RAID-5) attached to a 1-gb/s link ... [for a] total active crawling span of 41.27 days. During this time, IRLbot attempted 7,606,109,371 connections and received 7,437,281,300 valid HTTP replies ... IRLbot ended up with N = 6,380,051,942 responses with status code 200 and content-type text/html.

The average download rate during this crawl was 319 mb/s (1,789 pages/s) with the peak 10-minute average rate of 470 mb/s (3,134 pages/s). The crawler received 143 TB of data, out of which 254 GB were robots.txt files, and transmitted 1.8 TB of HTTP requests. The parser processed 161 TB of HTML code.
At very large scale and from a single box, those are pretty remarkable crawling rates, 2k pages/second. The details of the bottlenecks they encountered and how they overcame them made this paper a quite enjoyable read.

I did end with a question about part of the work. The authors say (in 7.1 and Figure 6) that the probability of finding a unique page never dropped below .11 and, earlier (in 1.4) say that "we believe a good fraction of the 35B URLs not crawled in this experiment [contain] useful content." However, they define a unique page as the same URL, so it seems like it could easily be the case that those 35B URLs seen but not crawled could have duplicate or near duplicate content to the 6B crawled.

Update: One topic the IRLbot paper ignored was how frequently we should recrawl a page. Another WWW 2008 paper, "Recrawl Scheduling Based on Information Longevity" (PDF) by Chris Olston and Sandeep Pandey, has a nice overview of that issue and then extends the prior work by focusing on the persistence of a web page. The authors end up arguing that it is not worth recrawling pages that change very rapidly very frequently because it is impossible to ever get the index to match the actual content of the page.

Update: On the question of what content is useful, both in terms of crawling and recrawling, I always find myself wondering how much we should care about pages that never get visited by anyone. Shouldn't our focus in broadening a crawl beyond 6B pages and in recrawling pages in our index depend on how much users seem to be interested in the new content?

No comments:

Post a Comment