Rootsets are the initial pages that a Crawler has been fed with to download. A simple crawl may start with only a single page, while a more clever crawl may start with a page that has outgoing links to many parts of the Web (eg the OpenDirectoryProject or the Yahoo Directory).

If a crawl is a repeat crawl, then ideally it should be seeded with the logs from the previous crawl - ie it will attempt to download all the pages discovered during the previous crawl. This minimises the DiscoveryPeriod - the time taken to discover a given host or page during a crawl.
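A minimal sketch of extracting such a seed list from a previous crawl's logs. The log format here is an assumption (one discovered URL per line); a real crawler's logs would need their own parser:

```python
from urllib.parse import urlparse

def seeds_from_log(lines):
    """Collect unique, well-formed http(s) URLs from previous-crawl log lines."""
    seen = set()
    seeds = []
    for line in lines:
        url = line.strip()
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds

# Hypothetical log excerpt:
log = [
    "http://example.ac.uk/",
    "http://example.ac.uk/about",
    "http://example.ac.uk/",   # duplicate - skipped
    "not a url",               # malformed - skipped
]
print(seeds_from_log(log))
# → ['http://example.ac.uk/', 'http://example.ac.uk/about']
```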

However, if you seed a large crawl with the URLs of a smaller crawl, then it is likely that you will destroy the BreadthFirstCrawling or high PageRank pattern. To explain: starting from a single small rootset, a breadth-first crawl is likely to fetch pages of high importance first. In contrast, when the rootset is every URL from a previous smaller crawl, that URL ordering is destroyed, and more links will be discovered to lower priority sites.
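The effect can be illustrated with a toy breadth-first crawl over a hypothetical link graph (the page names below are illustrative only): seeded from a single hub, the hub and top-level sites come out of the queue before leaf pages, whereas a flat rootset of previously discovered URLs mixes leaf pages into the front of the queue.

```python
from collections import deque

# Toy link graph: a hub page links to sites, which link to leaf pages.
links = {
    "hub": ["siteA", "siteB"],
    "siteA": ["siteA/page1", "siteA/page2"],
    "siteB": ["siteB/page1"],
    "siteA/page1": [], "siteA/page2": [], "siteB/page1": [],
}

def crawl_order(seeds):
    """Return the breadth-first download order given an initial rootset."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier:
        page = frontier.popleft()
        visited.append(page)
        for out in links.get(page, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return visited

# Single-seed crawl: important (high-level) pages are fetched first.
print(crawl_order(["hub"]))
# → ['hub', 'siteA', 'siteB', 'siteA/page1', 'siteA/page2', 'siteB/page1']

# Seeding with every URL from a previous crawl flattens that ordering:
# leaf pages are fetched before the hub.
print(crawl_order(["siteA/page2", "siteB/page1", "hub", "siteA"]))
```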

In preparation for our .uk crawls, I have been investigating the following resources for rootset discovery:

last edited 2005-01-19 16:45:43 by CraigMacdonald