Rootsets

Rootsets are the initial pages that a Crawler is fed to download. A simple crawl may start from only one page (eg www.gla.ac.uk/), whereas a cleverer crawl may start from a page with many outgoing links into many parts of the crawl (eg the OpenDirectoryProject - dmoz.org; or the Yahoo Directory).
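A simple crawl of this kind can be sketched as a breadth-first traversal whose FIFO frontier is initialised with the rootset. The sketch below uses a small hypothetical in-memory link graph (the `LINKS` dict and its URLs are illustrative stand-ins for real HTTP fetches and link extraction):

```python
from collections import deque

# Hypothetical link graph standing in for real page downloads.
LINKS = {
    "www.gla.ac.uk/": ["www.gla.ac.uk/about", "www.gla.ac.uk/research"],
    "www.gla.ac.uk/about": ["www.gla.ac.uk/"],
    "www.gla.ac.uk/research": ["www.gla.ac.uk/groups"],
    "www.gla.ac.uk/groups": [],
}

def breadth_first_crawl(rootset):
    """Visit pages in breadth-first order starting from the rootset."""
    frontier = deque(rootset)        # FIFO queue gives breadth-first ordering
    seen = set(rootset)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)            # "download" the page
        for link in LINKS.get(url, []):
            if link not in seen:     # never queue a page twice
                seen.add(link)
                frontier.append(link)
    return order
```

Because the frontier is a queue, pages close to the rootset (which tend to be the more important ones) are downloaded before pages deep in the link graph.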

If a crawl is a repeat crawl, then ideally it should be seeded with the logs from the previous crawl - ie it will attempt to download all the pages discovered during the previous crawl. This minimises the DiscoveryPeriod - the time taken to discover a given host or page during a crawl.
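Seeding from a previous crawl can be sketched as reading the old log and turning its discovered URLs into the new rootset. The log format here is a hypothetical one-URL-per-line file; real crawl logs would need their own parsing:

```python
def seed_from_log(log_lines):
    """Build a rootset from the URLs discovered in a previous crawl's log.

    Assumes (hypothetically) one URL per log line; blank lines and
    duplicates are dropped while first-seen order is preserved.
    """
    seen = set()
    rootset = []
    for line in log_lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            rootset.append(url)
    return rootset
```

The resulting list can be handed straight to the crawler as its rootset, so every host from the previous crawl is known from the start rather than rediscovered part-way through.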

However, if you seed a large crawl with the URLs of a smaller crawl, you are likely to destroy the BreadthFirstCrawling or high-PageRank pattern. To explain: a crawl starting from a single small point (eg www.gla.ac.uk/) tends to reach high-importance pages first. In contrast, when a crawl is seeded with the rootset of a previous smaller crawl, that URL ordering is lost, and more links will be discovered to lower-priority sites.

In preparation for our .uk crawls, I have been investigating the following resources for rootset discovery:

last edited 2005-01-19 16:45:43 by CraigMacdonald