Rootsets are the initial pages a Crawler is fed to start downloading from. A simple crawl may start with just one page (e.g. www.gla.ac.uk/), while a cleverer crawl may start with a page that has many outgoing links into many parts of the space to be crawled (e.g. the OpenDirectoryProject - dmoz.org; or Yahoo Directory).
If a crawl is a repeat crawl, then ideally it should be seeded with the logs from the previous crawl - i.e. it will attempt to download all the pages it discovered during the previous crawl. This minimises the DiscoveryPeriod - the time taken to discover a given host or page during a crawl.
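Extracting a seed list from a previous crawl's logs can be sketched as below. This is only an illustration: it assumes a hypothetical one-URL-per-line log format, and real crawler logs vary, so the parsing would need adjusting.

```python
def seed_from_previous_crawl(log_lines):
    """Extract unique URLs from a previous crawl's log lines.

    Assumes a hypothetical one-URL-per-line format; adjust the
    parsing to match whatever your crawler actually writes.
    """
    seeds = set()
    for line in log_lines:
        url = line.strip()
        # Keep only lines that look like URLs; skip log noise.
        if url.startswith(("http://", "https://")):
            seeds.add(url)
    return sorted(seeds)
```

The resulting list can then be fed to the crawler as its rootset, so every host seen last time is reachable from the very start of the new crawl.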
However, if you seed a large crawl with the URLs of a smaller crawl, you are likely to destroy the BreadthFirstCrawling or high PageRank pattern. To explain: starting from a single small point (e.g. www.gla.ac.uk/) tends to crawl high-importance pages first. In contrast, starting with the rootset of a previous smaller crawl, that URL ordering is lost, and links to lower-priority sites are discovered earlier.
In preparation for our .uk crawls, I have been investigating the following resources for rootset discovery:
DNS Zone transfer for .ac.uk; .uk - generally speaking not possible: See http://www.circleid.com/article/326_0_1_0_C/
Request of .co.uk database from Nominet; .ac.uk from JANET etc
Proxy server logs - privacy issues
Previous crawls (gla.ac.uk; glasgow unis.ac.uk; scot unis.ac.uk)
There has been some recent Slashdot discussion that MSN has been seeding their crawler with Google results for each domain - e.g. using site:domain.com to obtain URLs. For restricted domain crawls, this would give at most 1000 URLs. However, repeating the site query for each distinct hostname would allow a rootset of up to 1000 URLs per hostname in the domain. See http://slashdot.org/articles/04/11/11/1724221.shtml and http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20041111MicrosoftCrawlingGoogleResultsForNewSearchEngine.html. With my compare.pl scripts, this would be particularly easy to do, not just for Google, but also Yahoo etc. There are legality issues (i.e. in relation to each engine's Terms of Service), but for research use this shouldn't be a huge issue.
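Generating the per-hostname queries for this approach is trivial; a minimal sketch is below. It only builds the query strings - actually submitting them (via compare.pl or otherwise) and collecting the result URLs is left out, since that depends on the engine being queried.

```python
def site_queries(hostnames):
    """Build one 'site:' query per distinct hostname.

    Each query, submitted to a search engine, could return up to
    ~1000 URLs for that hostname, so the combined rootset grows
    with the number of known hostnames in the target domain.
    """
    return ["site:%s" % host for host in sorted(set(hostnames))]
```

Duplicate hostnames are collapsed so each engine query is issued at most once.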