Labrador, developed locally by CraigMacdonald. Labrador is the in-house crawler used by Terrier to support the development of various intranet search applications.
HOW IT WORKS
Finding as many URLs/sites as possible to start with: Rootsets
Download speed: a balance between downloading as many pages as possible in a timely fashion and avoiding overloading any one server.
Downloading documents from a website that the webmaster may not wish downloaded: see Robots.txt.
Queue ordering: BreadthFirstCrawling is generally thought to give the best coverage, and is preferred to DepthFirstCrawling. However, no crawler can download every document on the Internet, so GoogleCrawler and others may prioritise the URLFrontier, for example by PageRank. See also CrawlingStrategies.
CrawlersTraps: Crawler design needs careful attention so that the crawler does not become 'stuck' in a site that continually generates new URLs, or appears to.
Continuous or batch-mode crawling: does the crawler continually recrawl changing sites, or does it occasionally recrawl to build a new collection or update an existing one?
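The points above can be tied together in a small sketch. This is only an illustration under stated assumptions, not how Labrador, Mercator or GoogleCrawler work: the function name crawl_order, the in-memory links dictionary (standing in for real fetching and link extraction), the a.example and b.example hosts, and the max_depth and delay parameters are all hypothetical. It plans a breadth-first crawl from a rootset, skips URLs disallowed by robots.txt, spaces requests to the same host, and uses a depth limit plus a seen set as a crude guard against crawler traps.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crawl_order(rootset, links, robots_lines, max_depth=3, delay=1.0):
    """Plan a breadth-first crawl; return (url, earliest_fetch_time) pairs.

    links        -- hypothetical dict url -> outgoing urls (replaces fetching)
    robots_lines -- hypothetical dict host -> lines of that host's robots.txt
    """
    # Parse each host's robots.txt rules once, up front.
    parsers = {}
    for host, lines in robots_lines.items():
        parser = RobotFileParser()
        parser.parse(lines)
        parsers[host] = parser

    frontier = deque((url, 0) for url in rootset)  # FIFO queue = breadth-first
    seen = set(rootset)   # never enqueue the same URL twice
    next_free = {}        # host -> earliest time it may be contacted again
    plan = []

    while frontier:
        url, depth = frontier.popleft()
        host = urlsplit(url).netloc

        # Robots.txt: skip URLs the webmaster has disallowed.
        parser = parsers.get(host)
        if parser is not None and not parser.can_fetch("*", url):
            continue

        # Politeness: space requests to one host at least `delay` apart.
        when = next_free.get(host, 0.0)
        next_free[host] = when + delay
        plan.append((url, when))

        # Trap guard: stop expanding links beyond max_depth.
        if depth >= max_depth:
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return plan


# Hypothetical link graph and robots.txt rules, for illustration only.
sample_links = {
    "http://a.example/": ["http://a.example/p1",
                          "http://a.example/private/x",
                          "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://a.example/p2": ["http://a.example/p3"],
    "http://b.example/": ["http://a.example/"],
}
sample_robots = {"a.example": ["User-agent: *", "Disallow: /private/"]}

for url, when in crawl_order(["http://a.example/"], sample_links,
                             sample_robots, max_depth=2):
    print(when, url)
```

In this sample the page under /private/ is never planned, pages on a.example are spaced one time unit apart, and p3 is never reached because it lies beyond the depth limit. Swapping the deque for a priority queue keyed on some page score would give a prioritised URLFrontier instead of plain breadth-first order; a real crawler would also pick a different host rather than wait out the delay.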
EXAMPLES OF CRAWLERS
GoogleCrawler, as used by Google.com
Mercator, as was used by Altavista
The Anatomy of a large-scale hypertextual Web search engine - L Page and S Brin describe the first iteration of the Google search engine, including the crawler
Mercator: A Scalable, Extensible Web Crawler - describes the Altavista crawler Mercator
SearchTools.com has a good 'Everything You Ever Needed to Know When Crawling' checklist at http://www.searchtools.com/robots/robot-checklist.html
Robotstxt.org has information about each version of the robots.txt standard: http://www.robotstxt.org/wc/robots.html
Viewing the Panoptic Search product's configuration options gives good insight into the configuration options of a commercial crawler (as does, to a lesser extent, http://www.spiderline.com/about/spider/).
Sriram Krishnan has an interesting blog post about his crawling experiences - http://dotnetjunkies.com/WebLog/sriram/archive/2004/10/10/28253.aspx
JunghooCho - slides on refresh characteristics of pages, and when to recrawl - http://oak.cs.ucla.edu/~cho/talks/2001/Defense.ppt
O'Reilly's 'HTTP - the Definitive Guide' has a more condensed checklist
O'Reilly's 'Spidering Hacks' is also fairly good, though it is biased more towards screen scraping and RSS feeds than towards traditional crawling for IR needs.
'Mining the Web' has the first chapter dedicated to crawling.
Fully distributed crawler (hierarchical p2p) : http://www.cs.berkeley.edu/~kubitron/courses/cs294-4-F03/slides/loo_krishnamurthy_cooper.pdf
Coverage: http://dollar.biz.uiowa.edu/~pant/Papers/bixmas.pdf - i.e. I should consider running evaluations to determine coverage.