Alternative names: Spider, Robot
DESCRIPTION
A crawler is primarily used in WebIR for retrieving documents from the Internet (primarily the WorldWideWeb) and saving them to a collection, ready for an IR system to index.
HOW IT WORKS
Crawlers download web pages from the Internet, extract the links from the HTML, and queue the URLs they find onto the URLFrontier to be fetched in turn.
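The download, extract and queue cycle described above can be sketched as follows. This is a minimal illustration rather than the design of any particular crawler: the names `LinkExtractor`, `extract_links` and `crawl` are made up for this page, and `fetch` is an assumed callable that returns the HTML for a URL (or None on failure), so the loop can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Fetch pages starting from the seeds; return {url: html}."""
    frontier = deque(seeds)   # URLs waiting to be fetched (FIFO order)
    seen = set(seeds)         # URLs already queued, to avoid refetching
    collection = {}
    while frontier and len(collection) < max_pages:
        url = frontier.popleft()
        html = fetch(url)     # assumed: returns None when the fetch fails
        if html is None:
            continue
        collection[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collection
```

For a quick trial, `fetch` can be the `get` method of a dict that maps URLs to HTML strings. A real crawler would replace it with an HTTP client and add the politeness and Robots.txt checks discussed under ISSUES.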
ISSUES
Finding as many URLs/sites as possible to start with: Rootsets
Download speed: a crawler must balance downloading as many pages as possible in a timely fashion against overloading any one server.
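One common way to avoid overloading a server is to enforce a minimum delay between requests to the same host. The sketch below assumes a fixed per-host delay. The class name `PolitenessPolicy`, the 2 second default and the injectable `clock` parameter are illustrative choices, not taken from any crawler named on this page.

```python
import time
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay   # seconds between hits on one host
        self.clock = clock           # injectable so it can be tested
        self.next_allowed = {}       # host -> earliest time of next request

    def wait_time(self, url):
        """Seconds the crawler should still wait before fetching url."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Call after each request so the host's delay restarts."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = self.clock() + self.min_delay
```

A fetch loop would call `wait_time(url)` before each request, then either sleep for that long or move on to a URL from a different host, and call `record_fetch(url)` afterwards.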
Downloading documents from a website that the webmaster may not wish downloaded: see Robots.txt.
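Python's standard library includes a robots.txt parser, so a check might look like the sketch below. The robots.txt content, the user agent name `ExampleBot` and the helper `build_parser` are made up for illustration. The rules are parsed from a string here, whereas a real crawler would download /robots.txt from each host first.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# True when the rules permit this user agent to fetch the URL.
allowed_public = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/x.html")
```

With the rules above, `allowed_public` is true and `allowed_private` is false, so the crawler would skip the second URL.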
Queue ordering: BreadthFirstCrawling is generally thought to give the best coverage, and is preferred to DepthFirstCrawling. However, a crawler cannot download every document on the Internet, so GoogleCrawler and others may prioritise the URLFrontier, by PageRank, say. See also CrawlingStrategies.
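A breadth-first frontier is simply a FIFO queue. A prioritised frontier can be sketched with a heap, as below. The class `PriorityFrontier` and its `score` argument are hypothetical: the score stands in for whatever importance estimate is used (PageRank, say), and how that estimate is computed is outside this sketch.

```python
import heapq
import itertools


class PriorityFrontier:
    """URL queue that always hands back the highest-scored URL next."""

    def __init__(self):
        self._heap = []
        # Tie-breaker: URLs with equal scores come out in insertion order.
        self._counter = itertools.count()

    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Pushing every URL with the same score makes this behave like a breadth-first queue, which shows that the two orderings differ only in how the score is chosen.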
CrawlersTraps: Crawler design needs careful attention so that the crawler does not become 'stuck' in a site that continually generates new URLs, or appears to.
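There is no complete defence against traps, but simple URL heuristics catch many of them. The function below is a sketch under assumed thresholds: the name `looks_like_trap` and the limits on URL length, path depth and repeated path segments are illustrative values, not figures from the literature.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_length=256, max_depth=12, max_repeats=3):
    """Heuristic check for URLs typical of endlessly generated pages."""
    # Very long URLs often come from sites that keep appending to the path.
    if len(url) > max_length:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Unusually deep paths suggest an endless directory structure.
    if len(segments) > max_depth:
        return True
    # The same segment occurring max_repeats times or more, e.g. /a/b/a/b/a/b.
    if segments and max(Counter(segments).values()) >= max_repeats:
        return True
    return False
```

A crawler would typically combine a check like this with a cap on the number of pages fetched per host, since a trap can also generate URLs that look perfectly ordinary.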
Continuous or batch-mode crawling: does the crawler continually recrawl changing sites, or does it occasionally recrawl to build a new collection or update an existing one?
EXAMPLES OF CRAWLERS
GoogleCrawler, as used by Google.com
Mercator, as was used by Altavista
Labrador, developed locally by CraigMacdonald. Labrador is the in-house crawler used by Terrier to support the development of various intranet search applications.
FUNDAMENTAL ARCHITECTURES
Data Structures:
URLFrontier
URLSeen
Robots.txt cache
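The URLFrontier is the queue of URLs still to be fetched. The other two structures can be sketched as follows, under several assumptions: `canonicalize`, `URLSeen` and `RobotsCache` are illustrative names, the canonicalisation rules are deliberately simple (they drop fragments, default ports and any user:password part, and do not handle IPv6 hosts), and `fetch_robots` is an assumed callable that returns the robots.txt text for a host.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def canonicalize(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Keep the port only when it is not the scheme's default.
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class URLSeen:
    """Set of URLs already queued or fetched, compared in canonical form."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Return True if the URL is new, False if it was seen before."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class RobotsCache:
    """Parses each host's robots.txt once and reuses the result."""

    def __init__(self, fetch_robots):
        self._fetch = fetch_robots   # assumed: host -> robots.txt text
        self._parsers = {}

    def allowed(self, user_agent, url):
        host = urlsplit(url).netloc
        if host not in self._parsers:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return self._parsers[host].can_fetch(user_agent, url)
```

At web scale the seen-set will not fit in memory as a plain Python set, so large crawlers store fingerprints of URLs on disk instead. The in-memory version here only shows the interface.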
FURTHER INFORMATION
Papers
The Anatomy of a Large-Scale Hypertextual Web Search Engine - S Brin and L Page describe the first iteration of the Google search engine, including the crawler
Mercator: A Scalable, Extensible Web Crawler - describes the Altavista crawler Mercator
Websites
SearchTools.com has a good 'Everything You Ever Needed to Know When Crawling' at http://www.searchtools.com/robots/robot-checklist.html
Robotstxt.org has information about each version of the robots.txt standard: http://www.robotstxt.org/wc/robots.html
Viewing the Panoptic Search product's configuration options gives good insight into the configuration options of a commercial crawler (and, to a lesser extent, so does http://www.spiderline.com/about/spider/).
Sriram Krishnan has an interesting blog post about his crawling experiences - http://dotnetjunkies.com/WebLog/sriram/archive/2004/10/10/28253.aspx
JunghooCho - slides on refresh characteristics of pages, and when to recrawl - http://oak.cs.ucla.edu/~cho/talks/2001/Defense.ppt
Books
O'Reilly's 'HTTP - the Definitive Guide' has a more condensed checklist
O'Reilly's 'Spidering Hacks' is also fairly good, though it is biased more towards screen scraping and RSS feeds than towards traditional crawling for IR needs.
'Mining the Web' has the first chapter dedicated to crawling.
Recent Research
Fully distributed crawler (hierarchical p2p): http://www.cs.berkeley.edu/~kubitron/courses/cs294-4-F03/slides/loo_krishnamurthy_cooper.pdf
Coverage: http://dollar.biz.uiowa.edu/~pant/Papers/bixmas.pdf - i.e. I should consider running evaluations to determine coverage.