Crawler

Alternative names: Spider, Robot

DESCRIPTION

A crawler is used in WebIR to retrieve documents from the Internet (primarily the WorldWideWeb) and save them to a collection, ready for an IR system to index.

HOW IT WORKS

Crawlers download web pages from the Internet, extract the links from the HTML, and queue the URLs they find (onto the URLFrontier) to be fetched in turn.
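
As an illustration, the following is a minimal sketch of that fetch-extract-enqueue loop in Python, using only the standard library. The seed list is a hypothetical example, and politeness controls are deliberately omitted here (see ISSUES below).

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href attribute of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    seeds = ["http://example.com/"]   # hypothetical root set
    frontier = deque(seeds)           # the URLFrontier: URLs waiting to be fetched
    seen = set(seeds)                 # avoids queueing the same URL twice

    while frontier:
        url = frontier.popleft()      # FIFO order gives breadth-first crawling
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                  # skip pages that fail to download
        # ... save `html` to the collection for the IR system to index ...
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)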

ISSUES

  • Finding as many URLs/sites as possible to start from: see Rootsets.

  • Download speed: a balance between downloading as many pages as possible in a timely fashion and not overloading any one server (see the politeness sketch after this list).

  • Downloading documents from a website that the webmaster may not wish to have downloaded: see Robots.txt (also illustrated in the sketch after this list).

  • Queue ordering: BreadthFirstCrawling is generally thought to give the best coverage, and is preferred to DepthFirstCrawling. However, a crawler cannot download every document on the Internet, so GoogleCrawler and others may prioritise the URLFrontier, by PageRank say (see the Frontier sketch under Data Structures below). See also CrawlingStrategies.

  • CrawlersTraps: crawlers need careful design so that they do not become 'stuck' in a site that continually generates new URLs, or appears to.

  • Continuous or batch-mode crawling: does the crawler continually recrawl changing sites, or does it occasionally recrawl to build a new collection or update an existing one?
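
The two politeness issues above, limiting download speed per server and honouring Robots.txt, can be sketched in Python with the standard library's urllib.robotparser. The one-second minimum delay and the user-agent name are illustrative assumptions, not standards.

    import time
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    ROBOTS = {}         # host -> parsed robots.txt for that host
    LAST_FETCH = {}     # host -> time of the last request to that host
    MIN_DELAY = 1.0     # illustrative: seconds to wait between requests to one host

    def allowed(url, agent="ExampleCrawler"):
        """Return True if the site's robots.txt permits fetching this URL."""
        parts = urlsplit(url)
        if parts.netloc not in ROBOTS:
            rp = RobotFileParser("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
            try:
                rp.read()
            except OSError:
                pass    # robots.txt unreadable: can_fetch() then errs on the side of caution
            ROBOTS[parts.netloc] = rp
        return ROBOTS[parts.netloc].can_fetch(agent, url)

    def wait_politely(url):
        """Block until MIN_DELAY has passed since the last request to this host."""
        host = urlsplit(url).netloc
        elapsed = time.time() - LAST_FETCH.get(host, 0.0)
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)
        LAST_FETCH[host] = time.time()

A production crawler would typically also honour a site's Crawl-delay directive and interleave requests to many different hosts, so that politeness towards one server does not stall the whole crawl.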

EXAMPLES OF CRAWLERS

  • ["Labrador"], developed locally by CraigMacdonald. ["Labrador"] is the in-house crawler used by ["Terrier"] to support the development of various intranet search applications.

FUNDAMENTAL ARCHITECTURES

Data Structures:
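
As a sketch of the core structures, assuming a breadth-first policy: the URLFrontier as a FIFO queue, plus a set of already-seen URLs for duplicate elimination, with a depth cap as one crude guard against crawler traps. All names here are illustrative.

    from collections import deque

    class Frontier:
        def __init__(self, max_depth=10):
            self.queue = deque()        # FIFO order gives BreadthFirstCrawling
            self.seen = set()           # every URL ever enqueued, for duplicate elimination
            self.max_depth = max_depth  # crude defence against crawler traps

        def push(self, url, depth):
            """Enqueue a URL unless it is a duplicate or nested too deeply."""
            if depth > self.max_depth or url in self.seen:
                return
            self.seen.add(url)
            self.queue.append((url, depth))

        def pop(self):
            """Return the next (url, depth) to fetch, or None when the crawl is done."""
            return self.queue.popleft() if self.queue else None

Replacing the deque with a priority queue (e.g. heapq) keyed by a score such as estimated PageRank turns this into the prioritised frontier mentioned under ISSUES.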

FURTHER INFORMATION

Papers

  • The Anatomy of a Large-Scale Hypertextual Web Search Engine - S. Brin and L. Page describe the first iteration of the Google search engine, including its crawler

  • Mercator: A Scalable, Extensible Web Crawler - A. Heydon and M. Najork describe Mercator, the crawler used by AltaVista

Websites

Books

  • O'Reilly's 'HTTP: The Definitive Guide' has a more condensed checklist of crawling issues

  • O'Reilly's 'Spidering Hacks' is also fairly good, though it is biased more towards screen scraping and RSS feeds than towards the traditional crawling needed for IR.

  • 'Mining the Web' dedicates its first chapter to crawling.

Recent Research
