WT2G
WT2G is a general Web crawl, used by the TREC 1999 Web track. It can be obtained from the
University of Glasgow. The topics and qrels are available from the
TREC website.
Indexing the WT2G collection with Terrier is straightforward: the default terrier.properties file created by trec_setup needs no changes. If you want URLs stored in your index, set the following properties:
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
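If you would rather not edit the file by hand, the properties above can be applied with a small script. The following is only a sketch and is not part of Terrier: the helper `set_properties` and the path `etc/terrier.properties` are assumptions made for illustration, and the parser handles only simple `key=value` lines, not the full Java properties syntax.

```python
from pathlib import Path

# Property values copied verbatim from the text above.
URL_PROPERTIES = {
    "trec.collection.class": "TRECWebCollection",
    "indexer.meta.forward.keys": "docno,url",
    "indexer.meta.forward.keylens": "26,256",
}


def set_properties(text, updates):
    """Return properties text in which every key in `updates` is set exactly once.

    Hypothetical helper: assumes plain `key=value` lines and `#` comments.
    """
    seen = set()
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        is_setting = bool(stripped) and not stripped.startswith("#") and "=" in stripped
        key = stripped.split("=", 1)[0].strip() if is_setting else None
        if key in updates:
            if key not in seen:
                out.append(f"{key}={updates[key]}")  # overwrite the first occurrence
                seen.add(key)
            # later duplicates of the same key are dropped
        else:
            out.append(line)  # keep comments, blank lines, and other keys as they are
    # Append any keys that were not already present in the file.
    out.extend(f"{k}={v}" for k, v in updates.items() if k not in seen)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Assumed location; adjust to wherever trec_setup wrote the file.
    path = Path("etc/terrier.properties")
    if path.exists():
        path.write_text(set_properties(path.read_text(), URL_PROPERTIES))
```

Running the script twice leaves the file unchanged after the first run, so it is safe to re-apply before each indexing run.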
See the Terrier documentation on
Web-based Terrier for how to build a Web search engine for this collection.