Terrier/DOTGOV2

DOTGOV2

DOTGOV2 is a Web crawl of .gov US government websites. It was used by the TREC Terabyte track (2004-2006) and the Million Query track (2007-2008). It can be obtained from the University of Glasgow. The topics and qrels are available from the TREC website.

Indexing the DOTGOV2 collection with Terrier is straightforward. No properties in terrier.properties need to be changed from the defaults created by trec_setup. However, we recommend indexing with the single-pass strategy (trec_terrier.sh -i -j). On a modern machine, indexing usually takes around 24 hours. A Hadoop MapReduce job can speed up indexing dramatically.

If you want URLs stored in your index, set the following properties:

trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
ignore.low.idf.terms=false
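
The two indexer.meta.forward.* properties above are read as parallel comma-separated lists: each entry in keylens is taken here to be the maximum length for the key at the same position. The following is a minimal sketch of a sanity check for that pairing, written in Python. It is not part of Terrier; the helper names parse_properties and meta_key_lengths are hypothetical, and the parser only handles simple key=value lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def meta_key_lengths(props):
    """Pair each forward meta key with the length listed at the same position."""
    keys = props["indexer.meta.forward.keys"].split(",")
    lens = [int(n) for n in props["indexer.meta.forward.keylens"].split(",")]
    if len(keys) != len(lens):
        raise ValueError("keys and keylens must have the same number of entries")
    return dict(zip(keys, lens))


# The properties from this page, copied verbatim.
SNIPPET = """\
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
ignore.low.idf.terms=false
"""

pairs = meta_key_lengths(parse_properties(SNIPPET))
print(pairs)  # {'docno': 26, 'url': 256}
```

A check like this can catch a mismatched list before starting an indexing run that may take many hours.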

See the Terrier documentation on Web-based Terrier for how to build a Web search engine for this collection.

last edited 2015-06-04 21:30:17 by CraigMacdonald