Terrier/Blogs06

Blogs06

Blogs06 is a collection of blog posts and feeds used by the TREC Blog track from 2006 to 2008. It can be obtained from the University of Glasgow. Two tasks have been defined on the Blogs06 collection: opinion finding and blog distillation. The topics and qrels can be obtained from:

Terrier can index the permalinks (blog posts only) of the Blogs06 collection with very few changes. Set the following properties:


TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
#remove some more non-content bearing tags
TrecDocTags.skip=DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML

indexing.singlepass.max.postings.memory=500000000
indexer.meta.forward.keys=docno
indexer.meta.forward.keylens=31
indexer.meta.reverse.keys=docno
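
To make the effect of the tag properties concrete, the following Python sketch mimics what they ask for on a single record: the DOCNO element supplies the document identifier, and the elements listed in TrecDocTags.skip are dropped before the remaining text is indexed. This is only an illustration, not Terrier's parser. The sample record, its identifier, and the extract helper are hypothetical, and real Blogs06 records are much larger.

```python
import re

# Hypothetical, shortened permalink record (not taken from the collection).
SAMPLE = """<DOC>
<DOCNO>BLOG06-EXAMPLE-0001</DOCNO>
<DATE_XML>2005-12-06T00:00:00+0000</DATE_XML>
<FEEDNO>BLOG06-feed-EXAMPLE</FEEDNO>
<PERMALINK>http://example.org/post/1</PERMALINK>
<DOCHDR>
http://example.org/post/1
</DOCHDR>
<html><body>An opinionated post about search engines.</body></html>
</DOC>"""

# Values copied from the TrecDocTags.idtag and TrecDocTags.skip properties.
ID_TAG = "DOCNO"
SKIP_TAGS = "DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML".split(",")


def extract(doc, id_tag=ID_TAG, skip_tags=SKIP_TAGS):
    """Return (docno, text) for one record, ignoring the skipped elements."""
    docno = re.search(rf"<{id_tag}>(.*?)</{id_tag}>", doc, re.S).group(1).strip()
    body = doc
    # Drop the identifier element and every skipped element, content included.
    for tag in [id_tag, *skip_tags]:
        body = re.sub(rf"<{tag}>.*?</{tag}>", " ", body, flags=re.S)
    # Strip the remaining markup and normalise whitespace.
    body = re.sub(r"<[^>]+>", " ", body)
    return docno, " ".join(body.split())


docno, text = extract(SAMPLE)
print(docno)
print(text)
```

In this sketch only the post text survives; the feed number, permalink URL and HTTP header block do not contribute terms to the index.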

If you want document URLs stored in your index, set the following properties instead:

trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
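
The two metadata properties are positional: the n-th entry of indexer.meta.forward.keylens is the maximum length of the n-th key in indexer.meta.forward.keys. A small Python sketch, assuming a simple key=value properties format, can check that the two lists line up and flag values that would not fit. The helper names and the sample record are hypothetical and are not part of Terrier.

```python
PROPS = """
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def meta_limits(props):
    """Pair each metadata key with its configured maximum length."""
    keys = props["indexer.meta.forward.keys"].split(",")
    lens = [int(n) for n in props["indexer.meta.forward.keylens"].split(",")]
    if len(keys) != len(lens):
        raise ValueError("keys and keylens must have the same number of entries")
    return dict(zip(keys, lens))


def too_long(record, limits):
    """Return the keys whose values exceed the configured length."""
    return [key for key, value in record.items() if len(value) > limits[key]]


limits = meta_limits(parse_properties(PROPS))
# Hypothetical record: a short identifier and a deliberately oversized URL.
record = {"docno": "BLOG06-EXAMPLE-0001", "url": "http://example.org/" + "a" * 300}
print(limits)
print(too_long(record, limits))
```

Running a check like this over a sample of the collection before indexing is one way to decide whether 256 characters is enough for the URLs you care about.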

See the Terrier documentation on Web-based Terrier for how to build a Web search engine for this collection.

last edited 2011-06-14 19:49:42 by CraigMacdonald