Terrier/Blogs08

Blogs08 is a collection of blog posts and feeds used by the TREC Blog track from 2009 onwards. The larger sibling of Terrier/Blogs06, it contains many more posts. It can be obtained from the University of Glasgow. Two tasks have been defined on the Blogs08 collection: blog distillation and top news story identification. The topics and qrels can be obtained from:

Terrier can index the permalinks (blog posts only) of the Blogs08 collection with very few configuration changes:


TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
#remove some more non-content bearing tags
TrecDocTags.skip=DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML

indexing.singlepass.max.postings.memory=500000000
indexer.meta.forward.keys=docno
indexer.meta.forward.keylens=31
#needed only if you want to look up documents by docno; remove the next line for faster indexing
indexer.meta.reverse.keys=docno

It takes about three days to index Blogs08 using single-pass indexing (trec_terrier.sh -i -j) on a single modern machine. Give Terrier as much physical memory as possible by setting the TERRIER_HEAP_MEM environment variable.
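
As a rough sketch, the indexing run described above might be started as follows. The heap value 8g is only a placeholder (it assumes the variable accepts a Java-style size such as 8g or 8000M), and the bin/ path assumes the command is run from the Terrier installation directory; adjust both to your machine.

```shell
# Assumption: 8g is a placeholder; choose a value close to the machine's physical memory.
export TERRIER_HEAP_MEM=8g

# Assumption: the current directory is the Terrier installation directory.
if [ -x bin/trec_terrier.sh ]; then
    # -i -j selects single-pass indexing, as noted above
    bin/trec_terrier.sh -i -j
else
    echo "bin/trec_terrier.sh not found; run this from the Terrier installation directory" >&2
fi
```

Since the run takes days, it may be worth starting it inside a detachable session (for example with nohup or screen) so that it survives a dropped terminal.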