Blogs08 is a collection of blog posts and feeds used by the TREC Blog track from 2009 onwards. It is the larger sibling of Terrier/Blogs06 and contains many more posts. It can be obtained from the University of Glasgow. Two tasks have been defined on the Blogs08 collection: blog distillation and top news story identification. The topics and qrels can be obtained from:
Terrier can index the permalinks (blog posts only) of the Blogs08 collection with very few configuration changes:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
#remove some more non-content bearing tags
TrecDocTags.skip=DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML
indexing.singlepass.max.postings.memory=500000000
indexer.meta.forward.keys=docno
indexer.meta.forward.keylens=31
#if you want to lookup keys in reverse, disable for faster indexing
indexer.meta.reverse.keys=docno
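Because only the permalink files should be indexed, the list of files handed to Terrier has to exclude the feed and homepage files. The following is a minimal sketch of one way to build that list, not a definitive procedure. It assumes that Terrier reads its file list from etc/collection.spec, that the collection is unpacked under a hypothetical /data/Blogs08 directory, and that permalink files have names starting with "permalinks-"; check these against your copy of the collection before relying on it.

```shell
#!/bin/sh
# Sketch: write a collection specification that lists only permalink files.
# BLOGS08_DIR and the 'permalinks-*' name pattern are assumptions, not
# something the Blogs08 documentation above confirms.
BLOGS08_DIR="${BLOGS08_DIR:-/data/Blogs08}"
SPEC_FILE="${SPEC_FILE:-etc/collection.spec}"

write_collection_spec() {
    # $1: root directory of the collection, $2: output specification file
    find "$1" -type f -name 'permalinks-*' | sort > "$2"
}

# Only write the specification if the (hypothetical) collection path exists.
if [ -d "$BLOGS08_DIR" ]; then
    write_collection_spec "$BLOGS08_DIR" "$SPEC_FILE"
    echo "Listed $(wc -l < "$SPEC_FILE") files in $SPEC_FILE"
fi
```

Run it from the Terrier installation directory so that the relative path etc/collection.spec resolves, or set SPEC_FILE to an absolute path.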
Indexing Blogs08 takes about three days using single-pass indexing (trec_terrier.sh -i -j) on a single modern machine. Give Terrier as much physical memory as possible by setting the TERRIER_HEAP_MEM environment variable.
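As a rough sketch, the heap setting and the indexing command can be combined in a small wrapper script. The install location /opt/terrier and the heap size 8G are illustrative assumptions, and the sketch assumes that trec_terrier.sh lives in the bin directory and passes TERRIER_HEAP_MEM to the JVM as its maximum heap size; pick a value that suits the physical memory of your machine.

```shell
#!/bin/sh
# Hypothetical install location; adjust to where Terrier is unpacked.
TERRIER_HOME="${TERRIER_HOME:-/opt/terrier}"

# Assumed heap size for illustration only; use as much physical memory
# as the machine can spare.
export TERRIER_HEAP_MEM=8G

if [ -x "$TERRIER_HOME/bin/trec_terrier.sh" ]; then
    # -i -j selects single-pass indexing, as described above.
    "$TERRIER_HOME/bin/trec_terrier.sh" -i -j
else
    echo "trec_terrier.sh not found under $TERRIER_HOME/bin" >&2
fi
```

Since the run can take days, it may be worth starting the script inside a terminal multiplexer or with nohup so that it survives a dropped session.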
If you want document URLs stored in your index, set the following properties:
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
See the Terrier documentation on Web-based Terrier for how to build a Web search engine for this collection.