Blogs06
Blogs06 is a collection of blog posts and feeds used by the TREC Blog track 2006-2008. It can be obtained from the University of Glasgow. Two tasks have been defined on the Blogs06 collection, namely opinion finding and blog distillation. The topics and qrels can be obtained from:
Terrier can index the permalinks (blog posts only) of the Blogs06 collection with very few changes to the default configuration:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
#remove some more non-content bearing tags
TrecDocTags.skip=DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML
indexing.singlepass.max.postings.memory=500000000
indexer.meta.forward.keys=docno
indexer.meta.forward.keylens=31
indexer.meta.reverse.keys=docno
If you want URLs stored in your index, set the following properties instead of the corresponding ones above:
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
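When a metadata key such as url is added, the entries of indexer.meta.forward.keys and indexer.meta.forward.keylens need to stay aligned, one length per key. The following is a minimal Python sketch of a sanity check for that pairing. It is not part of Terrier: the helper names parse_properties and check_meta_keys are hypothetical, and the parser assumes a simple key=value layout with # comments rather than the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments.

    Assumption: this covers only the plain layout used in this section,
    not escapes or line continuations from the Java properties format.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_meta_keys(props):
    """Pair each forward metadata key with its declared length.

    Raises ValueError if the two comma-separated lists differ in size.
    """
    keys = props["indexer.meta.forward.keys"].split(",")
    lens = [int(n) for n in props["indexer.meta.forward.keylens"].split(",")]
    if len(keys) != len(lens):
        raise ValueError(
            f"{len(keys)} metadata keys but {len(lens)} key lengths"
        )
    return dict(zip(keys, lens))


# The URL-enabled properties from this section, copied as a string.
BLOGS06_WITH_URLS = """\
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
"""

meta = check_meta_keys(parse_properties(BLOGS06_WITH_URLS))
print(meta)  # {'docno': 31, 'url': 256}
```

Running a check like this before indexing can catch a forgotten length entry early, which is cheaper than discovering the mismatch partway through indexing a large collection.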
See the Terrier documentation on Web-based Terrier for how to build a Web search engine for this collection.