TREC Disks 4 & 5
TREC Disks 4 & 5 are the main adhoc TREC test collections that followed Disks 1 & 2.
Indexing Disks 4 & 5 is easy with Terrier. Only one property in terrier.properties needs to be altered from the default created by trec_setup, as follows:
#skip indexing some tags for these corpora
TrecDocTags.process=TEXT,H3,DOCTITLE,HEADLINE,TTL
We do not typically include the Congressional Record when indexing. See Query performance prediction, B. He & I. Ounis, Information Systems 31(7), pp. 585-594, 2006. http://portal.acm.org/citation.cfm?id=1226381
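One way to leave out the Congressional Record is to omit its files from the list of files that Terrier is asked to index (one path per line, as in etc/collection.spec). The following is a minimal sketch in Python rather than part of Terrier itself. It assumes that the Congressional Record files sit under a directory named CR in your copy of the disks; check your own layout and adjust EXCLUDED_DIRS accordingly. The function names are illustrative only. Note that it lists every file it finds, so README or DTD files on the disks would still need to be filtered out separately.

```python
from pathlib import Path

# Assumption: the Congressional Record lives in a directory named "CR".
# Adjust this set to match the layout of your copy of the disks.
EXCLUDED_DIRS = {"CR"}


def build_collection_spec(corpus_root, excluded_dirs=EXCLUDED_DIRS):
    """Return sorted file paths under corpus_root, skipping excluded directories."""
    root = Path(corpus_root)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Directory components of the path, relative to the corpus root,
        # upper-cased so that "cr" and "CR" are treated alike.
        parent_parts = {part.upper() for part in path.relative_to(root).parts[:-1]}
        if parent_parts & excluded_dirs:
            continue
        selected.append(str(path))
    return selected


def write_collection_spec(corpus_root, spec_path):
    """Write one file path per line to spec_path; return the number of files listed."""
    files = build_collection_spec(corpus_root)
    Path(spec_path).write_text("".join(f + "\n" for f in files))
    return len(files)
```

For example, write_collection_spec("/path/to/trec45", "etc/collection.spec") would write the filtered list, where both paths are placeholders for your own locations.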
If the files in your copy of the collection are compressed with .gz extensions, Terrier can read them directly.
On the other hand, if your copy of the collection is compressed with .Z extensions, you will need some additional configuration for Terrier: