TREC Disks 4 & 5

TREC Disks 4 & 5 are the main adhoc TREC test collections that followed Disks 1 & 2.

Indexing Disks 4 & 5 is easy with Terrier. Only one property in terrier.properties needs to be altered from the default created by trec_setup, as follows:

#only index the text within these tags for these corpora
TrecDocTags.process=TEXT,H3,DOCTITLE,HEADLINE,TTL

We do not typically include the Congressional Record when indexing. See Query performance prediction, B. He & I. Ounis, Information Systems 31(7), pp. 585-594, 2006. http://portal.acm.org/citation.cfm?id=1226381
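
One way to leave out the Congressional Record is to filter the list of files before indexing. The following is a minimal sketch, not part of Terrier. It assumes the files to index are listed one path per line (with optional # comment lines), as in a collection.spec file, and that the Congressional Record files sit under a directory named CR. Both the directory name and the example paths are assumptions, so check them against your own copy of the disks.

```python
from pathlib import PurePosixPath


def drop_congressional_record(spec_lines, cr_dir_name="cr"):
    """Return the lines whose path has no directory named cr_dir_name.

    cr_dir_name is an assumed directory name; matching is case-insensitive.
    """
    kept = []
    for line in spec_lines:
        path = line.strip()
        # Leave blank lines and '#' comment lines untouched.
        if not path or path.startswith("#"):
            kept.append(line)
            continue
        # Look only at directory components, not at the file name itself.
        parents = [part.lower() for part in PurePosixPath(path).parts[:-1]]
        if cr_dir_name.lower() not in parents:
            kept.append(line)
    return kept


# Hypothetical paths, for illustration only.
example_spec = [
    "#add the files to index",
    "/data/disk4/CR/HFILES/CR93H1.gz",
    "/data/disk4/FT/FT911/FT911_1.gz",
    "/data/disk5/LATIMES/LA010189.gz",
]
filtered_spec = drop_congressional_record(example_spec)
for entry in filtered_spec:
    print(entry)
```

Matching on whole directory components rather than on a substring avoids accidentally dropping files whose names merely contain the letters "cr".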

Compressed Files

If your copy of the collection is compressed with .gz extensions, Terrier can read it without further configuration.

On the other hand, if your copy of the collection is compressed with .Z extensions, you will need some additional configuration for Terrier (version 5.2 onwards):

terrier.mvn.coords=org.apache.commons:commons-compress:1.18
files.mappings=Z:org.apache.commons.compress.compressors.z.ZCompressorInputStream:null
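
To see which of the two cases applies to a given copy of the collection, it can help to count the file extensions in the list of files to index. This is a small sketch under the same assumption as before: one path per line, with # marking comment lines. The function name and the example paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_suffixes(spec_lines):
    """Count the final file extension of every path listed in spec_lines."""
    counts = Counter()
    for line in spec_lines:
        path = line.strip()
        # Skip blank lines and '#' comment lines.
        if not path or path.startswith("#"):
            continue
        # .suffix is the last extension, e.g. '.gz' or '.Z' ('' if none).
        counts[PurePosixPath(path).suffix] += 1
    return counts


# Hypothetical paths, for illustration only.
suffix_counts = count_suffixes([
    "#add the files to index",
    "/data/disk4/FT/FT911/FT911_1.Z",
    "/data/disk4/FT/FT911/FT911_2.Z",
    "/data/disk5/LATIMES/LA010189.gz",
])
print(dict(suffix_counts))
```

If any .Z entries show up, the two properties above are needed; if every file ends in .gz, they are not.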

last edited 2019-04-29 16:05:30 by CraigMacdonald