Terrier/FIRE

FIRE Data

The corpora of the FIRE-2008 and FIRE-2010 ad hoc tasks contain text in Indian languages: documents in Bengali, Hindi and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.

Terrier is a good choice of retrieval system for FIRE. However, it needs careful configuration and a few code changes. Below, we detail the code changes and a recommended configuration.

Code Changes

Some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:

For each case where the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch)
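The effect of this change can be seen on a single Hindi word. The sketch below is only an illustration and is not taken from the Terrier source: the class name TokenCharDemo and its helper methods are hypothetical. It counts how many characters of the word pass each test. The vowel signs and the virama are combining marks, which Character.isLetterOrDigit rejects and Character.isUnicodeIdentifierPart accepts.

```java
class TokenCharDemo {
    /** Character test used before the change (hypothetical wrapper). */
    static boolean oldTest(int ch) {
        return Character.isLetterOrDigit((char) ch);
    }

    /** Character test used after the change (hypothetical wrapper). */
    static boolean newTest(int ch) {
        return Character.isUnicodeIdentifierPart((char) ch);
    }

    /** Counts how many chars of s pass the chosen test. */
    static int countAccepted(String s, boolean useNewTest) {
        int accepted = 0;
        for (int i = 0; i < s.length(); i++) {
            int ch = s.charAt(i);
            if (useNewTest ? newTest(ch) : oldTest(ch)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // The word "Hindi" in Devanagari: three consonants (U+0939, U+0928,
        // U+0926), two vowel signs (U+093F, U+0940) and one virama (U+094D).
        String word = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println("old test accepts " + countAccepted(word, false)
                + " of " + word.length() + " chars");
        System.out.println("new test accepts " + countAccepted(word, true)
                + " of " + word.length() + " chars");
    }
}
```

With the old test only the three consonants are accepted, so a tokeniser built on it would break the word at every vowel sign and virama.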

For each case where the code uses either sw.write((char)ch) or sw.write(ch), change it to if (ch != 0) sw.write((char)ch); so that null characters are never written. Make the same change wherever sw.append is used.
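A minimal sketch of the guarded write follows. It assumes sw is a java.io.StringWriter, as the variable name suggests. The class NullSafeWriteDemo and its methods are hypothetical and are not part of Terrier. Note that, per the Java API documentation, Character.isUnicodeIdentifierPart also accepts identifier-ignorable characters such as U+0000, which may be why this guard accompanies the previous change.

```java
import java.io.StringWriter;

class NullSafeWriteDemo {
    /** Writes ch to sw unless it is the null character (U+0000). */
    static void writeUnlessNull(StringWriter sw, int ch) {
        if (ch != 0) {
            sw.write((char) ch);
        }
    }

    /** Copies text into a fresh buffer through the guarded write. */
    static String copy(String text) {
        StringWriter sw = new StringWriter();
        for (int i = 0; i < text.length(); i++) {
            writeUnlessNull(sw, text.charAt(i));
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        String dirty = "ab\u0000cd";
        String clean = copy(dirty);
        System.out.println(dirty.length() + " chars in, " + clean.length() + " chars out");
    }
}
```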

To make things easier, FIRE have worked with the Univ. of Glasgow to produce patches. These are attached to this wiki page - one for Terrier v2.2.1, and one for Terrier v3.0. To apply a patch, download it to your Terrier folder, run the patch command from that folder (patch -p0 < FIRE-tr-etc.patch), and finally follow the instructions in the Terrier documentation about recompiling.

Note that once this patch has been applied, we recommend that you rebuild your indices.

Configuration Properties

The following changes are needed in the terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8

# for Terrier 3
indexer.meta.forward.keylens=50
# for Terrier 2
docno.byte.length=50
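
A quick way to confirm that the settings above are in place is to load the file with java.util.Properties and compare the values. The sketch below is a hypothetical helper, not part of Terrier: the class name FirePropertiesCheck is invented, and the path to terrier.properties is taken from the command line. It checks only the five settings common to both versions and leaves out the version-specific length setting.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

class FirePropertiesCheck {
    /** The version-independent settings listed above. */
    static Map<String, String> expected() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("TrecDocTags.doctag", "DOC");
        m.put("TrecDocTags.idtag", "DOCNO");
        m.put("trec.collection.class", "TRECUTFCollection");
        m.put("string.use_utf", "true");
        m.put("trec.encoding", "utf-8");
        return m;
    }

    /** Returns one message per setting that is missing or has another value. */
    static List<String> problems(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : expected().entrySet()) {
            String actual = props.getProperty(e.getKey());
            if (actual == null || !e.getValue().equals(actual.trim())) {
                out.add(e.getKey() + ": expected " + e.getValue() + ", found " + actual);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FirePropertiesCheck <path to terrier.properties>");
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            List<String> found = problems(in);
            if (found.isEmpty()) {
                System.out.println("All FIRE settings present.");
            } else {
                found.forEach(System.out::println);
            }
        }
    }
}
```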