Terrier/FIRE

FIRE Data

The corpora of the FIRE-2008 and FIRE-2010 ad hoc tasks are in Indian languages: documents in Bengali, Hindi, and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.

Terrier is a good choice of retrieval system for FIRE, but it needs careful configuration and a few code changes. Below, we detail the code changes and a recommended configuration.

Code Changes

Some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:

For each case where the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch)

For each case where the code uses either sw.write((char)ch) or sw.write(ch), change it to if (ch != 0) sw.write((char)ch); and make the equivalent change for each call to sw.append.
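The effect of the two changes can be seen in a small standalone sketch. This is not Terrier's tokeniser: the class FireTokenSketch and the method tokenize are hypothetical names, and treating sw as a java.io.StringWriter is an assumption. It shows that Character.isLetterOrDigit rejects the combining marks (vowel signs, virama) that Indian scripts rely on, so words are split apart, while Character.isUnicodeIdentifierPart accepts them. It also shows that isUnicodeIdentifierPart accepts the NUL character, which the ch != 0 guard then filters out.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Terrier code: a toy tokeniser that mimics
// the character test and the sw.write call before and after the change.
public class FireTokenSketch {

    static List<String> tokenize(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        StringWriter sw = new StringWriter(); // assumption: sw is a StringWriter
        for (int i = 0; i < text.length(); i++) {
            int ch = text.charAt(i);
            boolean keep = patched
                    ? Character.isUnicodeIdentifierPart((char) ch) // after the change
                    : Character.isLetterOrDigit((char) ch);        // before the change
            if (keep) {
                // After the change, NUL is accepted by the test above,
                // so it has to be skipped explicitly.
                if (!patched || ch != 0) sw.write((char) ch);
            } else if (sw.getBuffer().length() > 0) {
                // A rejected character ends the current token.
                tokens.add(sw.toString());
                sw.getBuffer().setLength(0);
            }
        }
        if (sw.getBuffer().length() > 0) tokens.add(sw.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // The word "Hindi" in Devanagari: three consonants plus
        // two vowel signs and one virama (all combining marks).
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println(tokenize(hindi, false).size()); // 3: split at each mark
        System.out.println(tokenize(hindi, true).size());  // 1: word kept whole
    }
}
```

With the original test, the single Hindi word breaks into three fragments with its vowel signs dropped, which is why an index built before the change is not usable for these languages.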

To make things easier, FIRE has worked with the University of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 (FIRE-tr2.2.1-v1.patch) and one for Terrier v3.0 (FIRE-tr3.0-v1.patch). To apply a patch, download it to your Terrier folder, apply it with the patch command (patch -p0 < FIRE-tr-etc.patch), and then follow the instructions in the Terrier documentation about recompiling.

Example for Terrier 2:

patch -p0 < FIRE-tr2.2.1-v1.patch
make clean compile

Note that once the patch has been applied, we recommend that you rebuild your indices with trec_terrier.sh -i.

Configuration Properties

Make the following changes in the terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8

#for Terrier 3
indexer.meta.forward.keylens=50
#for Terrier 2
docno.byte.length=50
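
Because a missing or mistyped property is easy to overlook, a small check such as the following can help before indexing. It is only a sketch: the class FirePropertiesCheck and its methods are hypothetical and not part of Terrier. It uses java.util.Properties to compare a properties text against the five version-independent settings listed above. The version-specific key length setting is left out on purpose.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of Terrier: reports which of the
// recommended FIRE settings are missing or have a different value.
public class FirePropertiesCheck {

    // The version-independent settings recommended on this page.
    static final Map<String, String> EXPECTED = new LinkedHashMap<>();
    static {
        EXPECTED.put("TrecDocTags.doctag", "DOC");
        EXPECTED.put("TrecDocTags.idtag", "DOCNO");
        EXPECTED.put("trec.collection.class", "TRECUTFCollection");
        EXPECTED.put("string.use_utf", "true");
        EXPECTED.put("trec.encoding", "utf-8");
    }

    // Parse properties-file text (for a real file, pass a FileReader
    // to Properties.load instead).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Return the keys that are absent or differ from the recommendation.
    static List<String> problems(Properties props) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> e : EXPECTED.entrySet()) {
            String actual = props.getProperty(e.getKey());
            if (actual == null || !actual.trim().equals(e.getValue())) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Properties props = parse("TrecDocTags.doctag=DOC\ntrec.encoding=latin1\n");
        System.out.println("Needs attention: " + problems(props));
    }
}
```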