Terrier/FIRE

FIRE Data

The corpora for the FIRE-2008 and FIRE-2010 ad hoc tasks are in Indian languages: documents in Bengali, Hindi and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.

Terrier is a great choice of retrieval system for FIRE. However, it needs careful configuration, and a few code changes for older versions of Terrier. Below, we detail the code changes, and a recommended configuration.

Code Changes

Terrier 3.5 has all the necessary code changes to support FIRE. However, for earlier versions (2.2.1 and 3.0), some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:

For each case where the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch)

For each case where the code uses either sw.write((char)ch) or sw.write(ch), guard the call so that the null character is never written: if (ch != 0) sw.write((char)ch); Make the same change for calls to sw.append.
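The effect of the two changes above can be sketched in a small standalone program. This is only an illustration, not Terrier code: the class name FireTokenCheck and its helper methods are hypothetical, and a plain java.io.StringWriter stands in for the sw object. It shows that Character.isLetterOrDigit rejects the combining marks used in Indic scripts, while Character.isUnicodeIdentifierPart accepts them. It also shows that the latter accepts the null character, which is presumably why the ch != 0 guard is needed.

```java
import java.io.StringWriter;

// Hypothetical demo class; none of these names come from Terrier.
class FireTokenCheck {

    // Character test as used before the change.
    static boolean oldIsTokenChar(int ch) {
        return Character.isLetterOrDigit((char) ch);
    }

    // Character test as used after the change.
    static boolean newIsTokenChar(int ch) {
        return Character.isUnicodeIdentifierPart((char) ch);
    }

    // Copies the input to a StringWriter, skipping the null character,
    // in the same way as the guarded sw.write call.
    static String copyWithoutNul(String input) {
        StringWriter sw = new StringWriter();
        for (int i = 0; i < input.length(); i++) {
            int ch = input.charAt(i);
            if (ch != 0) sw.write((char) ch);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        int virama = 0x094D;    // DEVANAGARI SIGN VIRAMA (non-spacing mark)
        int vowelSign = 0x09BE; // BENGALI VOWEL SIGN AA (spacing combining mark)

        System.out.println(oldIsTokenChar(virama));    // false
        System.out.println(newIsTokenChar(virama));    // true
        System.out.println(oldIsTokenChar(vowelSign)); // false
        System.out.println(newIsTokenChar(vowelSign)); // true

        // The new test also accepts the null character, hence the guard.
        System.out.println(newIsTokenChar(0));         // true
        System.out.println(copyWithoutNul("\u0915\u0000\u0916").length()); // 2
    }
}
```

With the old test, a tokeniser would split a word at every vowel sign or virama; with the new test, these marks stay inside the token.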

To make things easier, FIRE has worked with the University of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 (FIRE-tr2.2.1-v1.patch) and one for Terrier v3.0 (FIRE-tr3.0-v1.patch). To apply a patch, download it to your Terrier folder, apply it with the patch command (patch -p0 < FIRE-tr-etc.patch, substituting the name of the patch file you downloaded), and finally follow the instructions in the Terrier documentation about recompiling. NB: No patches are required for Terrier 3.5.

Example for Terrier 2.2.1:

patch -p0 < FIRE-tr2.2.1-v1.patch
make clean compile

Note that once this patch has been applied, we recommend that you rebuild your indices using trec_terrier.sh -i.

Configuration Properties

Make the following changes to the terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.encoding=utf-8


#for Terrier 2 and Terrier 3.0
trec.collection.class=TRECUTFCollection
string.use_utf=true

#for Terrier 3.5
trec.collection.class=TRECCollection
tokeniser=UTFTokeniser

#for Terrier 3.0 and 3.5
indexer.meta.forward.keylens=50
#for Terrier 2
docno.byte.length=50