The corpus for the FIRE-2008 and FIRE-2010 ad hoc tasks contains Indian-language data (documents in Bengali, Hindi and Marathi; queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.).

Terrier is a great choice of retrieval system for FIRE. However, it needs careful configuration and a few code changes. Below, we detail the code changes and a recommended configuration.

Code Changes

Some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:

For each case where the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch)
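The effect of this change can be illustrated with a small standalone sketch. It is not Terrier code: the class FireTokenCheck and its tokenise method are hypothetical names, and the loop only imitates a tokeniser that treats every rejected character as a separator. The sample string is the Hindi word for "Hindi", written with Unicode escapes. Its vowel signs and anusvara are combining marks, which Character.isLetterOrDigit rejects and Character.isUnicodeIdentifierPart accepts.

```java
import java.util.ArrayList;
import java.util.List;

public class FireTokenCheck {
    // Hypothetical helper, not part of Terrier: splits text into tokens,
    // treating every character the chosen predicate rejects as a separator.
    static List<String> tokenise(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean keep = patched
                    ? Character.isUnicodeIdentifierPart(ch)  // patched test
                    : Character.isLetterOrDigit(ch);         // original test
            if (keep) {
                current.append(ch);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II
        String hindi = "\u0939\u093F\u0902\u0926\u0940";
        System.out.println(tokenise(hindi, false).size()); // prints 2
        System.out.println(tokenise(hindi, true).size());  // prints 1
    }
}
```

With the original test the word is broken into two single-consonant fragments. With the patched test it survives as one token.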

For each case where the code uses either sw.write((char)ch) or sw.write(ch), change it to: if (ch != 0) sw.write((char)ch); The same guard has to be added around calls to sw.append.
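A minimal sketch of the guarded write is shown here, assuming sw is a java.io.StringWriter. The class NulGuardSketch and the method copyWithoutNul are hypothetical names, not Terrier code. One plausible reason for the guard (an assumption, not stated on this page) is that Character.isUnicodeIdentifierPart also accepts "ignorable" characters such as the NUL character (code 0), which the original isLetterOrDigit test rejected.

```java
import java.io.StringWriter;

public class NulGuardSketch {
    // Hypothetical helper: copies character codes into a StringWriter,
    // skipping NUL (0) in the same way as the patched write calls.
    static String copyWithoutNul(int[] chars) {
        StringWriter sw = new StringWriter();
        for (int ch : chars) {
            if (ch != 0) sw.write((char) ch);  // the guarded write
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // NUL passes the patched character test but not the original one.
        System.out.println(Character.isUnicodeIdentifierPart((char) 0)); // prints true
        System.out.println(Character.isLetterOrDigit((char) 0));         // prints false
        System.out.println(copyWithoutNul(new int[] {'a', 0, 'b'}));     // prints ab
    }
}
```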

To make things easier, FIRE has worked with the University of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 and one for Terrier v3.0. To apply a patch, download it to your Terrier folder, apply it with the patch command (patch -p0 < FIRE-tr-etc.patch), and then follow the instructions in the Terrier documentation about recompiling.

Example, Terrier 2:

patch -p0 < FIRE-tr2.2.1-v1.patch
make clean compile

Note that once this patch has been applied, we recommend that you rebuild your indices using trec_terrier.sh -i.

Configuration Properties

The following changes are needed in the terrier.properties file:


#for terrier 3
#for Terrier 2

last edited 2011-06-14 19:53:21 by CraigMacdonald